Introduction 

Trichoderma reesei is a powerful platform for producing cocktails of native enzymes, which are used in the paper, textile, and feed industries as well as in the second-generation bioethanol process. More recently, its high secretion capacity, ease of cultivation, and classification as a Generally Recognized as Safe (GRAS) microorganism have made it a suitable choice for the production of heterologous proteins. These heterologous proteins can be either fungal or bacterial enzymes that modify the composition of the original cocktails (Sun et al. 2018; Arai et al. 2023; Wohlschlager et al. 2021) or high-value pharmaceutical proteins (Landowski et al. 2016; Jiang et al. 2019; Arai et al. 2023). Currently, the improvement of heterologous protein expression in T. reesei relies on upgraded strains, the development of promoters, and the construction of fusion proteins (Nevalainen and Peterson 2014; Singh et al. 2015). However, productivity and titers remain a major challenge for the industrialization of these processes.

One way to increase productivity is to integrate multiple copies of the gene of interest. This approach has been mainly explored in yeasts such as Saccharomyces cerevisiae (Sakai et al. 1991; Qi et al. 2022), Yarrowia lipolytica (Novikova et al. 2021), Pichia pastoris (Deng et al. 2002), Kluyveromyces lactis (Oguro et al. 2015), and Lipomyces starkeyi (Sakai et al. 1990). Most of these multi-integration systems rely on targeting sequences present in multiple copies in the genome (Zheng et al. 2022), on iterative transformations with marker recycling (Jensen et al. 2014), on a modified selective marker to screen for transformants displaying multicopy integration (Semkiv et al. 2016), or on the introduction of a genetic construct carrying multiple copies of the gene (Zheng et al. 2014).

Cases of unintentional multicopy integration have been described in filamentous fungi, but only a few deliberate approaches have been implemented (Takeda et al. 2018; Plüddemann and van Zyl 2003). Araki et al. (2021) developed a multicopy integration strategy in Aspergillus sojae that relies on the construction of an attenuated selection marker, reporting 18 to 46 integrated copies depending on the target gene. In T. reesei, increased β-carotene production has been achieved by iterative integration of genes encoding key enzymes of β-carotene synthesis, but the copy number was below three and the impact on productivity remained limited (Li et al. 2023).

In a previous study, flow cytometry was used to measure the fluorescence of germinating conidia from transformants expressing an enhanced green fluorescent protein (eGFP) under the control of various promoters ectopically integrated into the genome of the T. reesei strain Cl847 (Mathis et al. 2020). Among the nine screened promoters, transformants with the tef1 promoter (Nakari-Setälä and Penttilä 1995) exhibited the highest egfp expression level and the fastest kinetics of fluorescence production. The tef1 promoter thus appeared to be the only one compatible with cytometry, that is, giving early and strong fluorescence during the first steps of germination. Interestingly, a wide range of egfp expression levels was also observed among these transformants. We therefore decided to probe the transgene integration sites in these transformants to identify a variety of sites enabling the modulation of the expression level of a given target gene. A similar strategy was previously applied by Qin et al. (2018) with the random integration of an expression cassette comprising genes encoding a lipase and the fluorescent protein DsRed1 in T. reesei, but only the two strains with the highest expression of both transgenes were investigated to identify the integration sites.

Strikingly, our analysis revealed a single integration site with multicopy insertion of the expression plasmid in all strains analyzed. The genomic insertion site is located upstream of the tef1 locus, with a copy number ranging from 5 to 11. This result suggests that the use of the tef1 promoter may favor multicopy insertion and therefore promote higher expression of heterologous genes.

Materials and methods

Strains and culture conditions

The T. reesei hyperproducer mutant Cl847 (Durand et al. 1988) used in this study was obtained from several rounds of mutagenesis of the RutC30 strain (ATCC 56765). Cl847 is available at the Collection Nationale de Cultures de Microorganismes (CNCM) at the Institut Pasteur in Paris, France (CNCM MA 6-10W). Transformants were cultivated in liquid PD (potato dextrose) or on solid PDA (potato dextrose agar) at 30 °C using frozen spores as inoculum. After transformation (Penttilä et al. 1987), three purification steps were carried out (i.e., three specific propagation and conidiation steps), followed by cryopreservation (PD glycerol 15%) and another propagation and conidiation step. This process, covering four propagations, allows us to check the stability of egfp expression over a long period of time.

For the germination time course, suspensions of fresh spores or of spores from cryotubes stored in PD plus glycerol 15% were propagated in PD medium at 25 °C, 125 rpm for 2 to 4 h, then kept static at 15 °C overnight, before returning to 25 °C for 6 h (Mathis et al. 2020). For each transformant, three independent sub-cultures of these conidial suspensions were grown in PD medium for germination and analyzed by flow cytometry.

Flow cytometry acquisitions

All analyses were conducted using a Cyflow Space cytometer, equipped with an MLS Blue 480–50 V2 laser emitting at 488 nm with a power of 50 mW (SYSMEX, Kobe, Japan). The optical detector range for FL1 (green fluorescence channel) was 520/20. Acquisition conditions were as described by Mathis et al. (2020). All flow cytometry (FCM) assays and quantifications were performed as technical triplicates on independent cultures (biological triplicates). FCM parameters were as follows: speed 6, FSC (forward scatter) 125 log3, SSC (side scatter) 200 log3, FL1 450 log4. Once defined, the cytometer settings were maintained throughout the study and for all samples. The use of controls (calibrated beads for size or non-fluorescent controls for gating) ensures that fluorescence and size responses are repeatable and comparable between samples. Fluorescence measurements are provided by the instrument after converting the number of photons captured into an electrical signal that can be interpreted by the electronics.

Genome sequencing, sequence alignment, and analyses

Genomic DNA for next-generation sequencing (NGS) was extracted according to the protocol of the Joint Genome Institute (https://1000.fungalgenomes.org/documents/Martin_genomicDNAextraction_AK051010.pdf). Library preparation and Illumina sequencing in 2 × 150 bp paired-end read mode were performed by Eurofins MGW (https://eurofinsgenomics.eu/) for the 8 transformants. Reads were paired and trimmed using Geneious Prime® (Biomatters Ltd, Auckland, New Zealand), and their quality was assessed by FastQC analysis performed on the Galaxy platform (The Galaxy Community 2024). The Cl847 genome is not available in the literature, unlike that of RutC30 (ATCC 56765), its closest relative. Mapping was achieved using the Geneious mapper on a RutC30 genome that was reconstructed in silico from QM6a (Li et al. 2017) by inserting previously identified chromosomal rearrangements and mutations (Vitikainen et al. 2010; Le Crom et al. 2009; Koike et al. 2013). Whole genome alignments were performed using Geneious Prime (Kearse et al. 2012) with the Geneious mapper (Geneious Prime®, Biomatters Ltd) configured for “medium/low sensitivity” and “deletion and structural variation” settings. Average coverage ranges from 46 to 61, and pairwise identity exceeds 99% for all strains.

Four strains (E1, E3, E4, and E5) were also sequenced using long-read sequencing technologies. DNA libraries for long-read sequencing were prepared with the Ligation kits LSK109 and NBD104 (Oxford Nanopore Technologies, Oxford, UK) according to the manufacturer’s recommendations. The library was sequenced on a FLO-MIN106 (R9.4.1) flowcell on a GridION instrument (Oxford Nanopore Technologies, Oxford, UK). Basecalling, demultiplexing, and trimming were performed with Guppy 3.2.6 (Oxford Nanopore Technologies, Oxford, UK). Approximately 200,000 reads giving a total base number of 500 million were obtained per library with an average read length between 10 and 12.5 kb. After quality filtering, around 96% of the reads were kept for the following analysis. Sequencing data from the Illumina and GridION platforms are available at NCBI under BioProject number PRJNA1032401.

Results

Kinetics of fluorescence during germination in independent transformants

In a previous study, transformants of the Cl847 strain obtained by ectopic integration of egfp under the control of the tef1 promoter (Nakari-Setälä and Penttilä 1995) and the cbh1 terminator showed a high level of fluorescence in conidia that increased with germination time. Cl847 was thus kept as the strain of interest in the present study, also because of its high relevance for industrial purposes, in order to maximize the technological readiness level (TRL) of the results. We decided to take advantage of this feature to assess the variability among 9 independent transformants (E1 to E9) compared to the reference strain Cl847 by following the fluorescence kinetics from time point 0 (T0h) to 24 h (T24h) by flow cytometry (FCM). The gating parameters allowing a relevant discrimination between fluorescent and auto-fluorescent spores were already established by Mathis et al. (2020). Gating regions and cytograms at 0, 16, and 24 h are shown in Supplementary data, Fig. S1. The behavior of the Cl847 control strain has been previously described in Mathis et al. (2020), i.e., a profile with an increase in the size of the events associated with the swelling of the conidia due to the germination process (FSC size axis) correlated with a slight increase in autofluorescence (FL1 axis).

All transformants displayed a pattern similar to that of the control strain over time along the FSC axis. In contrast to the control, their conidial cloud is already distributed at T0h between the non-fluorescent (S) and fluorescent (F) spores, with 43 to 74% of the total population present in the F gate depending on the transformant; E5 (74%) and E2 (67%) have the highest proportion of spores in F at T0h. At T16h and T24h, the cytograms highlight a general increase in fluorescence for all transformants, but different profiles of homogeneity, spread, and intensity are observed. Most of the transformants (E1, E2, E4-E6, E8, E9) have a homogeneous population that is almost entirely detected in the F acquisition region. In contrast, the spore populations of E3 and E7 split in two during the time course, with a plume in the AF region for E3 and two equivalent populations (50/50) present from T0h for E7. The heterogeneity observed in E7 suggests that it is a heterokaryotic strain. The genetically different nuclei of E7 probably originate from independent integration events leading to different levels of egfp expression. This transformant was excluded from the following analysis.

Fluorescence quantification over time highlights differences between transformants

Kinetic fluorescence quantification was performed according to Mathis et al. (2020), measuring the ratio of fluorescence intensity to spore size (F/S) using the events in the SF gating region (Fig. 1).

Fig. 1 Evolution of the ratio of fluorescence to spore size over time for pTEF-EGFP transformants E1 to E9 (excluding E7). Comparison with Cl847 controls. Events in the SF gating region were used for measurement. Averages obtained based on triplicates

As expected, Cl847 exhibits a low and constant basal autofluorescence (> 0.1) throughout the time course, confirming that the fluorescence intensity of the transformants is due to egfp expression. At T0h, the F/S ratio of the transformants and the control strain is similar, whereas a significant signal difference from the background is observed for three strains (E2, E4, and E9) at T2h and for all strains at T16h. Interestingly, at T24h, distinct ratio levels are detected depending on the strain, with F/S values ranging from 2.8 (E3) to 6.9 (E4), i.e., a factor of 2.5. The ranking of the transformants by size-normalized fluorescence emission at T24h is E3 < E5 and E9 < E1 and E8 < E2 and E6 < E4.
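As an illustration of the size-normalized fluorescence used above, the following minimal sketch computes a mean F/S ratio over gated events and averages it across triplicates. The event values, the function names, and the choice of averaging per-event ratios (rather than dividing mean fluorescence by mean size) are assumptions made for the example, not the procedure of Mathis et al. (2020). Only the two F/S values used for the fold range (2.8 and 6.9) come from the text.

```python
from statistics import mean


def fs_ratio(events):
    """Mean fluorescence-to-size ratio over gated events.

    `events` is a list of (fl1, fsc) tuples: FL1 intensity and forward
    scatter for each event. Events with a non-positive size are skipped.
    """
    return mean(fl1 / fsc for fl1, fsc in events if fsc > 0)


def replicate_mean(replicates):
    """Average the per-replicate F/S ratios (e.g. biological triplicates)."""
    return mean(fs_ratio(events) for events in replicates)


# Hypothetical gated events for one strain, three replicates.
replicates = [
    [(300.0, 100.0), (450.0, 150.0)],
    [(280.0, 100.0), (320.0, 100.0)],
    [(600.0, 200.0), (150.0, 50.0)],
]
strain_fs = replicate_mean(replicates)

# Fold range between the highest (E4) and lowest (E3) F/S values at 24 h,
# using the two values reported in the text.
fold_range = 6.9 / 2.8

print(f"mean F/S = {strain_fs:.2f}, fold range = {fold_range:.1f}")
```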

The variability between transformants is probably related to the integration sites of the expression plasmid in the genome of the strains or to the integrated copy number of egfp. To test this assumption, we decided to sequence the genomes of the 8 transformants using Illumina short paired-end read technology.

Fluorescence quantification differences seem to be linked to the expression cassette copy number

The Cl847 mutant was obtained from the RutC30 strain by six steps of mutagenesis (Durand et al. 1988). As the RutC30 genome has been previously sequenced (Koike et al. 2013; Le Crom et al. 2009; Jourdier et al. 2017), this genome was used as the reference for all alignment procedures. One of the challenges in identifying the insertion site within the genome is the presence of the tef1 gene in both the genomic DNA and the plasmid. Initially, a whole genome alignment was conducted on the RutC30 genome using the Geneious mapper, a tool capable of detecting structural rearrangements. However, this approach did not reveal any insertion sites. To advance this investigation, a strategy was developed to pinpoint the integration sites by collecting hybrid paired reads, following the methodology previously employed by Takeda et al. (2018). Paired reads could be classified as follows (Fig. 2A): pairs with both reads mapping to the genome (PR1) or both to the plasmid (PR2), pairs with one read mapping to the plasmid and the other to the genome (hybrid paired reads, PR3), and pairs with one read mapping to either the plasmid or the genome and the other mapping nowhere (PR4). The PR4 group includes the chimeric reads, i.e., those that straddle the genome and the plasmid. To recover the PR3 group (Fig. 2B), a first alignment was performed on the genome, and the unused paired reads were collected and mapped on the plasmid. In both steps, only pairs in which both reads mapped were kept. The unused paired reads generated from this analysis were expected to contain mainly hybrid paired reads. Finally, the PR3 group was aligned to the plasmid without restriction on the mapping of both reads of a pair, and a de novo assembly was performed with the unused reads treated as unpaired. Contigs that were not AT-rich or homopolymers were mapped to the genome, but no insertion sites could be identified using this pipeline.
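The PR1 to PR4 classification described above can be sketched as a small decision function. This is only an illustration of the logic: it assumes that each read of a pair has already been assigned to "genome", "plasmid", or nothing (None), and it ignores reads that match both references, such as those falling in the tef1 promoter. The function name and the example pairs are hypothetical and do not reproduce the Geneious workflow.

```python
from collections import Counter


def classify_pair(mate1, mate2):
    """Classify a read pair as PR1, PR2, PR3, PR4, or 'unmapped'.

    Each mate is 'genome', 'plasmid', or None when it maps nowhere.
    """
    targets = {mate1, mate2}
    if targets == {"genome"}:
        return "PR1"  # both reads on the genome
    if targets == {"plasmid"}:
        return "PR2"  # both reads on the plasmid
    if targets == {"genome", "plasmid"}:
        return "PR3"  # hybrid pair: one read on each reference
    if None in targets and len(targets) == 2:
        return "PR4"  # one read mapped, its mate mapped nowhere
    return "unmapped"  # neither read mapped


# Hypothetical mapping results for five read pairs.
pairs = [
    ("genome", "genome"),
    ("plasmid", "plasmid"),
    ("plasmid", "genome"),
    ("genome", None),
    (None, None),
]
counts = Counter(classify_pair(m1, m2) for m1, m2 in pairs)
print(dict(counts))
```

In this scheme, junction-spanning pairs end up in PR3, while reads that themselves straddle a junction tend to fail end-to-end mapping and fall into PR4.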

Fig. 2 Strategy for identifying the integration sites of the plasmid. A Classification of paired-end reads according to their localization in the transformant genomes. PR1, reads mapping to the genome; PR2, reads mapping to the plasmid; PR3 or hybrid reads, one read mapping to the plasmid and the other one to the genome; PR4, reads with one mapping to either the plasmid or the genome whereas the second one mapped nowhere. B Pipeline to recover the hybrid paired reads and localize the plasmid integration site

Interestingly, mapping the reads onto the plasmid revealed significantly higher coverage than on the genome, suggesting the possibility of multicopy integration. To ascertain the number of copies present within the genome, all reads were aligned against the plasmid, and the copy number of the hph, ptef1, and egfp elements was estimated by comparing their coverage with that of the genome (Table 1). The egfp copy number is mostly in agreement with the fluorescence quantification, with E2, E4, and E6 (6.1, 9.8, and 7.1) having the highest copy numbers and E5, E8, and E9 the lowest (4.8, 3.7, and 3.5). In contrast, no correlation was observed between fluorescence measurements and copy number for strains E1 and E3. This result could be explained by recombination mechanisms during plasmid integration. This hypothesis is supported by the high variability in the coverage of certain elements (263 to 424 for the tef1 promoter in strain E1) and by the variable number of copies of the plasmid elements (5.3 copies of the tef1 promoter for 6.9 copies of egfp in strain E3). Additionally, a linear regression model between the GFP expression levels from Fig. 1 and the egfp gene copy number from Table 1 shows a general trend of correlation; however, the adjusted R² remains moderate (0.48), and the model is barely significant (p-value = 0.0581). The E3 strain alone is responsible for weakening the correlation. Indeed, if the same model is constructed without E3, the trend becomes highly significant (p-value = 0.0023), and the adjusted R² is much improved (0.86). According to Fig. 3, E3 is the strain for which the plasmid integration event was the most random, with numerous recombination events, as the entire plasmid is found only once, unlike in E1, for instance. We can therefore hypothesize that, despite good coverage, there is little functional GFP in E3. Nevertheless, a positive linear correlation can still be posited between the number of egfp gene copies obtained through sequencing and the protein expression measured by flow cytometry.
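The coverage-based copy-number estimate can be illustrated with a short sketch. It assumes that the average genome coverage corresponds to a single copy, so that the copy number of a plasmid element is simply its average coverage divided by the genome coverage. The coverage values below are hypothetical placeholders, not the values of Table 1, and the function name is illustrative.

```python
def copy_number(element_coverage, genome_coverage):
    """Estimate the copy number of an element from read coverage.

    Assumes the genome-wide average coverage represents one copy.
    """
    if genome_coverage <= 0:
        raise ValueError("genome coverage must be positive")
    return element_coverage / genome_coverage


# Hypothetical average coverages for one transformant.
coverage = {"genome": 50.0, "ptef1": 300.0, "egfp": 250.0, "hph": 240.0}

estimates = {
    name: round(copy_number(cov, coverage["genome"]), 1)
    for name, cov in coverage.items()
    if name != "genome"
}
print(estimates)
```

With such an estimate, the value obtained for ptef1 would also count the native promoter copy at the tef1 locus, which is one reason why promoter and egfp estimates need not coincide.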

Table 1 The average coverage for the genome, the promoter tef1 (ptef1) and the egfp and hph genes obtained with the Illumina short-read platform. The copy number is an evaluation based on the genome coverage
Fig. 3 Annotation of the different expression plasmid elements inserted into the genome of the transformed strains. A Plasmid vector for transformation and native chromosome. B Multicopy integrations as observed for 4 transformants. Abbreviations: P promoter, T terminator, E egfp, H hygromycin; green arrows indicate functional copies of egfp; black arrows indicate dynamics of insertion of the circular plasmid into the genome

Multicopy integration, which may have occurred either at one site or at multiple locations within the genome, could explain why the integration sites could not be identified using the short-read sequencing platform.

Identification of a unique integration site in all strains with long-read sequencing

In a second attempt to identify the integration site, long-read sequencing (LRS) was performed using Oxford Nanopore Technologies (ONT). We hypothesized that long reads might overlap the inserted plasmid and the flanking regions. No sequence in the vector shows homology to the genome as significant as that of the tef1 promoter. All other regions comprising the vector (promoter, terminator, ori, marker, etc.) do not originate from T. reesei, except for the cbh1 terminator located downstream of egfp. Among the transformants studied, none exhibits insertion at the cbh1 locus. The scientific literature concurs that for precise locus insertion, the sequence must be flanked by 1 kb of homology on both sides (Ma et al. 2023). In the absence of CRISPR-like tools or strains deficient in homologous repair systems, locus insertion occurs in 1 out of 10 cases (Schuster et al. 2012). In our study, this insertion frequency is 8 out of 8. The egfp-containing plasmid consistently integrated upstream of the tef1 promoter, even without two homologous flanks.

Four of the eight transformants (E1, E3, E4, and E5) were selected as representative of the range of egfp expression for this experiment. To identify the overlapping reads, a double mapping to the plasmid and the genome was performed, and the chimeric plasmid/genome reads were recovered. As the plasmid contains sequences also present in the genome (cbh1 terminator and tef1 promoter), the chimeric reads had to be filtered to select the truly overlapping ones. This resulted in about 50 reads per strain, with a minimum read size ranging from 1150 to 4230 bp and a maximum ranging from 67,240 to 109,977 bp (Fasta files of this analysis are available in Supplementary Data File 1).
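The need for this filtering step can be illustrated with a minimal sketch. The text does not detail the filter itself, so the rule used here is an assumption: a read is kept as a true plasmid/genome junction only if it carries at least one plasmid-specific element and at least one genome-specific element, since the shared tef1 promoter and cbh1 terminator cannot discriminate between the two sources. All feature names, read identifiers, and the "genome:" prefix are hypothetical.

```python
# Elements found only on the plasmid (assumed naming; the shared tef1
# promoter and cbh1 terminator are deliberately left out of this set).
PLASMID_ONLY = {"egfp", "hph", "backbone"}


def is_junction_read(features):
    """Return True if a read spans a true plasmid/genome junction.

    `features` lists the annotations found on the read; genome-specific
    annotations are assumed to carry a 'genome:' prefix.
    """
    has_plasmid_only = any(f in PLASMID_ONLY for f in features)
    has_genome_only = any(f.startswith("genome:") for f in features)
    return has_plasmid_only and has_genome_only


# Hypothetical annotated long reads.
reads = {
    "read_1": ["genome:flank", "ptef1", "egfp"],     # true junction
    "read_2": ["ptef1", "egfp", "tcbh1"],            # plasmid only
    "read_3": ["genome:tef1_gene", "ptef1"],         # native locus only
}
kept = [name for name, features in reads.items() if is_junction_read(features)]
print(kept)
```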

As one of the disadvantages of ONT is the high error rate, especially at the ends of the reads (van Dijk et al. 2023), it may be inefficient to analyze the sequences by searching for nucleotide similarity. Therefore, to reconstruct the insertion event, a functional annotation of the selected reads was carried out using the “annotate from” tool of the Geneious software with a similarity parameter of 60% and the plasmid sequence and the genome as annotation sources.

As previously suggested by the analysis of the Illumina data, we detected a multicopy insertion of the plasmid. Surprisingly, the insertion site was identical in all four transformants and was located upstream of the tef1 gene. Unfortunately, none of the selected reads included both flanking regions and the multiple plasmid insertions. Nevertheless, a detailed study of the reads allowed us to reconstruct the inserts (Fig. 3). For the four strains, at the 3′ border, the tef1 gene is always associated with a tef1 promoter of the same size as that of the plasmid, and at the 5′ border, the first annotated plasmidic region at the junction with the genome is a tef1 promoter. We cannot determine whether these two tef1 promoters are of plasmidic or genomic origin. Nevertheless, a tef1 promoter sequence of the same size as the plasmid one is probably essential for the viability of the cells, since it is never found truncated upstream of tef1.

The reconstructed insertion events highlight two types of process: a cyclic multicopy insertion of the plasmid with (E3, E4, and E5) or without (E1) recombination. In addition, some copies of the expression cassette are truncated, resulting in partial or no expression of egfp (E3 and E4). We therefore inferred the copy number of the expression cassette by considering only the complete copies, with promoter and terminator. As expected, the intact copy number correlates with the egfp expression level detected in the FCM experiment, with 8 copies for E4, 7 for E1, 5 for E5, and 4 for E3.
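Counting only complete cassettes can be sketched as follows, using the one-letter abbreviations of Fig. 3 (P promoter, E egfp, T terminator, H hygromycin). The sketch assumes that the reconstructed insert is available as an ordered list of elements in a single orientation and that an intact cassette is a promoter immediately followed by egfp and then a terminator. The example layout is hypothetical and does not correspond to any of the strains.

```python
def count_intact_cassettes(elements, cassette=("P", "E", "T")):
    """Count complete promoter-egfp-terminator units in an ordered insert.

    `elements` is the reconstructed insert as a sequence of one-letter
    element codes; truncated units (e.g. P-E without T) are not counted.
    """
    size = len(cassette)
    return sum(
        1
        for start in range(len(elements) - size + 1)
        if tuple(elements[start:start + size]) == cassette
    )


# Hypothetical insert: two complete cassettes, one copy lacking its
# terminator (P-E-H) and one copy lacking egfp (P-T).
layout = list("PETHPETHPEHPT")
intact = count_intact_cassettes(layout)
print(intact)
```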

Discussion

In this paper, we describe the identification of the insertion site of an uncut plasmid containing an egfp expression cassette with the tef1 promoter and the cbh1 terminator in the genome of eight transformed strains. Strikingly, the location of the integrated circular plasmid was identical in all strains, upstream of the tef1 gene. This integration site bias may be explained by the construction of an expression cassette with the tef1 promoter but without the tef1 terminator, which would favor homologous recombination in the promoter region. In T. reesei, only a single copy of the tef1 gene is present in the genome (Nakari et al. 1993). Although no tef1 knockout experiment has been reported in T. reesei, we can infer from other fungal species (Cottrelle et al. 1985; Silar et al. 2000) that inactivation of tef1 is probably lethal to the cells. Therefore, we can assume that only strains with an insertion that conserved an intact tef1 gene and promoter were able to survive. It should be noted that the transformants examined in this study were not selected for a high fluorescence phenotype but were randomly selected, suggesting that multicopy insertion upstream of tef1 should occur in the majority of strains. Other teams have reported the use of a similar expression cassette, namely a tef1 promoter with a cbh1 terminator (Uzbas et al. 2012; Nakari-Setälä and Penttilä 1995; Dashtban and Qin 2012), but none of them mentioned multicopy (> 2 copies) integration. These differences could be explained by the low number of transformants tested (Nakari-Setälä and Penttilä 1995), the choice to select transformants with only one copy (Uzbas et al. 2012), or the transgene sequence itself.

In this study, we have shown that an efficient system for multicopy integration can be achieved by using an expression cassette with the tef1 promoter but without the tef1 terminator, carried on a circular plasmid. As already mentioned by Nakari-Setälä and Penttilä (1995), producing heterologous proteins on glucose limits contaminating proteins, since the hydrolytic enzymes produced by T. reesei are mostly repressed in the presence of glucose. An important step toward the production of heterologous proteins on an industrial scale could be the coupling of the tef1 promoter for constitutive expression on glucose with the multicopy integration reported here. Finally, the tef1 region seems to be a recombination hotspot. It should be recalled that tef1 has proven to be an essential gene (Cottrelle et al. 1985; Silar et al. 2000), which is constitutively expressed (Nakari et al. 1993). Recently, a study demonstrated that the chromatin state of promoters of essential genes is actively maintained as open to ensure their transcription (Fan et al. 2021). Thus, we may hypothesize that the chromatin state of the tef1 promoter favors homologous recombination.