When a protein is overexpressed heterologously, secreting it into the extracellular environment is generally preferable to accumulating it inside the cell. Considerable interest in the native extracellular proteome of a cell has driven the development of a number of strategies for determining the pool of proteins that make up the cell secretome, providing many tools and much knowledge that can be adopted for heterologous protein overexpression. Experimental methodologies designed for secretome studies include secretion traps or signal sequence traps, mass spectrometry, and Serial Analysis of Gene Expression (SAGE), as thoroughly discussed in Mukherjee and Mani (2013). A selection of computational methods has also been developed to predict putative secretory proteins based on the occurrence of an SP-encoding sequence in a given ORF. Advanced computational tools make it possible to predict secretory proteins that traverse the classical pathway, like SignalP (Petersen et al. 2011), or those that lack a conventional SP and are processed through a non-classical pathway, like SecretomeP (Bendtsen et al. 2004). In this study, we systematically analyzed the complete genome of Y. lipolytica strain CLIB122 (Dujon et al. 2004) for ORFs containing a putative SP, inferred from a consensus structure of the two major secretory proteins of Y. lipolytica cells, AEP and LIP2. The SP-encoding regions of all the selected ORFs (Table 2) were analyzed using the SignalP (Petersen et al. 2011) and PrediSi (Hiller et al. 2004) tools to predict the extent of the leader domain and to assess the probability of cleavage by signal peptidase.
According to literature data (e.g., Yang et al. 2006; Yarimizu et al. 2015), a typical “Sec-type” yeast signal peptide comprises (i) an N-domain bearing at least one positively charged AA residue (R or K); (ii) an H-domain (hydrophobic core) built from a tract of hydrophobic AA residues (e.g., A, L, V, F, C, Y, W, I, M) that forms an alpha-helix, which is essential for translocation of the polypeptide through the cellular membrane; and (iii) a C-domain containing a “helix-breaking” or polar residue (P, E, or G), which facilitates digestion by a specific signal peptidase, and ending with the consensus sequence A-X-A (X, any residue) recognized by that peptidase. Although the general secondary structure of an operable SP has been established, a strict consensus sequence has not been determined, as the individual elements building an SP are highly variable in length and have no obvious sequence homology. Several comprehensive studies on the impact of individual site mutations in the N-terminal sequence on the secretory potential of an SP operating in a particular host system have been reported to date (Rakestraw et al. 2009; Viña-Gonzalez et al. 2015; Yarimizu et al. 2015). In the SPs under study, the first component of a typical SP (a positively charged N-terminus) was identified in all the selected signal peptides but one (sp TlGAMY NATIVE), while the second determinant, a hydrophobic alpha-helix, was found to be a common structural element, as hydrophobic residues were abundant in the central region of all the peptides. The crucial importance of the N-terminal positively charged amino acid (K, but also operability of R, N, W, and F) and of the following hydrophobic core stretch in an SP was elegantly demonstrated in a comprehensive study on heterologous proteins expressed in Kluyveromyces marxianus cells (Yarimizu et al. 2015).
In that paper, individual amino acids were deleted serially and the effect of each deletion was studied. It was concluded that the extent of the crucial hydrophobic core was defined by the N-terminal basic and C-terminal non-hydrophobic amino acids (e.g., E or P). Clear marking of these boundaries was essential for SP operability; however, the length of the hydrophobic core differed between individual polypeptides, and an optimal length of the hydrophobic helix could be determined for each of the proteins analyzed in that study (of bacterial, fungal, and human origin). In this study, the cutting site was first predicted for each SP followed by its own native polypeptide sequence. The output SignalP D scores (column NATIVE sp–NATIVE pp in Table 2) can roughly reflect the secretory potential of the SP (the D score is the value SignalP uses to discriminate SPs from non-SP sequences with a given confidence). Based on these computational analyses, the engineered spLip2pre-3xLA SP (Ledesma-Amaro et al. 2015), equipped with a hydrophobic stretch of three LA dipeptide repeats (-LALALA-), was identified as the SP processed by signal peptidase with the highest confidence (D score of 0.874), while its un-engineered counterpart (spLip2 native) was identified as an SP with the weakest confidence among the analyzed SPs (D score of 0.623). For all the remaining SPs equipped with their native polypeptides, the calculated D score values fell between these two boundary values (Table 2).
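The three-part SP architecture described above (a positively charged N-domain, a hydrophobic core, and a terminal A-X-A motif) can be illustrated with a rough rule-based check. This is only a sketch and not a substitute for SignalP or PrediSi: the function name `check_sp`, the fixed N-domain length of five residues, and the 0.7 hydrophobic-fraction threshold are assumptions made for illustration, and real SP domains vary in length.

```python
import re

# Residue classes as listed in the text above.
HYDROPHOBIC = set("ALVFCYWIM")
POSITIVE = set("RK")


def check_sp(sp, n_len=5, min_core_fraction=0.7):
    """Rough check of the N-domain / H-domain / A-X-A layout.

    n_len and min_core_fraction are illustrative assumptions,
    not values taken from the literature.
    """
    sp = sp.upper()
    n_region = sp[:n_len]      # assumed N-domain
    core = sp[n_len:-3]        # assumed hydrophobic core
    c_motif = sp[-3:]          # last three residues before the cut site
    if core:
        fraction = sum(aa in HYDROPHOBIC for aa in core) / len(core)
    else:
        fraction = 0.0
    return {
        "positive_n_terminus": any(aa in POSITIVE for aa in n_region),
        "core_hydrophobic_fraction": round(fraction, 2),
        "hydrophobic_core": fraction >= min_core_fraction,
        "axa_motif": re.fullmatch(r"A.A", c_motif) is not None,
    }


# S14 variant of the consensus SP reported in this section (Table 1.B).
print(check_sp("MKFSAALLTAALASAAAAA"))
```

Because the domain boundaries are fixed windows, the check can misjudge SPs with long N-domains or short cores; it is meant only to make the three criteria concrete.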
With the advent of modular cloning strategies, like Gibson assembly, Golden Gate, or Gateway, genetic engineering toolboxes have greatly expanded, making research more versatile, more comprehensive, and higher in throughput. Adapting a given sequence as a “biobrick” to a modular cloning strategy requires fitting the sequence into a pre-designed scaffold. Since all of the in silico predicted signal peptidase recognition sites were preceded by an alanine residue encoded by, e.g., a GCC codon, this motif was included in the novel x overhang. Analysis of the penultimate codon sequence showed that a T nucleotide in the first position of the 4-nt overhang is the optimal choice, as it required changes in the smallest number of SP amino acid sequences. Nevertheless, in two cases (YALI0B03564g and spTlGAMY NATIVE), the amino acid residue directly preceding the terminal alanine had to be modified because of the T nucleotide introduced at the third position of the penultimate codon (M ➔ I and E ➔ D, respectively; see Table 2). A change was presumed admissible provided that the character of the encoded amino acid was maintained (hydrophobic ➔ hydrophobic and negatively charged ➔ negatively charged, respectively). Consequently, all the SP-encoding sequences were terminated with a TGCC overhang accompanied by a BsaI cutting site in the appropriate orientation to comply with the previously adopted standard. These SPs were subsequently fused in silico to the amino acid sequences of the two amylolytic enzymes under study (SoAMY or TlGAMY) devoid of their native signal sequences. The resulting hybrid sequences were again subjected to in silico prediction of the signal peptidase cutting site (column MODIFIED sp–pp of interest; Table 2). For the two modified SPs (YALI0B03564g and spTlGAMY NATIVE), the computational analysis was additionally repeated with their native polypeptides.
In the case of YALI0B03564g, the necessary change (M ➔ I) slightly decreased the D score value, while in the case of spTlGAMY NATIVE, the change had the opposite effect. As shown in Table 2, depending on the combination of the SP and the following protein, the D score values obtained with SoAMY or TlGAMY (MODIFIED sp–pp of interest) could be either higher or lower than those obtained with the native polypeptides (NATIVE sp–NATIVE pp). This indicates the importance of the global structure of a given SP, covering the positively charged N-terminus, the hydrophobic core, and the polar C-terminus, the last of which differed depending on the secreted protein (QK for SoAMY, RP for TlGAMY). Yet, the engineered spLip2pre-3xLA was again indicated as the SP with the highest confidence (D scores of 0.815 and 0.833 for SoAMY and TlGAMY, respectively), followed by the spSoAMY leader sequence.
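The effect of forcing a TGCC overhang onto the 3' end of an SP-encoding sequence can be sketched as follows. The sketch assumes the standard genetic code and an input coding sequence that ends with the terminal alanine codon; the function name `fit_to_tgcc` and the example sequences are hypothetical and are not the actual SP sequences from Table 2. The BsaI site itself is not modeled.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def fit_to_tgcc(sp_cds):
    """Make an SP coding sequence end with ...NNT GCC.

    Returns the adjusted sequence and the (old, new) residue encoded
    by the penultimate codon, so a substitution can be spotted.
    """
    sp_cds = sp_cds.upper()
    if len(sp_cds) < 6 or len(sp_cds) % 3:
        raise ValueError("expected an in-frame sequence of at least two codons")
    penultimate, last = sp_cds[-6:-3], sp_cds[-3:]
    if CODON_TABLE[last] != "A":
        raise ValueError("expected a terminal alanine codon")
    new_penultimate = penultimate[:2] + "T"   # T opens the 4-nt overhang
    adjusted = sp_cds[:-6] + new_penultimate + "GCC"
    return adjusted, (CODON_TABLE[penultimate], CODON_TABLE[new_penultimate])


# Hypothetical 3' ends: ATG (M) becomes ATT (I), GAG (E) becomes GAT (D).
print(fit_to_tgcc("ATGAAGATGGCT"))
print(fit_to_tgcc("ATGAAGGAGGCC"))
```

When the two returned residues are identical the change is silent; otherwise it corresponds to a substitution of the kind reported for YALI0B03564g and spTlGAMY NATIVE.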
It has been reported that the average hydrophobicity of a given SP is an important determinant of whether the protein is targeted to the SRP-dependent or SRP-independent secretory pathway and of whether it is translocated co- or post-translationally (Ng et al. 1996). It has also been shown that a putative, optimal signal sequence can be predicted computationally by calculating hydrophobicity values for each amino acid (Kyte and Doolittle 1982; Hiller et al. 2004; Petersen et al. 2011). Yarimizu et al. demonstrated that structure, rather than a strict sequence, determines SP operability and that an effective SP requires an adequate hydrophobic core of defined length and optimal hydrophobicity (Yarimizu et al. 2015). Moreover, substitution of glycines with leucines within the hydrophobic core of the SP (increasing hydrophobicity) improved the interaction between the SP and SRP. However, increasing hydrophobicity above a threshold value was harmful for SP function (Yarimizu et al. 2015). It is also known that individual SPs exhibit different levels of secretory potential in a given species, suggesting that different organisms prefer particular SP structures. It was calculated that in S. cerevisiae, proteins traversing the SRP-independent pathway should have an average hydrophobicity of the 12 residues after the last positively charged residue (HB12 value) of around 2.0 or less, while for proteins targeted primarily by the SRP-dependent pathway, the HB12 values should be significantly higher, around 3.0 or more (Ng et al. 1996). However, further studies in this field clearly demonstrated that hydrophobicity cannot be the sole factor driving this phenomenon (Matoba and Ogrydziak 1998).
For example, it was shown that secondary structure was an important determinant influencing formation of a productive complex between isolated SPs and SRP. Beta-structure and random or unordered structures were favored in aqueous solution, while alpha-helix was favored in nonpolar environments. Matoba and Ogrydziak (1998) also showed that the spatial conformation of an SP is another key factor influencing the secretory rate. As demonstrated, a bend introduced by a proline residue directly after the site cut by a dipeptidase enabled co-translational translocation of the AEP protein, while upon elimination of the proline-driven bend, the protein was not translocated. This in turn could be partially alleviated by increasing the hydrophobicity of the SP. Ultimately, it was shown that the targeting pathway preference and secretory potential of a given SP can be engineered by mutations that have little or no effect on signal peptide hydrophobicity but do affect spatial conformation, and that bent polypeptides with a larger radius of gyration interact more readily with SRP, while for more linear SPs, the affinity for SRP can be engineered by increasing hydrophobicity. Still, the authors stated that there must be further, unidentified factors driving the secretory potential of SPs, like amphiphilicity, hydrophobic moment, molecular hydrophobicity potential, or a slower rate of synthesis allowing for longer interaction with SRP (Matoba and Ogrydziak 1998). The latter statement agrees well with the data obtained in this study: we were able to identify superior and inferior SPs, whether fused with SoAMY or TlGAMY, suggesting that the performance of these SPs in Y. lipolytica cells is largely independent of the fused protein; however, we could not find any positive correlation between experimental data on secretory efficiency and in silico calculated D score or GRAVY values.
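The two hydrophobicity measures referred to above, GRAVY and HB12, can be computed from the Kyte and Doolittle (1982) hydropathy index as in the following sketch. It assumes that the input is the SP sequence alone and that its last R or K residue belongs to the N-domain; the function names and the `pathway_hint` labels are illustrative, and the 2.0 and 3.0 cut-offs are the approximate S. cerevisiae values quoted from Ng et al. (1996), not thresholds established for Y. lipolytica.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathy over the whole sequence."""
    seq = seq.upper()
    return sum(KD[aa] for aa in seq) / len(seq)


def hb12(sp, window=12):
    """Mean hydropathy of the `window` residues after the last R/K.

    If fewer than `window` residues follow, the available ones are averaged.
    """
    sp = sp.upper()
    positives = [i for i, aa in enumerate(sp) if aa in "RK"]
    if not positives:
        raise ValueError("no positively charged residue (R or K) found")
    start = positives[-1] + 1
    stretch = sp[start:start + window]
    if not stretch:
        raise ValueError("no residues follow the last R or K")
    return gravy(stretch)


def pathway_hint(value, low=2.0, high=3.0):
    """Approximate reading of an HB12 value (cut-offs are assumptions)."""
    if value <= low:
        return "SRP-independent range"
    if value >= high:
        return "SRP-dependent range"
    return "intermediate"


s14 = "MKFSAALLTAALASAAAAA"   # consensus SP, S14 variant (Table 1.B)
v14 = "MKFSAALLTAALAVAAAAA"   # consensus SP, V14 variant (Table 1.B)
print(f"{gravy(s14):.3f}")     # 1.468
print(f"{hb12(v14):.3f}")      # 2.158
```

For the two variants of the consensus SP reported in this section, these functions give 1.468 (GRAVY, S14) and 2.158 (HB12, V14), which coincide with the ends of the hydrophobicity range quoted for that peptide.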
With respect to the predicted secondary structure, SP9, characterized by the lowest hydrophobicity value, complied least with the general SP structure among the SPs under study. As shown by SOPMA analysis, SP9 was interrupted by extended strands and random coils but, most importantly, by beta turns, making it a special case of an SP having a beta turn prior to the C-terminal domain (together with SP3 upon cloning with TlGAMY). Moreover, SP9, unlike the other SPs, contains glycine (G) residues within the core region, suggesting that its structure is rather flexible. While SP9 operated relatively poorly with SoAMY, it functioned better with its native protein. The major difference in the N-terminal region structure between the two proteins is the presence of a proline (P) residue directly after the cutting site, which applies to all the variants with TlGAMY. This in turn results in uniform formation of a random coil structure at the end of the SPs preceding the mature TlGAMY protein, all of which received higher D scores upon SignalP evaluation. The importance of a proline immediately after the cutting site for translocation of the polypeptide was demonstrated by Matoba and Ogrydziak (1998), as discussed above. The SPs that performed best in secretion of TlGAMY, SP3 and SP4, were terminated with random coil and beta turn structures. The helical structures of SP5 and SP6 cloned with TlGAMY, which demonstrated the weakest secretory capacity, were terminated with random coils. However, the same structure was predicted for SP8, SP7, SP1, and also SP10, which, in the three former cases, operated relatively well with this protein, altogether suggesting no straightforward rule for the observed phenomena. On the other hand, when the SPs analyzed in this study were cloned with the SoAMY protein, the C-region of the SPs remained in a helical form in most variants, ending with a relatively large and rigid glutamine (Q) residue followed by lysine (K).
A beta turn structure in the C-terminus of the SPs followed by SoAMY was identified for SP2 and SP4, the two SPs driving high secretion of the protein. The other analyzed SP-SoAMY constructs, including SP1, SP3, SP6, SP7, and SP10, were all terminated by an alpha-helical structure directly prior to the cutting site and drove secretion with variable strength. SP5 and SP9, terminated with a random coil, functioned relatively poorly with SoAMY when compared to the other SPs under study.
Finally, the amino acid sequences of the most potent SPs (SP1, SP2, SP3, SP4, SP7, SP8), which drove secretion of the two proteins under study with the highest efficiency, were aligned and a consensus sequence was inferred from this alignment (Table 1.B). Based on this analysis, the following SP consensus sequence was determined: MKFSAALLTAALA(S:V)AAAAA (a sequence of overrepresented amino acid residues). The computed hydrophobicity values for this synthetic SP ranged between 1.468 and 2.158, depending on the variant (S14 or V14) and on the extent of the analyzed stretch; however, the secondary structure was 100% alpha-helical, which may suggest a preference of Y. lipolytica for such SPs. The calculated D score values (~ 0.7) were comparable irrespective of the following polypeptide, especially for the V14 variant. The consensus sequence determined for the six best SPs corresponded well with the consensus sequence determined for the 38 proteins (deduced from the in silico proteome analyses; MKFSTILL(A:L)AA(A:L)(A:V)(A:L)(A:L)AA-P; Table 1.A) in terms of the properties of the amino acid residues building the SP. Importantly, the consensus sequence derived from the six best SPs showed much lower degeneracy at individual sites.
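How a consensus of overrepresented residues can be read off an alignment is sketched below. The tie rule (reporting equally frequent residues as `(X:Y)`, in the notation used above) and the function name `consensus` are assumptions, and the four short aligned fragments are made up for illustration; they are not the SP sequences from Table 1.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Column-wise majority consensus of equal-length aligned sequences.

    Ties for the most frequent residue are written as (X:Y).
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        if not counts:                 # column made only of gaps
            out.append(gap)
            continue
        top = max(counts.values())
        best = sorted(res for res, n in counts.items() if n == top)
        out.append(best[0] if len(best) == 1 else "(" + ":".join(best) + ")")
    return "".join(out)


# Hypothetical aligned fragments, for illustration only.
toy_alignment = ["MKFSA", "MKLSV", "MRFTA", "MKFTV"]
print(consensus(toy_alignment))   # MKF(S:T)(A:V)
```

A position written as `(X:Y)` is degenerate; fewer such positions means lower degeneracy, which is the property highlighted for the consensus derived from the six best SPs.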
In conclusion, the adopted strategy allowed us to (i) identify novel SPs, not previously described in this context, selected from the complete secretome of Y. lipolytica; (ii) characterize their secretory capacity with respect to two model proteins (heterologous amylolytic enzymes); (iii) compare the novel SPs with those previously described and frequently exploited in secretory expression of heterologous proteins in Y. lipolytica; (iv) indicate the most potent SPs to be adopted as building blocks in the molecular toolbox for engineering Y. lipolytica; and (v) suggest a consensus sequence for a potentially robust synthetic SP to be used in secretory overexpression in Y. lipolytica.