Background

The production of purified proteins is important for several experimental approaches that aim to assign gene function, including antibody generation for immunocytochemistry and immunoprecipitation studies [1–3], in vitro mapping of protein-protein, protein-DNA or protein-RNA interactions [4, 5] and structure determination [6]. The availability of proteins is also important for biomedical applications such as small molecule drug discovery and the production of therapeutic proteins and vaccines. In these situations it is essential to be able to reliably express the proteins in a heterologous system and purify them so that they possess the same folds and structure as they would in a natural in vivo state. To achieve this on a whole proteome scale, a generic approach must be taken to the expression of protein families, unlike the traditional approach of protein chemistry in which the isolation of individual proteins is optimised on a case by case basis. E. coli has been the expression system of choice for the majority of laboratories engaged in high-throughput, multi-plexed cloning, expression and purification of proteins for structural genomics [7]. The advantages of E. coli as an expression host include well studied physiology and genetics, the availability of advanced genetic tools [8–10], rapid growth, high-level protein production rates achieving up to 10–30% of total cellular protein, ease of handling in a standard molecular biology laboratory, low cost and the ability to multiplex both expression screening [11] and protein production [12]. There are, however, several disadvantages of expression in a prokaryotic system, particularly for eukaryotic proteins. The lack of eukaryotic chaperones and specialised post-translational modifications, and the inability to target proteins to sub-cellular locations or to form complexes with stabilising binding partners, can result in protein mis-folding and aggregation. For example, when 2078 randomly selected C. elegans full-length genes were cloned and expressed in E. coli, only 11% yielded soluble protein [13]. Similarly, of 44 cloned human proteins, 12 were expressed solubly and 4 were purified to homogeneity [14]. With the exception of full-length membrane proteins, the property of protein solubility has been shown to be a good indicator of correct folding as determined by functional binding [15, 16] or enzymatic [17] assays. Purification of inclusion bodies and in vitro refolding has been used in a number of cases, but refolding conditions are highly protein specific and so are unlikely to be useful for high-throughput protein expression.

There are several fall-back strategies for expression of correctly folded eukaryotic proteins in E. coli, one of which is to truncate long multi-domain proteins into separate domains, as has been performed for the Ephb2 receptor [15, 18, 19]. Reducing translation rates, so that proteins have an increased chance of folding into a native state before aggregating with folding intermediates, can be achieved by lowering the temperature after induction [20] or by inducing with lower concentrations of IPTG [21]. Alternative approaches include: co-expressing stabilising binding partners (see review [7]) or chaperones [22]; the induction of chaperones by heat shock [23] or chemical treatment [24]; or the use of genetically modified host-strains that can conduct oxidative protein folding in the cytoplasm [25, 26], over-express rare tRNAs [27] or lipid rafts [28]. Perhaps one of the most successful generic strategies to enhance the expression of soluble proteins is fusion with solubility enhancing tags, such as maltose binding protein (MBP), thioredoxin (Trx) and glutathione-S-transferase (GST) [29–31].

The aim of this work was to ask if it is possible to derive some general conclusions regarding which expression strategy would most likely result in the expression of soluble, functionally active mammalian protein on a family-by-family or domain-by-domain basis. A deep-mining approach was taken to maximise the chances of successful expression by examining the soluble expression of 30 different proteins using 14 different expression vectors. This study allowed us to make several conclusions regarding the best strategies to adopt for the soluble expression of different mammalian proteins in bacteria. The conclusions were tested by the expression of an additional 95 mammalian proteins.

Results

Expression clone construction

The 30 proteins chosen for this expression study are listed in Table 1. With the exception of GFP, they are all human or mouse proteins, and represent several diverse protein families with extra-cellular, cytoplasmic and nuclear cell locations. The list includes a mixture of full-length and truncated proteins expected to be easy or more challenging to express in a bacterial system. Protein truncations were designed to express individual domains annotated from the SwissProt [32] or Pfam [33] databases or following previous examples of successful expression [15]. The genes were isolated from cDNA using a nested PCR strategy [34] or provided by the FlexGene Consortium http://www.hip.harvard.edu/flex_gene/index.htm and sequence confirmed. A recombinational cloning strategy was employed termed "GATEWAY" cloning [35, 36] based on a modification of the phage lambda site-specific recombination system [37]. Primers were designed using the nearest neighbour algorithm [38] and open reading frames (ORFs) were PCR amplified from first strand cDNA with 5' attB1 and 3' attB2 linkers and then recombined with pDONR221 (Invitrogen) to give a set of entry clones which were sequence confirmed and then recombined with various destination vectors to give the expression constructs. Two sets of clones for each ORF were generated with and without stop codons for expression with N or C-terminal tags respectively. Recombinational cloning was useful in this study where the same set of ORFs could be cloned into a large set of different expression vectors without the requirement to check for compatible restriction sites in each vector or their absence within the ORFs.

Table 1 Proteins for expression study with selected features

For this study a set of destination vectors was constructed by modifying pET-DEST42 (see Materials and Methods). The T7 promoter was chosen over other promoters commonly used for bacterial expression because of the high specificity and processivity of T7 RNA polymerase and the wide choice of expression strains currently available. Briefly, multicloning sites were created either 5' of the attR1 or 3' of the attR2 recombination sites for insertion of DNA inserts encoding N or C-terminal tags respectively. The expression vectors contained a T7lac promoter [39] for improved control of basal expression. The N-terminal tag expression vectors contained a sequence at the translational start site to provide a partial match with the down-stream box (ATG AAT CAC CAT), shown to provide enhancement of translation [40], and a decahistidine (H10) tag for enhanced affinity for Nickel resins compared with hexahistidine (H6) tags (data not shown). A fusion partner was inserted between the H10 tag and the recombination sites to examine the effect on soluble protein expression. Unlike previous tag comparisons [29–31], here the same promoter and 5'-UTR sequence were employed so that any expression differences observed would be purely due to the presence of the fusion partner. A vector with a T7 promoter and no downstream lac operator (pDEST17), which adds a H6 tag at the N-terminus, was also included in this study (Figure 1).

Figure 1

Expression vector constructs after recombination between the destination and entry plasmids. (A) Schematic representation where shaded and clear boxes indicate translated and untranslated regions respectively. T7 = T7 RNA polymerase promoter, lacO = lac operator, SD = Shine-Dalgarno, H6 or H10 = hexahistidine or decahistidine, attB1 or attB2 = attB recombination sites, ORF = open reading frame, stop = stop codon, fusion = protein fusion (MBP, GFP, GST, Trx, DHFR or Dhfr), V5 = V5 epitope. (B) and (C) DNA sequences of pDEST-N112 and pDEST-C102 respectively from T7 RNA polymerase promoter to stop codon.

Effect of different N-terminal fusions on expression

Expression plasmids generated by recombination reactions were used to transform E. coli BL21(DE3), an expression strain containing a chromosomally integrated T7 RNA polymerase gene (λDE3 lysogen) under the control of the lacUV5 promoter. To handle the large number of expression experiments (420 in total) and the associated manipulations to screen for total and soluble expression in E. coli, the recombinational cloning, transformation, growth of expression cultures, cell lysis and filtration separation of insoluble protein were performed in 96-well plate format. Figure 2 shows Western blots for total and soluble protein expression 2 hours after induction with 1 mM IPTG as described in Materials and Methods. The method for separating total from soluble proteins was based on that of Knaust and Nordlund [11] and consisted of detergent lysis of harvested cells followed by filtration through a 0.65 μm 96-well filter plate, which separates larger inclusion bodies from the soluble fraction. The filtration method agrees well with traditional centrifugation methods to separate soluble from insoluble protein [11, 41] and has the advantage that multiple samples can be processed in parallel. Quantitation was achieved by separating the proteins by SDS-PAGE, electro-blotting onto PVDF membranes and detecting His tagged proteins with an anti-His5 monoclonal antibody followed by probing with an anti-mouse Cy-5 labelled antibody. The advantage of expression analysis by Western blot, compared with dot-blots, is that it allows one to quantitate the expression levels of full-length constructs and eliminate the contribution from cleaved protein tag. It was found that Western blots based on fluorescence detection [42] gave a greater dynamic range of detection than those based on enzymatic amplification, such as with horseradish peroxidase (data not shown). A His-tagged protein molecular weight ladder was used for normalisation to eliminate any blot to blot variation. Table 2 shows the results of this analysis, quantitating expression yields in terms of mg expressed protein per litre of induction media for total and soluble expression. Expression yields greater than 2 mg/l are highlighted in bold.

Figure 2

Effect of N-terminal fusion on protein expression. Total (A) and soluble (B) expression for proteins 1–30 (Table 1) with various N-terminal fusion partners analysed by SDS-PAGE fluorescence western blots as described in Materials and Methods. Expression plasmids employed were (a) pDEST17, (b) pDEST-N110 or pDEST-N112 with either (c) MBP, (d) GFP, (e) GST or (f) Trx inserted between the DraIII and BfrBI sites as shown in Figure 1.

Table 2 N-Terminal fusion expression comparison

Looking first at the results for total (soluble and insoluble) expression, no clear patterns emerge for the various expression vectors used. With the exception of CASP2, CDKN2A, Trp53, EGFR(TK), FOS and CD44, most proteins expressed well across all expression vectors. Interesting differences are apparent, however, when one looks at the production of soluble protein. Using decahistidine green fluorescent protein (H10-GFP) or decahistidine glutathione-S-transferase (H10-GST) as fusion partners at the N-terminus gave poor yields of soluble intact product. This may not be because they were poor at promoting soluble expression but because they were prone to proteolysis during cell lysis, reducing the yield of full-length soluble protein. A set of proteins (GFP, RAF1(Ras-bd), HRAS, mdm2(p53-bd), Ephb2(TK) and CCND2) gave high soluble expression levels in the baseline N-terminal decahistidine vector, which was not improved when expressed as decahistidine thioredoxin (H10-Trx) or decahistidine maltose binding protein (H10-MBP) fusions. The molecular weight of these proteins ranged from 9 – 35 kDa and averaged 22.8 kDa. These proteins are all expressed in the cytoplasm, have an average of 1 low-complexity region, 3.8 contiguous hydrophobic amino acids (hp_aa), pI of 6.6, grand average of hydropathicity index (termed GRAVY [43], where an increased positive number indicates increased hydrophobicity) of -0.32, 2.6% cysteine residues and no coiled-coil structures. A second grouping of proteins was observed where soluble expression was improved when expressed as H10-Trx or H10-MBP fusions compared with the H10 tag alone. This grouping included GRB2, Efnb2(EC1 or 2), MAD, MAX, Efna1 (FL and EC). The molecular weight of these proteins ranged from 16 – 25 kDa and averaged 20.5 kDa. These proteins were a mixture of those expressed in the cytoplasm, nucleus and extra-cellular, have an average of 0.71 low-complexity regions, 3.6 contiguous hydrophobic amino acids (hp_aa), pI of 6.8, GRAVY score of -0.79 and 1.7% cysteines. A third set of proteins resulted in almost undetectable soluble expression with a H10 tag but good expression with H10-Trx or H10-MBP fusions. These included CDK2, FLI1, CDKN-1B, mdm2, GATA2, Ephb2(LB) and CASP2 with molecular weights ranging from 22.5 – 54.5 kDa, with an average molecular weight of 40.4 kDa. These proteins were also a mixture of cytoplasmic, nuclear and extra-cellular proteins, have an average of 2 low-complexity regions, 5 contiguous hydrophobic amino acids (hp_aa), pI of 6.9, GRAVY score of -0.55 and 2.3% cysteines. Finally a set of proteins was grouped (MMP1, FOS, EGFR(TK), Trp53, CD44) where very low (< 1 mg/l) soluble full-length expression was observed, even when expressed as MBP or Trx fusions. Here the molecular weight ranged from 40.7 – 81.6 kDa and averaged 51.4 kDa. These proteins were a mixture of those expressed in the cytoplasm, nucleus and extra-cellular, have an average of 3 low-complexity regions, 5.6 contiguous hydrophobic amino acids (hp_aa), pI of 5.7, GRAVY score of -0.50 and 1.8% cysteine content.

Comparing the 20 mammalian proteins for which there are examples in all 6 expression vectors, the average yields of soluble protein for the H10, H10-GFP, H10-GST, H10-Trx and H10-MBP tags are 3.3, 1.0, 1.4, 6.0 and 5.8 mg per litre of culture respectively. This ranks the ability of the tag fusions to produce full-length soluble protein as H10-Trx ~ H10-MBP > H10 > H10-GST > H10-GFP. The pDEST17 vector (which encodes a H6 tag) was dramatically poorer at expressing soluble protein than the vector pN110 (which encodes a H10 tag), with average soluble expression yields of 0.8 and 3.3 mg per litre of culture respectively. Both vectors contain T7 RNA polymerase promoters, but pN110 also contains a lac operator (lacO) downstream of the promoter and the gene encoding the lac repressor (lacI) for tighter control of gene expression. After induction with IPTG, pDEST17 may therefore have a faster rate of transcript synthesis than pN110, and hence a faster translation rate (due to an increased concentration of mRNA). If the translation rate exceeds the rate of protein folding, then increased production of insoluble protein would occur.
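The ranking quoted above follows directly from sorting the reported averages. The short Python sketch below is only an illustration: the dictionary and variable names are ours, and the values are the average soluble yields stated in this paragraph (the pDEST17 figure is included for comparison with the H10 vector).

```python
# Average soluble yields (mg per litre of culture) as quoted in the text.
# Keys are descriptive labels chosen for this sketch, not vector names.
avg_soluble_yield = {
    "H6 (pDEST17)": 0.8,
    "H10": 3.3,
    "H10-GFP": 1.0,
    "H10-GST": 1.4,
    "H10-Trx": 6.0,
    "H10-MBP": 5.8,
}

# Sort the tags from highest to lowest average soluble yield.
ranking = sorted(avg_soluble_yield, key=avg_soluble_yield.get, reverse=True)

for tag in ranking:
    print(f"{tag}: {avg_soluble_yield[tag]:.1f} mg/l")
```

The resulting order places H10-Trx and H10-MBP first and the H6 tag from pDEST17 last, consistent with the ranking given in the text.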

Effect of different C-terminal fusions on expression

A similar study was performed where the 30 ORFs were cloned into the 8 different C-terminal tag expression vectors shown in Figure 1. The C-terminal fusions studied here included V5-H6 or H10 alone, or the protein fusions MBP, GST, GFP, Trx, and murine or human dihydrofolate reductase (Dhfr or DHFR respectively), all with H10 at the C-terminus. The expression screen and quantitation of total and soluble protein expression were performed as for the N-terminal tag study. Figure 3 shows the fluorescence western blots for this C-terminal tag study. Here a greater number of constructs were observed with either undetectable or low levels of expression compared with the N-terminal tag study. Table 3 quantitates the Western blot data for the intact fusion products, with expression yields greater than 2 mg/l in bold. The last row of the table gives the average expression yield for each C-terminal fusion partner. For total protein expression, large differences were observed between the various C-terminal tags. The C-terminal decahistidine tag was particularly poor here, with an average total expression yield of only 0.7 mg/l compared with 7.3 mg/l when this tag was fused to the N-terminus. In contrast, the C-terminal MBP-H10 tag resulted in an average total expression yield of 20.2 mg/l. The ranking of the C-terminal fusion partners in promoting total expression was MBP-H10 > GST-H10 > V5-H6 > Trx-H10 > Dhfr-H10 > DHFR-H10 > GFP-H10 > H10.

Figure 3

Effect of C-terminal fusion on protein expression. Total (T) and soluble (S) expression for proteins 1–30 (Table 1) with different C-terminal fusion partners analysed by SDS-PAGE fluorescence western blots as in Figure 2. Expression plasmids employed were (g) pET-DEST42, (h) pDEST-C101 or pDEST-C102 with either (i) MBP, (j) GST, (k) GFP, (l) Trx, (m) Dhfr or (n) DHFR inserted between the DraIII and BfrBI sites as shown in Figure 1.

Table 3 C-Terminal fusion expression comparison

MBP-H10 was the most effective tag at the C-terminus for promoting protein solubility, with an average full-length soluble yield per construct of 5.0 mg/l, which compares well with an average of 5.8 mg/l when this tag is fused at the N-terminus. The order of C-terminal tags in promoting soluble expression was similar to that for total expression: MBP-H10 > GST-H10 > V5-H6 > Dhfr-H10 ~ GFP-H10 ~ Trx-H10 > H10 ~ DHFR-H10. Thioredoxin was not as effective a solubility enhancing tag when fused at the C-terminus, with an average soluble yield of only 0.7 mg/l compared with 6.0 mg/l when fused to the N-terminus.

Several correlations with protein features are seen when one groups the MBP fusions according to soluble protein expression levels. For the first group, where soluble expression levels were in the range of 5 – 50 mg/l, the average molecular weight, pI and GRAVY score were 20.6 kDa, 5.9 and -0.58 respectively. The average numbers of contiguous hydrophobic amino acids, low complexity and coiled-coil regions were 3.1, 0.56 and 0.22 respectively. The second group displayed soluble expression levels between 1 – 5 mg/l. Here, the average molecular weight, pI and GRAVY score were 25.1 kDa, 7.9 and -0.39 respectively and the average numbers of contiguous hydrophobic amino acids, low complexity and coiled-coil regions were 4.3, 0.71 and 0 respectively. The last group displayed soluble expression levels between 0 – 1 mg/l. Here the average molecular weight, pI and GRAVY score were 41.1 kDa, 6.2 and -0.51 respectively and the average numbers of contiguous hydrophobic amino acids, low complexity and coiled-coil regions were 5, 2.43 and 0.21 respectively. There were representatives of nuclear, cytoplasmic and extra-cellular proteins in all three groupings.

Expression of a test set of 95 mammalian proteins

A diverse set of proteins was chosen to test the conclusions of this study (Table 4). They range from proteins that are well annotated, some of which have been expressed in E. coli previously (Nfkb1), to those that contain no PfamA domains and have not been expressed in E. coli previously (Maat1, BC031407, Ttyhl, 1500001H12RIKEXT2, Ext2, KIAA1136, G2 and KIAA1549). They included 24 proteins of unknown function not annotated as containing PfamA domains. All cDNAs were amplified from a primary cDNA library, cloned into pDONR221 and sequence confirmed prior to transfer to pDEST-N112-MBP (Figure 1) for expression as N-terminal H10-MBP fusions. In some cases primers were designed to clone protein fragments in order to express particular PfamA domains or to minimise the molecular weight or the numbers of low complexity (LC) regions or contiguous hydrophobic amino acids (hp_aa). For proteins with no PfamA annotations, such as BC031407, SMART sequence analysis [44] was performed to identify the low complexity regions of the protein and truncations were designed accordingly. Protein expression and quantitation of intact soluble fusion protein product were performed as for the N- and C-terminal tag comparison study. The total and soluble expression levels (mg of protein per litre culture) are listed in the last column of Table 4 together with selected protein features. 63 of the 95 proteins yielded soluble expression levels of greater than 1 mg/l, and the average molecular weight, number of LC regions and hp_aa for these proteins were 24.4 kDa, 0.9 and 3.7 respectively. For the 32 proteins that failed to give soluble product of the correct size, the average molecular weight, number of LC regions and hp_aa were 37.1 kDa, 1.8 and 4.5 respectively.

Table 4 Expression levels of mammalian proteins expressed with N-terminal H10-MBP fusions, with selected protein features

Discussion

Correlation between protein properties and solubility

To guide future expression strategies for new proteins, particularly regarding the choice of expressing a full-length protein in a bacterial or eukaryotic system and also where to truncate multi-domain containing proteins, it is interesting to investigate if the proteins expressed in a soluble form in this study share any common properties. Recently Goh et al. [45] used data generated by a structural genomics consortium to examine the ability of proteins to progress from cloning to expression and purification to crystallisation. The data used was very large, consisting of 27,000 targets from over 120 organisms and a number of important features were inferred that correlated with success including percentage composition of charged residues, occurrence of hydrophobic patches and length. Although a large study, there was a problem with interpretation of all the data-sets as it was unclear whether targets were simply waiting in the pipeline or had failed. Also structural genomics targets are often initially biased in favour of easy to express proteins, not representative of the whole proteomes of these organisms.

The present study, focused on mammalian proteins from several diverse families, examined the relationship between successful soluble expression with various protein properties. Several protein features were identified in this study to correlate with soluble expression, which had not previously been shown experimentally. For both the N and C-terminal tag expression studies it was observed that the presence of several features did not correlate with successful expression including protein pI, grand average of hydropathicity index (GRAVY) [43], sub-cellular location, the cysteine content as a percentage of the total number of amino acids and the number of coiled-coils. Protein pI has been linked to sub-cellular location [46] with a bimodal distribution observed in bacterial and archaeal genomes and trimodal pattern in eukaryotes. Proteins are thought to be less soluble at a pH environment near their pI. GRAVY simply calculates overall hydrophobicity of the linear polypeptide sequence with increasing positive score indicating greater hydrophobicity, but no account is taken of the way the protein folds in three dimensions or the percentage of residues buried in the hydrophobic core of the protein. In a recent study Luan et al. [47] tested the soluble expression of 10,167 full-length C. elegans ORFs and found that protein hydrophobicity was an important factor for an ORF to yield a soluble expression product. This different result may be attributable to the fact that the C. elegans study included a greater proportion of membrane proteins. Therefore the lack of correlation between GRAVY score and soluble expression we observed may be true for non-membrane proteins or for proteins where the trans-membrane domain has been deleted.
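The GRAVY calculation described above is simply the mean of per-residue hydropathy values over the sequence. The Python sketch below illustrates this, assuming the standard Kyte-Doolittle hydropathy scale; the function name and the example sequences in the usage note are ours and are not taken from this study.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
# (assumed here to be the scale underlying the GRAVY index).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean hydropathy of a protein sequence.

    Positive values indicate a more hydrophobic sequence overall.
    Non-standard residue codes raise a KeyError.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

For example, gravy("KKDE") is negative (a charged, hydrophilic peptide) while gravy("AILV") is positive. As noted in the text, the score uses only the linear sequence and says nothing about which residues end up buried in the folded protein.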

There was a strong correlation between successful soluble expression and the molecular weight of the protein. Small proteins with an average molecular weight of 22.8 kDa did not need to be fused with solubility enhancing proteins for soluble expression, whereas proteins that needed to be fused with N-terminal MBP or Trx for soluble expression had an average molecular weight of 40.4 kDa, and those where the addition of a N-terminal fusion could not rescue soluble expression had an average size of 51.4 kDa. The same pattern also emerged in the C-terminal fusion study. The decreasing probability of successful soluble expression of mammalian proteins with increasing molecular weight is likely due to increasing protein complexity, perhaps requiring specialised eukaryotic chaperones for folding or stabilising binding partners. The majority of proteins solubly expressed in this study contained single domains and, as fusion proteins, were either capable of self-folding or were folded with the aid of prokaryotic chaperones. Braun et al. found a similar relationship with their set of 32 human proteins with 4 different N-terminal fusions [30].

An inverse correlation was observed in this study between the number of contiguous hydrophobic amino acids (hp_aa; AILFWV) and soluble expression. This ranged from an average of 3.8 hp_aa for those proteins not requiring a N-terminal fusion for high level soluble expression to 5 hp_aa for proteins requiring a N-terminal fusion for successful expression and 5.6 hp_aa where expression failed under the conditions described here. This pattern was repeated in the C-terminal fusion study, where proteins that expressed well had an average of 3.1 hp_aa whereas proteins that expressed poorly had an average of 5 hp_aa. In a study of the sequences of 2753 non-membrane proteins it was found that sequences of three or more consecutive hydrophobic residues are suppressed in globular proteins [48]. Low complexity regions are regions of a protein with biased composition, containing a small number of different amino acids [33], and can have a disordered structure important for protein function [49]. Here we found that the greater the number of low complexity regions contained within the target protein, the less likely it was that soluble expression would be achieved. This was true for both the N- and C-terminal fusion protein studies, with 0.6 – 1 low complexity regions for proteins easy to express in a soluble form and 2.4 – 3 low complexity regions for proteins difficult to express. Low complexity regions are less common in bacterial proteins and these may be targets for proteolytic degradation in vivo.
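One way to compute the hp_aa feature is sketched below in Python. It assumes that hp_aa denotes the length of the longest uninterrupted run of residues drawn from the set AILFWV; that reading, the function name and the example sequence are our assumptions rather than a definition given in the text.

```python
from itertools import groupby

# Hydrophobic residue set quoted in the text.
HYDROPHOBIC = set("AILFWV")


def longest_hydrophobic_run(sequence: str) -> int:
    """Return the length of the longest contiguous run of AILFWV residues."""
    seq = sequence.upper()
    # groupby splits the sequence into alternating hydrophobic and
    # non-hydrophobic stretches; keep the lengths of the hydrophobic ones.
    run_lengths = (
        sum(1 for _ in group)
        for is_hydrophobic, group in groupby(seq, key=lambda aa: aa in HYDROPHOBIC)
        if is_hydrophobic
    )
    return max(run_lengths, default=0)
```

For the hypothetical peptide "MKAILFWVDE" the function returns 6, since the stretch AILFWV is flanked by residues outside the hydrophobic set. Counting low complexity regions would need a dedicated sequence analysis tool and is not attempted here.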

Some interesting conclusions were drawn when soluble expression was measured for an additional set of 95 mammalian proteins expressed as H10-MBP fusions (Table 4). In several cases (ELF1, Fli1, Ldb1, BC031407, Nfkb1 and RelA-p65) truncating the proteins to minimise the molecular weight and the numbers of low complexity regions and contiguous hydrophobic amino acids made the difference between failed expression and good soluble protein expression. For proteins such as BC031407, with no annotated PfamA domains, it was found that truncating at low complexity regions was a good method to identify a fragment that could express in a soluble form of the correct size (protein 81). Although we found that successful soluble expression of the 95 protein set correlated with lower molecular weight, number of low complexity regions and contiguous hydrophobic amino acids compared with proteins that failed to express solubly with the correct size, validating our earlier conclusions, there were some exceptions. For example Elf1 and Gata1 both expressed well despite having 4 and 6 low complexity regions respectively and molecular weights of 66 and 42.5 kDa, whereas some smaller proteins such as the PDZ domains of Dlgh3 and Grip1 failed to express. It may be that there are additional protein features, such as the ability to form a stabilising interaction with a binding partner, that are also important for soluble expression. Also ensuring correct protein domain boundaries may be important since the annotated Pfam domain boundaries, based on sequence alignment, do not always match the structural or folding domain boundaries.

Protein fusions that enhance protein solubility

There have been three recent comparative studies in which sets of proteins were cloned into several expression vectors and the effects of the fusion partner on total and soluble expression yield were examined. Hammarstrom et al. [29] cloned 27 human proteins (MW < 20 kDa) into various expression vectors and ranked the tags' ability to promote soluble expression as Trx ~ MBP ~ Gb1 > ZZ > NusA > GST > His6. Another study, comparing the expression of 32 human proteins with molecular weights from 17 – 110 kDa, ranked tags in terms of increased expression and yield after purification as GST ~ MBP > CBP > His6 [30]. Here GST was preferred because of the weak affinity between MBP and amylose resin. In a third study of 40 different proteins (10 mammalian, 3 plant and 2 insect) with 8 different tags, MBP gave the best overall results in terms of total and soluble expression [31]. However, these studies used different combinations of promoter and fusion partner, so it was unclear whether the observed effect was purely due to expression with the fusion partner or to variable rates of transcript synthesis that would also affect translation rates.

In this study it was found that, on average, N-terminal fusion partners are preferable for optimal protein expression. When proteins are expressed with their native N-terminus, as in our C-terminal fusion proteins, total expression levels can be more variable than when expressed with a constant N-terminal tag. This may be because of variable RNA secondary structures in the region around the start codon which could interfere with ribosome binding. An additional explanation is that during translation the target protein emerges from the ribosome first and initiates an incorrect, irreversible folding pathway before the soluble fusion partner has been translated and folded. The mis-folded protein would then be targeted for degradation by host proteases, resulting in lower total expression levels. This scenario is more likely when expressing mammalian proteins in a bacterial system which lacks specific eukaryotic chaperone proteins. It has been shown previously that proteins prone to mis-folding and aggregation can arrest GFP folding when GFP is fused at the C-terminus [17]. However, when the soluble protein is fused at the N-terminus, it would be translated first and perhaps increase the solubility of the folding intermediates of the downstream protein domain, increasing their half lives prior to irreversible aggregation. This would allow greater reversibility in the individual steps along the folding pathway and increase the probability that the protein would eventually reach the lowest free energy native conformation.

It was found that Trx and MBP were the best N-terminal protein fusions to promote protein solubility. The best C-terminal fusion to promote protein solubility was MBP, and this may be acting as a true intra-molecular chaperone [50], able to promote folding of the N-terminal protein fusion. The mechanism could be due to direct binding to folding intermediates [51], allowing stabilisation prior to correct folding and inhibition of aggregate formation. The observation that MBP was effective at enhancing soluble expression when fused at the C-terminus, in contrast to thioredoxin, suggests that MBP can actually reverse the process of incorrect folding that would have started prior to the translation of the downstream MBP. This property was not observed for thioredoxin when fused to the C-terminus, suggesting either that, in three dimensions, different proximal faces of the fusion partners have different solubility enhancing properties or that thioredoxin does not possess any chaperone properties and acts only as a solubility enhancer. Alternatively, the folding of thioredoxin may be more prone to inhibition than that of MBP. There are also examples where MBP fusions can form soluble inclusion bodies [52, 53], and this cannot be ruled out as a possibility here, although there are also several examples where MBP fusion proteins are fully functionally active [50, 52, 54, 55].

It must be stressed here that although protein solubility is a useful indicator of correct folding, additional measurements need to be performed to give supporting evidence for correct folding. These may include removing the protein fusion with a protease and analysis of the cleaved protein of interest by a variety of biophysical and functional assays such as analysis of monodispersity by light scattering [52], NMR [56, 57], CD spectropolarimetry, bis-ANS binding [53], ligand binding or enzymatic activity. In this study a protease cleavage site was not included in the vector constructs because the main use of the proteins generated in our laboratory will be in high-throughput antibody production where the cleavage of the fusion partner is unnecessary.

GFP did not significantly enhance soluble protein expression when fused to the C-terminus of the proteins in this study, supporting the use of this tag as an indicator of soluble protein expression of fused ORFs [17, 41]. The observation that the V5-His6 tag resulted in a higher average soluble expression level than the His10 tag (1.7 compared with 0.3 mg/l) indicates that the identity of the peptide tag can also affect overall solubility of expressed proteins.

Conclusions

What guidelines have emerged from this study in developing a strategy for the production of soluble mammalian proteins in E. coli? If the protein has a molecular weight of less than 30 KDa and contains one or fewer low complexity regions and fewer than 4 contiguous hydrophobic amino acids, expression of the full-length protein in E. coli should give good levels of soluble protein. As a generic strategy we would recommend expressing the protein with a fusion partner: MBP and Trx were the best fusions to enhance protein solubility as N-terminal tags, with MBP being superior as a C-terminal fusion. C-terminal fusions are desirable for proteins such as the P450s where N-terminal tags can inhibit functional activity. When fused to an optimal fusion partner, nuclear, cytoplasmic and extra-cellular domains were equally likely to be expressed solubly. For larger proteins over 50 KDa, truncations should be considered to express specific protein domains and to minimise the molecular weight, number of low complexity regions and contiguous hydrophobic amino acids. In conclusion, this study will help enable a systematic expansion in the number of mammalian proteins and domains that can be successfully expressed in E. coli as soluble product, and also help predict which are best targeted for a eukaryotic expression system.
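
The guidelines above can be written down as a small decision helper. The following is only a sketch under stated assumptions: the function names, the returned dictionary keys and, in particular, the set of residues counted as hydrophobic are illustrative choices that are not defined in this section, and the number of low complexity regions is taken as an input rather than computed.

```python
# Assumption: this hydrophobic residue set is illustrative only; the
# classification used in the study is not given in this section.
HYDROPHOBIC = set("AVILMFWC")


def longest_hydrophobic_run(sequence: str) -> int:
    """Length of the longest stretch of contiguous hydrophobic residues."""
    longest = current = 0
    for residue in sequence.upper():
        current = current + 1 if residue in HYDROPHOBIC else 0
        longest = max(longest, current)
    return longest


def suggest_strategy(mw_kda: float, low_complexity_regions: int,
                     hydrophobic_run: int, tag_end: str = "N") -> dict:
    """Apply the thresholds stated in the Conclusions (hypothetical helper)."""
    if tag_end not in ("N", "C"):
        raise ValueError("tag_end must be 'N' or 'C'")
    # MBP and Trx were best as N-terminal tags; MBP was best at the C-terminus.
    partners = ["MBP", "Trx"] if tag_end == "N" else ["MBP"]
    # < 30 KDa, one or fewer low complexity regions, fewer than 4 contiguous
    # hydrophobic amino acids: full-length expression should work well.
    full_length_ok = (mw_kda < 30
                      and low_complexity_regions <= 1
                      and hydrophobic_run < 4)
    return {
        "full_length_likely_soluble": full_length_ok,
        "consider_truncation": mw_kda > 50,  # larger proteins over 50 KDa
        "fusion_partners": partners,
    }


# Hypothetical 25 KDa protein with one low complexity region.
example = suggest_strategy(25, 1, longest_hydrophobic_run("MKDAVIEDKLSE"))
```

Proteins between 30 and 50 KDa fall outside both explicit thresholds, so the sketch simply reports both flags as false for them and leaves the decision to the user.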

Methods

Materials

Oligonucleotides were synthesised by Qiagen-Operon (Cologne, Germany) or Sigma-Genosys (Haverhill, UK). All restriction enzymes were from New England Biolabs (Hitchin, UK). The vectors pET-DEST42, pDEST17 and pDONR201 and E. coli DB3.1 and BL21(DE3)Star pLysS, Gateway BP and LR clonase enzyme mix, pre-cast 4–12 % NuPAGE Bis-Tris gels and PVDF membranes (0.45 μm pore size) were all from Invitrogen (Paisley, UK). Entry plasmids in either open (minus stop codon) or closed (plus stop codon) format containing the full-length genes for GRB2, HRAS, JUN, FOS, MAD, MAX, CDK2, CDK4, CDKN1B, CASP2, MMP1, CDKN2A and CD44 were provided by Pascal Braun and Josh LaBaer (Harvard Institute of Proteomics, Cambridge, USA). A clone containing the full-length human EGFR ORF was provided by the RIKEN BioResource Center (Tsukuba, Japan) and Efna1 was from the Mammalian Gene Collection (MGC) archived at the Wellcome Trust Sanger Institute (Hinxton, UK). First strand synthesis human and mouse cDNA was from BD Biosciences (Oxford, UK). Plasmid, gel extraction and PCR purification kits and 6xHis protein ladder were purchased from Qiagen (Crawley, UK). The expression strain BL21(DE3), BugBuster protein extraction reagent and His tag monoclonal antibody were from Merck Biosciences (Nottingham, UK). The 96-well multiscreen-DV durapore filter plate with 0.65 μm pore size was from Millipore (Watford, UK) and Cy5-labelled goat anti-mouse IgG was from Amersham Biosciences (Little Chalfont, UK). Europium labelled antibodies and DELFIA reagents were from Perkin Elmer (Beaconsfield, UK) and all other chemicals unless otherwise stated were from Sigma-Aldrich (Gillingham, UK).

N-Terminal fusion GATEWAY destination vector construction

To prepare pET-DEST42-MCS, a multi-cloning site was inserted into pET-DEST42 (Invitrogen) at nt396, between the Shine-Dalgarno sequence and the attR1 recombination site, encoding the recognition sequences for NdeI, KpnI, DraIII and BfrBI. Inverse or whole plasmid PCR was performed on pET-DEST42 with the 5'-phosphorylated PAGE purified primer pair 20 (5' TACCCACGAAGTGATGCATACAAGTTTGTACAAAAAAGCTGAACG 3') and 21 (5' CCCATATGTATATCTCCTTCTTAAAGTTAAACAAAATTATTTCTAGAG 3') in a 20 μl reaction containing 10 ng pET-DEST42, 0.3 μM primers 20 and 21, 20 mM Tris-HCl (pH 7.5), 0.5 mM DTT, 200 μM each of dATP, dCTP, dGTP and dTTP, 1 mM MgSO4, and 0.5 unit KOD hot start DNA polymerase (Novagen). PCR cycling conditions were: 94°C – 2 mins followed by 15 cycles of 94°C – 15 s, 59°C – 30 s, 68°C – 9 mins. The 7468 bp PCR product was purified using a PCR purification spin column (Qiagen), eluted with 30 μl of 10 mM Tris-HCl (pH 8.5), digested with 20 units of DpnI enzyme at 37°C for 4 hrs to remove methylated plasmid DNA, and purified by spin column. An intramolecular ligation reaction was then performed using 16 ng of linear PCR product, 5 units T4 DNA ligase and the buffers from the rapid ligation kit (Roche). The ligated PCR product was used to transform E. coli DB3.1 and the resultant pET-DEST42-MCS plasmid DNA was prepared and sequence confirmed. Insert 1, encoding a decahistidine tag with a 5'-NdeI site and blunt 3' end, was prepared by PCR with primer pair 22 (5' GGAATTCCATATGAATCAC 3') and 24 (5' pGTGATGGTGATGGTGATGGTGATGGTGATTCATATGGAATTCC) and insert 2, encoding a decahistidine tag flanked by a 5'-NdeI site and 3'-KpnI site, was prepared with primer pair 22 and 26 (5' CGGGGTACCATGGTGATGGTGATGGTGATGGTGATGGTGATTCATATGGAATTCC 3'). PCR reactions were as above except that the annealing temperature was lowered to 44°C, the extension time was reduced to 10 s and 12 cycles were used. Insert size was checked by 10 % TBE-PAGE and inserts were purified with a nucleotide removal kit (Qiagen).
Expression vectors (b) pDEST-N110 and pDEST-N112 (Figure 1) were prepared by digestion of inserts 1 and 2 with NdeI only or NdeI and KpnI combined respectively, purified by spin column and ligated in a 1:1 ratio to NdeI, BfrBI or NdeI, KpnI digested pET-DEST42-MCS respectively prior to transformation of E. coli DB3.1. Inserts encoding MBP, GFP, GST or Trx flanked by a 5' DraIII site and a 3' blunt end were generated by PCR amplification from the plasmids pMALc2 (New England Biolabs), pET41a or pET32 (Novagen) respectively. The primer pairs for MBP were 78 (5' TTATTACACGAAGTGAAAATCGAAGAAGGTAAACTGGTAATC 3') and 79 (5' pGTTCGAGCTCGAATTAGTCTGCGCGTCTTTC), for GFP 84 (5' TTATTACACGAAGTGGCTAGCAAAGGAGAAGAACTTTTCACTGGAG 3') and 85 (5' pTTTGTAGAGCTCATCCATGCCATGTGTAATC 3'), for GST 86 (5' TTATTACACGAAGTGTCCCCTATACTAGGTTATTGGAAAATTAAGGG 3') and 87 (5' pATCCGATTTTGGAGGATGGTCGCCACC 3') and for Trx 88 (5' TTATTACACGAAGTGAGCGATAAAATTATTCACCTGACTGAC 3') and 89 (5' pCAGGTTAGCGTCGAGGAACTCTTTC 3'). The inserts were digested with DraIII and ligated with DraIII, BfrBI cut pDEST-N112 vector to create the GATEWAY destination vectors pDEST-112-MBP, pDEST-112-GFP, pDEST-112-GST, pDEST-112-Trx.

C-Terminal fusion GATEWAY destination vector construction

pDEST-C101 was designed to insert a decahistidine-encoding sequence between the attR2 recombination site and T7 transcription termination region. pDEST-C102 is as C101 except that a DraIII, BfrBI site was inserted downstream of the attR2 recombination site. Inverse PCR was performed as described above with primer pair 1 (5' pCACCATCACCATCATCACCATCACCATTGAGTTTGATCCGGC) and 2 (5' pATGCACCACTTTGTACAAGAAAGCTGAAC) to generate pDEST-C101 and primer pair 1 and 3 (5' pATGCATACCACTCACTTCGTGCACCACTTTGTACAAGAAAGCTGAAC) to prepare pDEST-C102. Murine and human dihydrofolate reductase (Dhfr and DHFR respectively) inserts flanked by a 5' DraIII site and a blunt 3' end were amplified from MGC clones using the primer pairs 82 (5' TTATTACACGAAGTGCGACCATTGAACTGCATCGTCGCCGTG) and 83 (5' pGTCTTTCTTCTCGTAGACTTCAAACTTATAC 3') for Dhfr and 80 (5' TTATTACACGAAGTGGGTTCGCTAAACTGCATCGTCGCTGTG) and 81 (5' pATCATTCTTCTCATATACTTCAAATTTG) for DHFR. The DraIII digested inserts were ligated with DraIII, BfrBI digested pDEST-C102 vector to create pDEST-C102-MBP, GFP, GST, Trx, Dhfr and DHFR as shown in Figure 1.

cDNA isolation and expression clone generation

A nested PCR strategy, adapted for GATEWAY cloning from the method described by J. E. Collins et al. [34], was used to isolate protein-encoding ORFs directly from cDNA. Briefly, two sets of primer pairs were designed: a first pair of optimised primers binding 1 – 200 bp 5' and 3' of the ORF, designed using DS-Gene software (Accelrys), and a second pair targeted to the beginning and end of the ORF. All primers were designed with melting temperatures around 60°C. PCR 1 contained 50 pg of either human universal QUICK-clone II cDNA (Clontech) or 50 pg of a mixture of mouse brain, heart, kidney, liver, smooth muscle, spleen, testis and 7, 11, 15 and 17-day embryo QUICK-clone cDNA (Clontech), 0.25 μM primers, 20 mM Tris-HCl (pH 7.5), 0.5 mM DTT, 200 μM each of dATP, dCTP, dGTP and dTTP, 1 mM MgSO4, and 0.5 unit KOD hot start DNA polymerase (Novagen) in a total volume of 20 μl. The PCR reaction consisted of 94°C – 2 mins, and 30 cycles of 94°C – 15 s, 55°C – 30 s, 68°C – 2.5 mins followed by 68°C – 5 mins. A 50-fold dilution of the PCR 1 reaction was made for the second 30 cycle PCR containing the ORF specific primers. Linkers were added to these primers encoding half the attB1 and attB2 sites for forward and reverse primers respectively. For entry clones destined for N-terminal tag expression vectors the 5'-linkers for the forward and reverse primers were 5' AAAAAGCAGGCTCT 3' and 5' AGAAAGCTGGGTTCTA 3' respectively, with the reverse primer adding a stop codon. For inserts destined for the C-terminal tag expression vectors the 5'-linkers for the forward and reverse primers were 5' AAAAAGCAGGCTTCGAAGGAGATAGAACCATGG 3' and 5' AGAAAGCTGGGTT 3' respectively, with the forward primer encoding the Shine-Dalgarno and Kozak sequences and start codon. PCR 2 products were analysed by 1 % TBE-agarose electrophoresis [58] and correct size fragments were then subjected to an adapter PCR step to complete the flanking attB1 and attB2 sites.
This consisted of a PCR reaction as described above using 1 μl of a 50-fold dilution of the PCR 2 reaction in a total volume of 20 μl and primer pair 113 (5' GGGGACAAGTTTGTACAAAAAAGCAGGCT 3') and 114 (5' GGGGACCACTTTGTACAAGAAAGCTGGGT 3') except that the annealing temperature was 45°C, only 12 cycles were used and extension time was 2 mins. The products of the adapter PCR were purified by a 96-well PCR clean-up kit (Qiagen), eluted in 100 μl 10 mM Tris-HCl (pH 8.5) and had an average concentration of 40 ng/μl. Recombinational cloning of attB flanked PCR products with an attP containing pDONR vector to generate a set of entry plasmids was as described previously [35] except that pDONR221 (Invitrogen) was used. The ORFs within sequence confirmed attL containing entry plasmids were then recombined with the various attR destination vectors described above to generate sets of expression plasmids. The LR recombination reactions [35] were used to transform E. coli DH5α cells, miniprep plasmid DNA was prepared and this was used to transform the various BL21(DE3) expression strains used in this study.
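
The relationship between the half-site linkers on the PCR 2 primers and the adapter primers 113 and 114 can be checked with a short script. This is a minimal sketch: the linker and adapter sequences are those quoted above, whereas the function names and the gene-specific primer sequences in the example are hypothetical placeholders, not primers used in this study.

```python
# Adapter primers quoted in the text (they complete the attB1 and attB2 sites).
ADAPTER_ATTB1 = "GGGGACAAGTTTGTACAAAAAAGCAGGCT"  # primer 113
ADAPTER_ATTB2 = "GGGGACCACTTTGTACAAGAAAGCTGGGT"  # primer 114

# 5'-linkers quoted in the text, as (forward, reverse) for each tag position.
LINKERS = {
    "N": ("AAAAAGCAGGCTCT", "AGAAAGCTGGGTTCTA"),  # reverse adds a stop codon
    "C": ("AAAAAGCAGGCTTCGAAGGAGATAGAACCATGG", "AGAAAGCTGGGTT"),
}


def add_linkers(forward_gene_specific: str, reverse_gene_specific: str,
                tag_end: str) -> tuple:
    """Prepend the half-attB linkers to gene-specific PCR 2 primers."""
    forward_linker, reverse_linker = LINKERS[tag_end]
    return (forward_linker + forward_gene_specific.upper(),
            reverse_linker + reverse_gene_specific.upper())


def three_prime_overlap(adapter: str, primer: str) -> int:
    """Longest adapter 3' suffix that is identical to the primer 5' prefix."""
    for k in range(min(len(adapter), len(primer)), 0, -1):
        if adapter[-k:] == primer[:k]:
            return k
    return 0


# Hypothetical gene-specific sequences, for illustration only.
forward, reverse = add_linkers("GAAGCTGCTAAAGGT", "TTCAGCACCGTCAAT", "N")
print(three_prime_overlap(ADAPTER_ATTB1, forward))
print(three_prime_overlap(ADAPTER_ATTB2, reverse))
```

The overlap reported for each primer is the stretch over which the adapter primer can anneal to the PCR 2 product, which is consistent with the lower annealing temperature used for the adapter PCR step.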

Expression screening and quantitation

All BL21(DE3) transformants were selected and propagated in the presence of 100 μg/ml ampicillin. A single antibiotic resistant colony was used to inoculate 0.5 ml 2xYT media in a 96-deep well block containing the appropriate antibiotics and shaken at 210 rpm at 37°C. When the average OD600 had reached 1 (3 hrs for BL21(DE3)), 60 μl was transferred to 1.2 ml 2xYT media in a 96-deep-well block containing the appropriate antibiotics and placed on a shaking incubator at 37°C. When the OD600 reached 0.5 (2 hrs for BL21(DE3)), IPTG was added to a final concentration of 1 mM and shaking continued at 25°C for 12 hours. Total protein was analysed by transferring a 20 μl aliquot of the induced culture to a 96-well PCR plate containing 20 μl of 2 × NuPage LDS loading buffer (Invitrogen), 0.1 M DTT, which was heated to 95°C for 10 mins and cooled on ice prior to loading 10 μl on a 17-well 4–12 % NuPAGE Bis-Tris gel with a multi-channel gel loading syringe (Hamilton). Soluble protein was extracted by transferring 290 μl of induced culture to a shallow well plate and centrifuging at 3000 g for 5 mins; the supernatant was removed and the cells were resuspended in 58 μl BugBuster containing 1.4 units of benzonase and 58 units of recombinant lysozyme (Novagen). For the C-terminal tag and expression strain comparison this buffer was also supplemented with 0.58 μl protease inhibitor cocktail set III 10-fold diluted in DMSO (Novagen). The cell-pellets were resuspended with a multi-channel pipette and incubated with slow shaking for 20 mins at room temperature prior to transfer to 96-well multiscreen-DV durapore filter plates with 0.65 μm pore size (Millipore). The filter plate was placed on top of a shallow 96-well plate and centrifuged at 1000 g for 2 mins. 4 μl of the filtrate was then added to a 96-well plate containing 5 μl of 4 × NuPage LDS loading buffer (Invitrogen) and 11 μl of 182 mM DTT, and the plate was heated at 95°C for 5 mins and loaded onto a 17-well 4–12 % NuPAGE Bis-Tris gel.
A His-tagged molecular weight ladder (Qiagen) was also loaded onto each gel. Gel electrophoresis and electro-transfer to PVDF membrane was as described [58]. Blots were blocked with 3 % Marvel milk powder in PBS-Tween (PBS with 0.1% Tween) for either 1 hour or overnight at room temperature, washed with PBS-Tween and incubated with 40 ng/ml anti-His5 tag monoclonal antibody (Novagen) in 3 % Marvel, PBS-Tween for 1 hr, washed 3 × with PBS-Tween, incubated with 1 μg/ml Cy5 labelled goat anti-mouse in 3% Marvel, PBS-Tween for 1 hr, washed 3 × with PBS-Tween and 2 × with PBS, and the blots were dried at 37°C for 10 mins between blotting paper. The blots were scanned on a Typhoon 8600 variable mode imager (Amersham) with fluorescence scan mode, 633 nm excitation laser, 670 nm emission filter, 600 V PMT and 200 μm / pixel scan resolution. The integrated fluorescence intensity volumes of bands on the blot were quantitated using ImageQuant TL software (Amersham). Conversions to protein yield were made by using a calibration curve of purified His-tagged single chain antibody (scFv). Differences between the molecular weight (MW) of the scFv (31 KDa) and each expressed fusion protein were taken into account by multiplying each protein quantitation by the ratio MW construct (KDa) / 31. The numbers were normalised to eliminate blot to blot variation using a His-tagged molecular weight ladder (Qiagen).
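
The conversion from band intensity to protein yield described above can be sketched in a few lines. Several details here are assumptions made for illustration rather than a description of the procedure actually used: the calibration curve is taken to be a straight line fitted by least squares, the ladder normalisation is taken to be a simple rescaling of intensities to a reference ladder signal, the normalisation is applied before the calibration, and all function names and numerical values are hypothetical. Only the 31 KDa scFv molecular weight and the MW ratio correction come from the text.

```python
SCFV_MW_KDA = 31.0  # molecular weight of the scFv calibration standard


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (assumed linear curve)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantify_band(band_intensity, ladder_intensity, reference_ladder_intensity,
                  slope, intercept, construct_mw_kda):
    """Convert one band intensity to a yield of the expressed construct."""
    # Assumed normalisation: rescale so that the ladder signal on this blot
    # matches the ladder signal on a reference blot.
    normalised = band_intensity * reference_ladder_intensity / ladder_intensity
    # scFv-equivalent amount read from the calibration line.
    scfv_equivalent = slope * normalised + intercept
    # MW correction from the text: multiply by MW construct (KDa) / 31.
    return scfv_equivalent * construct_mw_kda / SCFV_MW_KDA


# Hypothetical calibration data: scFv band intensities and known amounts.
slope, intercept = fit_line([1000.0, 2000.0, 4000.0], [1.0, 2.0, 4.0])
estimate = quantify_band(3000.0, 1500.0, 1000.0, slope, intercept, 62.0)
```

If the calibration line has a non-zero intercept, the order in which the normalisation and the calibration are applied changes the result, so the order chosen here should be treated as one possible reading of the description.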