N- and C-Terminal Truncations to Enhance Protein Solubility and Crystallization: Predicting Protein Domain Boundaries with Bioinformatics Tools

  • Christopher D. O. Cooper
  • Brian D. Marsden
Part of the Methods in Molecular Biology book series (MIMB, volume 1586)


Soluble protein expression is a key requirement for biochemical and structural biology approaches to studying biological systems in vitro. Sufficient quantities cannot always be produced when a protein is poorly soluble, a property frequently determined by physico-chemical parameters such as intrinsic disorder. Discrete protein domains often have a greater likelihood of high-level soluble expression and crystallizability, but determining the boundaries of such domains can be challenging for novel proteins. Here, we outline the application of bioinformatics tools to predict potential protein domain boundaries, which can then be used to design expression construct boundaries for parallelized screening in a range of heterologous expression systems.
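As an illustration of how predicted boundaries can feed into parallelized construct design, the following minimal Python sketch pairs every candidate N-terminal start with every candidate C-terminal end. It is not the authors' protocol: the function name `design_constructs`, the `min_length` cutoff, the toy sequence, and the boundary positions are all hypothetical and chosen only for demonstration.

```python
from itertools import product


def design_constructs(sequence, n_starts, c_ends, min_length=30):
    """Enumerate truncation constructs as (start, end, subsequence) tuples.

    Positions are 1-based and inclusive, following usual residue numbering.
    The min_length cutoff is an arbitrary illustrative assumption.
    """
    constructs = []
    for start, end in product(sorted(n_starts), sorted(c_ends)):
        # Skip pairs that fall outside the sequence or give a very short fragment.
        if start < 1 or end > len(sequence) or end - start + 1 < min_length:
            continue
        constructs.append((start, end, sequence[start - 1:end]))
    return constructs


# Hypothetical 60-residue sequence and boundary candidates, for illustration only.
toy_sequence = "M" + "ACDEFGHIKLNPQRSTVWY" * 3 + "AC"
for start, end, subseq in design_constructs(toy_sequence, [1, 5, 10], [50, 55, 60]):
    print(f"{start}-{end}\t{len(subseq)} aa")
```

In practice the start and end candidates would come from inspecting domain, secondary structure, and disorder predictions, and each resulting construct would be screened experimentally for soluble expression.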

Key words

Bioinformatics; Protein expression; Protein solubility; Protein structure; Domain; BLAST; PSIPRED; Hidden Markov Model (HMM); Alignment; Secondary structure



The SGC is a registered charity (number 1097737) that receives funds from AbbVie, Bayer Pharma AG, Boehringer Ingelheim, Canada Foundation for Innovation, Eshelman Institute for Innovation, Genome Canada, Innovative Medicines Initiative (EU/EFPIA) [ULTRA-DD grant no. 115766], Janssen, Merck & Co., Novartis Pharma AG, Ontario Ministry of Economic Development and Innovation, Pfizer, São Paulo Research Foundation-FAPESP, Takeda, and Wellcome Trust [106169/ZZ14/Z]. C.D.O.C. thanks the University of Huddersfield for support.



Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  • Christopher D. O. Cooper (1)
  • Brian D. Marsden (2, 3)
  1. Department of Biological Sciences, School of Applied Sciences, University of Huddersfield, Huddersfield, UK
  2. Structural Genomics Consortium, Nuffield Department of Medicine, University of Oxford, Oxford, UK
  3. Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Kennedy Institute of Rheumatology, University of Oxford, Oxford, UK
