Abstract
Previous work has shown that protein sequences containing a quasi-repetitive assortment of amino acids are common in genomes and in databases such as Swiss-Prot but are under-represented in the structure-based Protein Data Bank (PDB). For several years, structural genomics groups have used the absence of these “low-complexity” sequences as a criterion for selecting proteins that have a good chance of successful structure determination. In this study, we examine the data deposited in the PDB, as well as the available data from structural genomics groups in TargetDB and PepcDB, to identify trends that should be taken into consideration when low-complexity sequences are used as part of the target selection process.
Abbreviations
- CESG: Center for Eukaryotic Structural Genomics
- PDB: Protein Data Bank
- NMR: Nuclear magnetic resonance
- HSQC: Heteronuclear single quantum coherence
Acknowledgments
The authors thank Dmitry A. Kondrashov and John L. Markley for helpful comments regarding this paper. The authors also thank Sarah C. Cunningham for assistance with the statistical tests. R.M.B. was supported by NLM training grant T15LM007359 and DOE training grant DE-FG2-04ER25627. C.A.B. and G.N.P. were supported by the Center for Eukaryotic Structural Genomics NIH/NIGMS grant numbers U54 GM074901-01 and P50 GM064598.
Cite this article
Bannen, R.M., Bingman, C.A. & Phillips, G.N. Effect of low-complexity regions on protein structure determination. J Struct Funct Genomics 8, 217–226 (2007). https://doi.org/10.1007/s10969-008-9039-6
DOI: https://doi.org/10.1007/s10969-008-9039-6