Abstract
Correctly predicting the structure of protein-coding genes in higher eukaryotes is difficult; as a result, public sequence databases that incorporate predicted sequences are increasingly contaminated with erroneous sequences. This high rate of misprediction has serious consequences, since it significantly affects the conclusions that can be drawn from genome-scale sequence analyses.
Here we describe the MisPred and FixPred approaches, which may help to identify and correct erroneous sequences. The rationale of these approaches is that a protein sequence is likely to be erroneous if some of its features conflict with our current knowledge about proteins.
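To make the rationale concrete, the following is a minimal sketch of a rule-based conflict check, not the MisPred implementation itself. The class `ProteinFeatures`, the function `find_conflicts`, and both rules are hypothetical illustrations chosen for this example; they assume that feature annotations (signal peptide, transmembrane segment, domain localization) have already been computed by external prediction tools.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinFeatures:
    """Hypothetical summary of precomputed annotations for one predicted protein."""
    accession: str
    has_signal_peptide: bool = False
    has_transmembrane_segment: bool = False
    # Typical localization of the domains found in the sequence,
    # e.g. {"extracellular", "cytoplasmic"} (assumed vocabulary).
    domain_locations: set = field(default_factory=set)


def find_conflicts(protein: ProteinFeatures) -> list:
    """Return a description of each feature conflict found in `protein`."""
    conflicts = []

    # Illustrative rule 1: an extracellular domain implies that the protein
    # enters the secretory pathway, which needs a signal peptide or a
    # transmembrane segment.
    if ("extracellular" in protein.domain_locations
            and not (protein.has_signal_peptide
                     or protein.has_transmembrane_segment)):
        conflicts.append("extracellular domain without secretory signal")

    # Illustrative rule 2: domains on both sides of a membrane imply a
    # membrane-spanning segment.
    if ({"extracellular", "cytoplasmic"} <= protein.domain_locations
            and not protein.has_transmembrane_segment):
        conflicts.append(
            "extracellular and cytoplasmic domains without transmembrane segment")

    return conflicts


# Usage: a made-up record whose features contradict each other.
suspect = ProteinFeatures("HYPOTHETICAL_1", domain_locations={"extracellular"})
print(find_conflicts(suspect))
```

A sequence that yields a non-empty list would only be flagged as a candidate for closer inspection; deciding whether it is truly mispredicted, and how to correct it, is a separate step.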
Acknowledgement
This work was supported by grants from the National Office for Research and Technology of Hungary (TECH_09_A1-FixPred9) and the Hungarian Scientific Research Fund (OTKA 101201).
Copyright information
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Patthy, L. (2016). Identification and Correction of Erroneous Protein Sequences in Public Databases. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_9
DOI: https://doi.org/10.1007/978-1-4939-3572-7_9
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols