Identification and Correction of Erroneous Protein Sequences in Public Databases

Patthy, László

doi:10.1007/978-1-4939-3572-7_9

Part of the book series: Methods in Molecular Biology (MIMB, volume 1415)

Abstract

Correctly predicting the structure of protein-coding genes of higher eukaryotes is a difficult task; as a result, public sequence databases that incorporate predicted sequences are increasingly contaminated with erroneous sequences. This high rate of misprediction has serious consequences, since it significantly affects the conclusions that may be drawn from genome-scale sequence analyses.

Here we describe the MisPred and FixPred approaches, which may help to identify and correct erroneous sequences. The rationale of these approaches is that a protein sequence is likely to be erroneous if some of its features conflict with our current knowledge about proteins.
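
The rationale above can be illustrated with a minimal Python sketch. It is not the MisPred implementation: the `ProteinRecord` fields, the `find_conflicts` function, and the two rules shown are hypothetical examples of "features that conflict with current knowledge about proteins", and the annotations are assumed to have been computed beforehand by separate prediction tools.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """Hypothetical, pre-computed annotations for one predicted protein."""
    accession: str
    has_signal_peptide: bool = False
    has_transmembrane_segment: bool = False
    # Expected localization of each detected domain, e.g. "extracellular".
    domain_localizations: set = field(default_factory=set)


def find_conflicts(protein):
    """Return a list of feature conflicts suggesting the sequence may be erroneous.

    Both rules are illustrative assumptions, not rules quoted from the chapter.
    """
    conflicts = []
    locs = protein.domain_localizations

    # Assumed rule: an extracellular domain needs some way to leave the cytoplasm.
    if "extracellular" in locs and not (
        protein.has_signal_peptide or protein.has_transmembrane_segment
    ):
        conflicts.append("extracellular domain without signal peptide or TM segment")

    # Assumed rule: domains on both sides of a membrane imply a membrane-spanning segment.
    if {"extracellular", "cytoplasmic"} <= locs and not protein.has_transmembrane_segment:
        conflicts.append("extracellular and cytoplasmic domains without TM segment")

    return conflicts


if __name__ == "__main__":
    suspect = ProteinRecord("HYPOTHETICAL_1", domain_localizations={"extracellular"})
    for message in find_conflicts(suspect):
        print(suspect.accession, "->", message)
```

A record with no conflicts returns an empty list. A non-empty list only marks the sequence as suspect; it does not prove an error, since the underlying feature predictions can themselves be wrong.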

Acknowledgement

This work was supported by grants from the National Office for Research and Technology of Hungary (TECH_09_A1-FixPred9) and the Hungarian Scientific Research Fund (OTKA 101201).

Author information

Authors and Affiliations

Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, 286, Budapest, H-1519, Hungary
László Patthy

Corresponding author

Correspondence to László Patthy.

Editor information

Editors and Affiliations

Max F. Perutz Laboratories GmbH, Universität Wien, Wien, Austria
Oliviero Carugo
Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
Frank Eisenhaber

Cite this protocol

Patthy, L. (2016). Identification and Correction of Erroneous Protein Sequences in Public Databases. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_9

DOI: https://doi.org/10.1007/978-1-4939-3572-7_9
Published: 27 April 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols
