Identification and Correction of Erroneous Protein Sequences in Public Databases

  • Protocol
Data Mining Techniques for the Life Sciences

Part of the book series: Methods in Molecular Biology (MIMB, volume 1415)

Abstract

Correct prediction of the structure of protein-coding genes of higher eukaryotes is a difficult task. As a result, public sequence databases that incorporate predicted sequences are increasingly contaminated with erroneous sequences. This high rate of misprediction has serious consequences, since it significantly affects the conclusions that may be drawn from genome-scale sequence analyses.

Here we describe the MisPred and FixPred approaches, which may help to identify and correct erroneous sequences. The rationale of these approaches is that a protein sequence is likely to be erroneous if some of its features conflict with our current knowledge about proteins.

Acknowledgement

This work was supported by grants from the National Office for Research and Technology of Hungary (TECH_09_A1-FixPred9) and the Hungarian Scientific Research Fund (OTKA 101201).

Author information

Corresponding author

Correspondence to László Patthy.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Patthy, L. (2016). Identification and Correction of Erroneous Protein Sequences in Public Databases. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_9

  • DOI: https://doi.org/10.1007/978-1-4939-3572-7_9

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3570-3

  • Online ISBN: 978-1-4939-3572-7

  • eBook Packages: Springer Protocols
