Automated Identification of Protein Classification and Detection of Annotation Errors in Protein Databases Using Statistical Approaches

Ning, Kang; Chua, Hon Nian

doi:10.1007/11683568_11

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3886)

Abstract

Because proteins are central to the life sciences, biologists have spent the past few decades elucidating their structures, functions, and expression profiles in order to understand their roles in living cells. Protein databases are now widely used by biologists, so it is critical that the information researchers work with be as accurate as possible. However, these databases are growing rapidly, and existing protein databases are already known to contain annotation errors. In this paper, we investigate why protein databases contain mis-annotated sequence data. Then, using statistical approaches, we derive a method to automatically filter the data from these databases and assess its reliability. This is important for providing accurate information to researchers, and it will help reduce further annotation errors that propagate from existing mis-annotated sequence data. Our initial experiments support our theoretical findings and show that our methods can effectively detect mis-annotated sequence data.

Author information

Authors and Affiliations

School of Computing, National University of Singapore, 3 Science Drive 2, 117543, Singapore
Kang Ning
National University of Singapore, Singapore
Hon Nian Chua

Editor information

Editors and Affiliations

Brain Tumor Research Program, Children’s Memorial Hospital, and Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
Eric G. Bremer
Computer Science Department, Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Jörg Hakenberg
iXmatch Inc., 5555 West 78th Street Suite E, 55439-2702, Minneapolis, MN, USA
Eui-Hong (Sam) Han
School of Biomedical Sciences, University of Ulster, Cromore Road, BT52 1SA, Coleraine, Northern Ireland, UK
Daniel Berrar
School of Biomedical Sciences, Bioinformatics Research Group, University of Ulster, Cromore Road, BT52 1SA, Coleraine, Northern Ireland, UK
Werner Dubitzky

About this paper

Cite this paper

Ning, K., Chua, H.N. (2006). Automated Identification of Protein Classification and Detection of Annotation Errors in Protein Databases Using Statistical Approaches. In: Bremer, E.G., Hakenberg, J., Han, E.-H., Berrar, D., Dubitzky, W. (eds) Knowledge Discovery in Life Science Literature. KDLL 2006. Lecture Notes in Computer Science, vol 3886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11683568_11

DOI: https://doi.org/10.1007/11683568_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32809-4
Online ISBN: 978-3-540-32810-0
eBook Packages: Computer Science, Computer Science (R0)
