Protein Data Integration Problem

  • Amandeep S. Sidhu
  • Matthew Bellgard
Part of the Studies in Computational Intelligence book series (SCI, volume 224)


In this chapter, we consider the challenges of information integration in proteomics from the perspective of researchers using information technology as an integral part of their discovery process. Specifically, data integration, meta-data specification, data provenance and data quality, and ontology are discussed here. These are the fundamental problems that need to be solved by the bioinformatics community so that modern information technology can have a deeper impact on the progress of biological discovery.


Gene Ontology; Protein Data Bank; Generic Concept; Control Vocabulary; Unify Medical Language System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Amandeep S. Sidhu (1)
  • Matthew Bellgard (1)
  1. WA Centre for Comparative Genomics, Murdoch University, Perth, Australia
