Columba: Multidimensional Data Integration of Protein Annotations

  • Kristian Rother
  • Heiko Müller
  • Silke Trissl
  • Ina Koch
  • Thomas Steinke
  • Robert Preissner
  • Cornelius Frömmel
  • Ulf Leser
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2994)


We present COLUMBA, an integrated database of protein annotations. COLUMBA is centered around proteins whose structure has been resolved and adds as many annotations as possible to those proteins, describing properties such as function, sequence, classification, textual description, and participation in pathways. Annotations are extracted from seven (soon eleven) external data sources. In this paper we describe the motivation for building COLUMBA, its integration architecture, and the software tools we developed for integrating the data sources and keeping COLUMBA up to date. We put special focus on two aspects. First, COLUMBA does not try to remove redundancies and overlaps between data sources, but views each data source as a proper dimension describing a protein. We explain the advantages of this approach compared to the tighter semantic integration pursued in many other projects. Second, we highlight our current investigations into the quality of data in COLUMBA, which aim to identify hot spots of poor data quality.
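The idea of treating each data source as its own dimension can be illustrated with a small sketch. This is not COLUMBA's actual schema or code: the class `DimensionStore`, its methods, and the entry identifiers `1abc` and `2xyz` are hypothetical, chosen only to show how annotations from overlapping sources can be kept side by side, keyed by a structure entry, instead of being merged into one reconciled value.

```python
from collections import defaultdict


class DimensionStore:
    """Toy multidimensional store: each data source is its own dimension."""

    def __init__(self):
        self.entries = set()
        # source name -> entry id -> list of annotation records
        self.dimensions = defaultdict(lambda: defaultdict(list))

    def add_entry(self, entry_id):
        self.entries.add(entry_id)

    def annotate(self, source, entry_id, record):
        # Annotations attach only to known structure entries.
        if entry_id not in self.entries:
            raise KeyError(f"unknown entry: {entry_id}")
        self.dimensions[source][entry_id].append(record)

    def view(self, entry_id):
        # Report every source separately; overlapping values are not merged.
        return {
            source: list(by_entry[entry_id])
            for source, by_entry in self.dimensions.items()
            if entry_id in by_entry
        }

    def select(self, conditions):
        # conditions: source name -> predicate over a single record.
        # An entry qualifies if every listed source has a matching record.
        hits = []
        for entry_id in sorted(self.entries):
            if all(
                any(pred(r) for r in self.dimensions.get(source, {}).get(entry_id, []))
                for source, pred in conditions.items()
            ):
                hits.append(entry_id)
        return hits


# Hypothetical entries; SCOP and CATH both classify "1abc" and both are kept.
store = DimensionStore()
store.add_entry("1abc")
store.add_entry("2xyz")
store.annotate("SCOP", "1abc", {"class": "All alpha proteins"})
store.annotate("CATH", "1abc", {"class": "Mainly Alpha"})
store.annotate("SCOP", "2xyz", {"class": "All beta proteins"})

print(store.view("1abc"))
print(store.select({"SCOP": lambda r: r["class"].startswith("All alpha")}))
```

In this sketch a query combines conditions across dimensions, while the question of which source is "right" when two classifications overlap is left to the user, which is the trade-off the abstract contrasts with tighter semantic integration.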


Keywords: Protein Data Bank; Global Schema; Nucleic Acid Research; Protein Annotation; Protein Data Bank Entry
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Kristian Rother (1)
  • Heiko Müller (2)
  • Silke Trissl (2)
  • Ina Koch (3)
  • Thomas Steinke (4)
  • Robert Preissner (1)
  • Cornelius Frömmel (1)
  • Ulf Leser (2)

  1. Institut für Biochemie, Universitätskrankenhaus Charité Berlin, Berlin, Germany
  2. Institut für Informatik, Humboldt-Universität zu Berlin, Berlin, Germany
  3. Technische Fachhochschule Berlin, Berlin, Germany
  4. Zuse Institut Berlin, Berlin, Germany
