DILS 2004: Data Integration in the Life Sciences, pp. 156–171
Columba: Multidimensional Data Integration of Protein Annotations
Abstract
We present COLUMBA, an integrated database of protein annotations. COLUMBA is centered around proteins whose structure has been resolved and adds as many annotations as possible to those proteins, describing their properties such as function, sequence, classification, textual description, and participation in pathways. Annotations are extracted from seven (soon eleven) external data sources. In this paper we describe the motivation for building COLUMBA, its integration architecture, and the software tools we developed for integrating the data sources and keeping COLUMBA up to date. We put special focus on two aspects. First, COLUMBA does not try to remove redundancies and overlaps in data sources, but views each data source as a proper dimension describing a protein. We explain the advantages of this approach compared to a tighter semantic integration as pursued in many other projects. Second, we highlight our current investigations into the quality of data in COLUMBA, which aim to identify hot spots of poor data quality.
Keywords
Protein Data Bank, Global Schema, Nucleic Acid Research, Protein Annotation, Protein Data Bank Entry