
Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

  • Doina Caragea
  • Jyotishman Pathak
  • Jie Bao
  • Adrian Silvescu
  • Carson Andorf
  • Drena Dobbs
  • Vasant Honavar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3615)

Abstract

We present INDUS (Intelligent Data Understanding System), a federated, query-centric system for knowledge acquisition from autonomous, distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to let a user or an application view a collection of such data sources (regardless of location, internal structure, and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology. We used the INDUS framework to design algorithms for learning probabilistic models (e.g., Naive Bayes models) that predict the GO functional classification of a protein from training sequences distributed among the SWISSPROT and MIPS data sources. Mappings such as EC2GO and MIPS2GO were used to resolve the semantic differences between these data sources when answering queries posed by the learning algorithms. Our results show that INDUS can be successfully used for the integrative analysis of data from multiple sources that collaborative discovery in computational biology requires.
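The idea of answering a learner's queries across sources with different vocabularies can be illustrated with a small sketch. It is not the INDUS implementation or its API: the mapping tables, term names, toy sequences, and function names below are all hypothetical. It assumes that each source can answer count queries locally, that labels are translated into the user's ontology through a per-source mapping (a stand-in for mappings such as EC2GO or MIPS2GO), and that a Naive Bayes model with add-one smoothing is then built from the merged counts, so no rows are ever copied to a central store.

```python
import math
from collections import Counter

# Hypothetical label mappings into the user's ontology (toy stand-ins for
# real mappings such as EC2GO or MIPS2GO; the term names are invented).
EC_TO_USER = {"ec:kinase": "GO:kinase", "ec:ligase": "GO:ligase"}
MIPS_TO_USER = {"mips:kin": "GO:kinase", "mips:lig": "GO:ligase"}

# Toy "tables": (sequence tokens, label in the source's own vocabulary).
SOURCE_A = [(["K", "K", "G"], "ec:kinase"), (["L", "L", "A"], "ec:ligase")]
SOURCE_B = [(["K", "G", "G"], "mips:kin"), (["L", "A", "A"], "mips:lig")]


def local_counts(rows, mapping):
    """Answer a count query at one source, translating labels on the way out."""
    class_counts, token_counts = Counter(), Counter()
    for tokens, source_label in rows:
        label = mapping[source_label]  # resolve the semantic difference
        class_counts[label] += 1
        for tok in tokens:
            token_counts[(label, tok)] += 1
    return class_counts, token_counts


def merge(stats):
    """Add up the per-source counts; only counts leave the sources."""
    total_classes, total_tokens = Counter(), Counter()
    for class_counts, token_counts in stats:
        total_classes.update(class_counts)
        total_tokens.update(token_counts)
    return total_classes, total_tokens


def predict(tokens, class_counts, token_counts):
    """Multinomial Naive Bayes prediction with add-one smoothing."""
    vocab_size = len({tok for (_, tok) in token_counts})
    n_examples = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        n_tokens = sum(c for (lab, _), c in token_counts.items() if lab == label)
        score = math.log(n_label / n_examples)
        for tok in tokens:
            smoothed = (token_counts[(label, tok)] + 1) / (n_tokens + vocab_size)
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


class_counts, token_counts = merge([
    local_counts(SOURCE_A, EC_TO_USER),
    local_counts(SOURCE_B, MIPS_TO_USER),
])
prediction = predict(["K", "G"], class_counts, token_counts)
```

Because the model depends on the data only through counts, merging per-source counts gives the same model as pooling the rows first, which is what makes a warehouse unnecessary in this toy setting.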

Keywords

Knowledge Acquisition; Information Integration; Heterogeneous Data Source; Alpha Beta; Semantic Correspondence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Doina Caragea (1, 4)
  • Jyotishman Pathak (1, 4)
  • Jie Bao (1, 4)
  • Adrian Silvescu (1, 4)
  • Carson Andorf (1, 3, 4)
  • Drena Dobbs (2, 3, 4)
  • Vasant Honavar (1, 2, 3, 4)
  1. Department of Computer Science, AI Research Laboratory
  2. Department of Genetics, Development and Cell Biology, 1210 Molecular Biology
  3. Bioinformatics and Computational Biology Program, 2014 Molecular Biology
  4. Computational Intelligence, Learning and Discovery Program, 214 Atanasoff Hall, Iowa State University, Ames, USA
