Distributed and Parallel Databases, Volume 24, Issue 1–3, pp 73–94

Integrating semantically heterogeneous aggregate views of distributed databases

  • Sally McClean
  • Bryan Scotney
  • Philip Morrow
  • Kieran Greer
Article

Abstract

In statistical databases and data warehousing applications it is commonly the case that aggregate views are maintained as an underlying mechanism for summarising information. Where the databases or applications are distributed, or arise from independent data collections or system developments, there may be incompatibility, heterogeneity, and data inconsistency. These challenges need to be overcome if federations of aggregated databases are to be successfully incorporated into systems for database management, querying, retrieval, and knowledge discovery.

In this paper we address the issue of integrating aggregate views that have semantically heterogeneous classification schemes. In previous work we have developed a methodology that is efficient but that cannot easily handle data inconsistencies. Our previous approach is therefore not particularly well-suited to very large databases or federations of large numbers of databases. We now address these scalability issues by introducing a methodology for heterogeneous aggregate view integration that constructs a dynamic shared ontology to which each of the aggregate views can be explicitly related. A maximum likelihood technique, implemented using the EM (Expectation-Maximisation) algorithm, is used to inherently handle data inconsistencies in the computation of integrated aggregates that are described in terms of the dynamic shared ontology.
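The maximum likelihood step described above can be illustrated with a minimal sketch. It assumes the shared ontology is already available as a list of fine-grained categories indexed 0..K-1, and that each aggregate view reports counts over groups of those categories. The function name `em_integrate`, the input layout, and the example counts are all hypothetical and are not taken from the paper; this is the standard EM update for multinomial proportions observed through grouped counts, not necessarily the authors' exact formulation.

```python
def em_integrate(views, num_categories, max_iter=1000, tol=1e-10):
    """Estimate proportions over shared-ontology categories from aggregate views.

    views: list of views; each view is a list of (member_indices, count) pairs,
           where member_indices are the ontology categories a view cell covers.
    Returns a list of estimated proportions, one per ontology category.
    """
    total = sum(count for view in views for _, count in view)
    # Start from a uniform distribution so every category has positive mass.
    pi = [1.0 / num_categories] * num_categories

    for _ in range(max_iter):
        expected = [0.0] * num_categories
        # E-step: share each cell's count among its member categories
        # in proportion to the current estimates.
        for view in views:
            for members, count in view:
                mass = sum(pi[i] for i in members)
                if mass == 0.0:
                    continue  # only reachable for cells whose count is zero
                for i in members:
                    expected[i] += count * pi[i] / mass
        # M-step: renormalise the expected counts into proportions.
        new_pi = [e / total for e in expected]
        change = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if change < tol:
            break
    return pi


# Hypothetical example: three ontology categories, two views that group
# them differently (view A merges categories 1 and 2; view B merges 0 and 1).
view_a = [([0], 30), ([1, 2], 70)]
view_b = [([0, 1], 50), ([2], 50)]
estimates = em_integrate([view_a, view_b], num_categories=3)
print(estimates)
```

In this example the two views are mutually consistent, so the estimates approach the proportions (0.3, 0.2, 0.5) that reproduce both sets of counts. When views disagree, the same iteration still converges to a compromise estimate, which is the sense in which a likelihood-based approach absorbs inconsistencies rather than failing on them.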

Keywords

Distributed databases; Aggregate views; Heterogeneous data; Dynamic shared ontologies

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Sally McClean (1)
  • Bryan Scotney (1)
  • Philip Morrow (1)
  • Kieran Greer (1)
  1. School of Computing and Information Engineering, University of Ulster, Coleraine, Northern Ireland
