Integrating semantically heterogeneous aggregate views of distributed databases
In statistical databases and data warehousing applications, aggregate views are commonly maintained as an underlying mechanism for summarising information. Where the databases or applications are distributed, or arise from independent data collections or system developments, the views may be incompatible, heterogeneous, or mutually inconsistent. These challenges need to be overcome if federations of aggregated databases are to be successfully incorporated into systems for database management, querying, retrieval, and knowledge discovery.
In this paper we address the issue of integrating aggregate views that have semantically heterogeneous classification schemes. In previous work we have developed a methodology that is efficient but that cannot easily handle data inconsistencies. Our previous approach is therefore not particularly well-suited to very large databases or federations of large numbers of databases. We now address these scalability issues by introducing a methodology for heterogeneous aggregate view integration that constructs a dynamic shared ontology to which each of the aggregate views can be explicitly related. A maximum likelihood technique, implemented using the EM (Expectation-Maximisation) algorithm, is used to inherently handle data inconsistencies in the computation of integrated aggregates that are described in terms of the dynamic shared ontology.
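As an illustration of the kind of computation involved, the following is a minimal sketch, not the authors' implementation. It assumes that each aggregate view reports counts over categories that are unions of the categories of a shared (finest-grained) classification, and that this relationship is given as a 0/1 matrix. The function name `em_integrate`, the matrix encoding, and the example counts are all hypothetical and chosen only for illustration. The EM iteration shown is the standard one for multinomial counts observed at a coarsened level.

```python
import numpy as np

def em_integrate(views, n_shared, n_iter=1000, tol=1e-10):
    """Estimate proportions over shared categories from coarsened counts.

    views: list of (A, counts) pairs (hypothetical encoding), where A is an
    (m, n_shared) 0/1 matrix with A[j, k] = 1 if shared category k belongs
    to view category j, and counts holds the m observed aggregate counts.
    """
    p = np.full(n_shared, 1.0 / n_shared)  # uniform starting estimate
    total = sum(float(np.sum(c)) for _, c in views)
    for _ in range(n_iter):
        expected = np.zeros(n_shared)
        for A, counts in views:
            A = np.asarray(A, dtype=float)
            counts = np.asarray(counts, dtype=float)
            mass = A @ p  # current probability of each view category
            safe = np.where(mass > 0, mass, 1.0)  # avoid division by zero
            # E-step: share each aggregate count among its shared categories
            # in proportion to the current estimate p.
            weights = (A * p) / safe[:, None]
            expected += counts @ weights
        # M-step: renormalise the expected counts into proportions.
        new_p = expected / total
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

# Two hypothetical views over three shared categories {0, 1, 2}.
view_a = ([[1, 0, 0], [0, 1, 1]], [30, 70])  # categories {0} and {1, 2}
view_b = ([[1, 1, 0], [0, 0, 1]], [50, 50])  # categories {0, 1} and {2}
p_hat = em_integrate([view_a, view_b], n_shared=3)
print(np.round(p_hat, 4))
```

In this example the two views are mutually consistent, so the estimate should approach the proportions (0.3, 0.2, 0.5) that reproduce both sets of counts. When the views disagree, the same iteration still converges to a maximum likelihood compromise, which is the sense in which such an approach handles inconsistency without a separate reconciliation step.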
Keywords: Distributed databases, Aggregate views, Heterogeneous data, Dynamic shared ontologies