Knowledge Discovery from Semantically Heterogeneous Aggregate Databases Using Model-Based Clustering

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA, volume 4587)

Abstract

When distributed databases are developed independently, they may be semantically heterogeneous with respect to data granularity, schema information and embedded semantics. However, most traditional distributed knowledge discovery (DKD) methods assume that the distributed databases derive from a single virtual global table, so that they share the same semantics and data structures. Heterogeneity in the data and in the underlying semantics therefore poses a considerable challenge for DKD. In this paper, we propose a model-based clustering method for aggregate databases whose schema structures differ because their classification schemes differ. The underlying semantics are captured by the clusters. The clustering is carried out via a mixture model, where each component of the mixture corresponds to a different virtual global table. An advantage of our approach is that the algorithm resolves the heterogeneity as part of the clustering process, without first having to homogenise the heterogeneous local schemas to a shared schema. The algorithm is evaluated on both real and synthetic data. Its scalability is tested against the number of databases to be clustered, the number of clusters, and the size of the databases. The relationship between performance and complexity is also evaluated. Our experiments show that this approach has good potential for scalable integration of semantically heterogeneous databases.
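
As an illustration only, the idea of clustering aggregate databases with a mixture model can be sketched in Python. This is not the authors' implementation. It assumes multinomial mixture components over a shared set of base categories, and it assumes each local database reports counts over its own partition of those categories. The function name `em_cluster_aggregates`, the input layout and the smoothing constant are all hypothetical. EM handles the heterogeneous partitions directly by spreading each aggregate count over the base categories in its group.

```python
import numpy as np

def em_cluster_aggregates(databases, n_categories, n_clusters, n_iter=100, seed=0):
    """EM for a mixture of multinomials over aggregate counts (illustrative sketch).

    databases: list of (partition, counts) pairs. `partition` is a list of
    non-empty groups of base-category indices; counts[g] is the aggregate
    count that the database reports for group g.
    Returns (mixing weights, per-cluster category probabilities, responsibilities).
    """
    rng = np.random.default_rng(seed)
    members, counts = [], []
    for partition, c in databases:
        m = np.zeros((len(partition), n_categories))
        for g, group in enumerate(partition):
            m[g, group] = 1.0  # m[g, i] = 1 if base category i lies in group g
        members.append(m)
        counts.append(np.asarray(c, dtype=float))

    n_db = len(databases)
    weights = np.full(n_clusters, 1.0 / n_clusters)
    probs = rng.dirichlet(np.ones(n_categories), size=n_clusters)  # shape (K, C)

    for _ in range(n_iter):
        # E-step: responsibilities and expected base-category counts.
        log_resp = np.empty((n_db, n_clusters))
        allocs = []
        for r, (m, c) in enumerate(zip(members, counts)):
            group_probs = probs @ m.T  # (K, G): probability of each group per cluster
            log_resp[r] = np.log(weights) + (c * np.log(group_probs)).sum(axis=1)
            # Split each group count over its base categories, proportionally
            # to the current cluster probabilities.
            share = probs[:, None, :] * m[None, :, :] / group_probs[:, :, None]
            allocs.append((c[None, :, None] * share).sum(axis=1))  # (K, C)
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate cluster profiles and mixing weights.
        expected = sum(resp[r][:, None] * allocs[r] for r in range(n_db))
        expected = expected + 1e-9  # assumed smoothing to keep probabilities positive
        probs = expected / expected.sum(axis=1, keepdims=True)
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        weights /= weights.sum()

    return weights, probs, resp
```

Each row of the returned responsibilities gives the estimated cluster membership of one database, and each row of the returned probabilities plays the role of one cluster's profile over the base categories. Like any EM procedure, the sketch can converge to a local optimum, so several random seeds would normally be tried.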

Keywords

  • Model-based clustering
  • Semantically heterogeneous databases
  • EM algorithm

Editor information

Richard Cooper, Jessie Kennedy

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, S., McClean, S., Scotney, B. (2007). Knowledge Discovery from Semantically Heterogeneous Aggregate Databases Using Model-Based Clustering. In: Cooper, R., Kennedy, J. (eds) Data Management. Data, Data Everywhere. BNCOD 2007. Lecture Notes in Computer Science, vol 4587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73390-4_22

  • DOI: https://doi.org/10.1007/978-3-540-73390-4_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73389-8

  • Online ISBN: 978-3-540-73390-4

  • eBook Packages: Computer Science, Computer Science (R0)
