Clustering semantically heterogeneous distributed aggregate databases

Abstract

Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and embedded semantics. Managing schema and semantic heterogeneity poses considerable challenges for learning from distributed data and for supporting applications that involve cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. We propose a model-based clustering algorithm that constructs a mixture model in which each component corresponds to a cluster, capturing the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation, with integration carried out within a clustering iteration. Because the algorithm resolves schema heterogeneity as part of the clustering process, it avoids transforming the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.
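The idea of treating schema heterogeneity as incomplete information inside EM can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions that do not come from the abstract: each database reports counts over its own partition of a shared set of base categories, each cluster is a multinomial over those base categories, and the names `em_cluster_aggregates`, `init_pi`, `fine` and `coarse` are hypothetical. The E-step computes cluster responsibilities and splits each coarse cell count over the base categories it covers; the M-step re-estimates the cluster parameters from those expected counts.

```python
import numpy as np

def em_cluster_aggregates(databases, init_pi, n_iter=50, eps=1e-9):
    """EM for a mixture of multinomials over aggregate counts (illustrative sketch).

    databases: list of (B, n) pairs. B is an (S, J) 0/1 matrix mapping each of
               the S local cells to the J base categories it covers (assumed to
               be a partition); n holds the S cell counts.
    init_pi:   (K, J) initial category probabilities, one row per cluster.
    """
    pi = np.asarray(init_pi, dtype=float)
    n_clusters = pi.shape[0]
    alpha = np.full(n_clusters, 1.0 / n_clusters)  # mixing proportions
    for _ in range(n_iter):
        log_resp = np.empty((len(databases), n_clusters))
        expected = []  # per database: (K, J) expected base-category counts
        for r, (B, n) in enumerate(databases):
            q = pi @ B.T  # (K, S) probability of each local cell per cluster
            log_resp[r] = np.log(alpha) + (n * np.log(q)).sum(axis=1)
            # E-step (incomplete data): share each cell count among the base
            # categories it covers, in proportion to the current pi.
            expected.append(pi * ((n / q) @ B))
        # E-step (cluster membership): normalise in log space for stability.
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted expected counts; eps avoids log(0).
        counts = sum(resp[r][:, None] * expected[r]
                     for r in range(len(databases))) + eps
        pi = counts / counts.sum(axis=1, keepdims=True)
        weights = resp.sum(axis=0) + eps
        alpha = weights / weights.sum()
    return alpha, pi, resp

# Hypothetical data: two populations, each observed under two schemas.
fine = np.eye(4)                                            # four base categories
coarse = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # two merged cells
databases = [
    (fine, np.array([70.0, 10.0, 10.0, 10.0])),
    (coarse, np.array([80.0, 20.0])),
    (fine, np.array([10.0, 10.0, 10.0, 70.0])),
    (coarse, np.array([20.0, 80.0])),
]
init_pi = np.array([[0.4, 0.2, 0.2, 0.2], [0.2, 0.2, 0.2, 0.4]])
alpha, pi, resp = em_cluster_aggregates(databases, init_pi)
labels = resp.argmax(axis=1)  # most probable cluster for each database
```

In this sketch the coarse databases are never converted to the fine schema. Their counts enter the likelihood through the cell probabilities `q`, so integration happens inside each iteration. The starting values in `init_pi` are chosen by hand here; EM is sensitive to initialisation, and a practical implementation would need a principled initialisation strategy.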

Acknowledgments

The authors wish to acknowledge support from the EPSRC through the MATCH programme (EP/F063822/1 and EP/G012393/1). The views expressed are those of the authors alone.

Author information

Corresponding author

Correspondence to Shuai Zhang.

About this article

Cite this article

Zhang, S., McClean, S.I. & Scotney, B.W. Clustering semantically heterogeneous distributed aggregate databases. Knowl Inf Syst 38, 331–364 (2014). https://doi.org/10.1007/s10115-012-0588-4

Keywords

  • Model-based clustering
  • Semantically heterogeneous databases
  • EM algorithm
  • Unsupervised learning