Knowledge and Information Systems, Volume 38, Issue 2, pp 331–364

Clustering semantically heterogeneous distributed aggregate databases

  • Shuai Zhang
  • Sally I. McClean
  • Bryan W. Scotney
Regular Paper


Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and the embedded semantics. Managing schema and semantic heterogeneities brings considerable challenges to learning from distributed data and to supporting applications that involve cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. A model-based clustering algorithm is proposed to construct a mixture model in which each component corresponds to a cluster, and the clusters capture the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation, and integration is carried out within each clustering iteration. Our proposed algorithm thus resolves the schema heterogeneity as part of the clustering process and avoids transforming the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from the available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.
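The idea summarised above, in which coarser schemas are treated as incomplete observations of a shared fine-grained classification and resolved inside EM, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes multinomial cluster components, a hypothetical 0/1 mapping matrix `A` from each database's own categories to the shared base categories, and a hand-chosen starting point `init_p`. The function name `em_cluster_aggregates` and the toy counts are invented for illustration.

```python
import numpy as np


def em_cluster_aggregates(databases, init_p, n_iter=100):
    """Fit a mixture of multinomials to aggregate tables with differing schemas.

    databases: list of (A, n) pairs. A is a 0/1 matrix (groups x base categories)
               mapping a database's own categories onto a shared fine partition
               (an assumed representation); n holds the count for each group.
    init_p:    initial cluster distributions, shape (clusters x base categories).
    """
    p = np.asarray(init_p, dtype=float)
    n_clusters, n_base = p.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)  # mixing weights
    for _ in range(n_iter):
        log_resp = np.empty((len(databases), n_clusters))
        expected = np.empty((len(databases), n_clusters, n_base))
        for r, (A, n) in enumerate(databases):
            # Probability of each coarse group under each cluster.
            q = p @ A.T
            # E-step, part 1: log-likelihood of the observed coarse counts.
            log_resp[r] = np.log(pi + 1e-12) + (n * np.log(q)).sum(axis=1)
            # E-step, part 2: apportion each coarse count over its base
            # categories in proportion to the current cluster probabilities.
            expected[r] = p * ((n / q) @ A)
        # Normalise responsibilities in a numerically stable way.
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted expected base counts (small
        # smoothing term keeps every probability strictly positive).
        weighted = np.einsum("rk,rkj->kj", resp, expected) + 1e-9
        p = weighted / weighted.sum(axis=1, keepdims=True)
        pi = resp.mean(axis=0)
    return pi, p, resp


# Toy example: three schemas over four base categories, two populations.
full = np.eye(4)
halves = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
first_vs_rest = np.array([[1, 0, 0, 0], [0, 1, 1, 1]], dtype=float)
pop_a = np.array([0.7, 0.1, 0.1, 0.1])
pop_b = np.array([0.1, 0.1, 0.1, 0.7])
databases = [
    (A, 1000 * A @ pop)  # idealised counts, no sampling noise
    for pop in (pop_a, pop_b)
    for A in (full, halves, first_vs_rest)
]
init_p = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
pi, p, resp = em_cluster_aggregates(databases, init_p)
labels = resp.argmax(axis=1)  # most probable cluster for each database
```

In this toy setup, databases drawn from the same population should receive the same label even though they report under different schemas. No table is converted to a unified schema; the coarse counts are only apportioned in expectation during each iteration.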


Keywords: Model-based clustering, Semantically heterogeneous databases, EM algorithm, Unsupervised learning



The authors wish to acknowledge support from the EPSRC through the MATCH programme (EP/F063822/1 and EP/G012393/1). The views expressed are those of the authors alone.



Copyright information

© Springer-Verlag London 2012

Authors and Affiliations

  • Shuai Zhang (1)
  • Sally I. McClean (1)
  • Bryan W. Scotney (1)
  1. School of Computing and Information Engineering, University of Ulster, Coleraine, UK
