Abstract
Databases developed independently in a common open distributed environment may be heterogeneous with respect to both their data schema and their embedded semantics. Managing schema and semantic heterogeneity poses considerable challenges for learning from distributed data and for supporting applications that involve cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes; such aggregates are often the result of materialised views of native large-scale distributed databases. We propose a model-based clustering algorithm that constructs a mixture model in which each component corresponds to a cluster, and the clusters capture the contextual heterogeneity among databases drawn from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation, and integration is carried out within each clustering iteration. Because the algorithm resolves schema heterogeneity as part of the clustering process, it avoids transforming the data into a unified schema. Evaluation of the algorithm on classification, scalability and reliability, using both real and synthetic data, demonstrates that it can achieve good performance by incorporating all of the information from the available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.
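To make the idea of treating schema heterogeneity as incomplete information concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each database holds counts over its own grouping of a shared set of base categories, and that each cluster is a multinomial distribution over those base categories; the function names (`e_step`, `fit`, `fit_best`), the smoothing constant and the toy data are all hypothetical choices made for illustration. The E-step computes cluster responsibilities and spreads each coarse count over the base categories it covers, so no database is ever converted to a unified schema beforehand.

```python
import numpy as np

def e_step(pi, p, counts, member):
    """Responsibilities, expected base-category counts, and log-likelihood."""
    n_db, (n_clusters, n_base) = len(counts), p.shape
    log_joint = np.zeros((n_db, n_clusters))
    expected = np.zeros((n_db, n_clusters, n_base))
    for r in range(n_db):
        q = p @ member[r].T  # (clusters, groups): probability of each coarse group
        log_joint[r] = np.log(pi) + np.log(q) @ counts[r]
        # Spread each group count over its base categories in proportion to p.
        expected[r] = p * ((counts[r] / q) @ member[r])
    top = log_joint.max(axis=1, keepdims=True)
    log_norm = top + np.log(np.exp(log_joint - top).sum(axis=1, keepdims=True))
    resp = np.exp(log_joint - log_norm)
    return resp, expected, float(log_norm.sum())

def fit(counts, schemas, n_base, n_clusters, n_iter=200, seed=0):
    """One EM run from a random start (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    counts = [np.asarray(c, dtype=float) for c in counts]
    # member[r][s, k] = 1 if base category k belongs to group s of database r.
    member = []
    for groups in schemas:
        m = np.zeros((len(groups), n_base))
        for s, ks in enumerate(groups):
            m[s, list(ks)] = 1.0
        member.append(m)
    pi = np.full(n_clusters, 1.0 / n_clusters)
    p = rng.dirichlet(np.ones(n_base), size=n_clusters)
    for _ in range(n_iter):
        resp, expected, _ = e_step(pi, p, counts, member)
        # M-step; the small constant (an assumption) keeps probabilities positive.
        weighted = np.einsum("rg,rgk->gk", resp, expected) + 1e-6
        p = weighted / weighted.sum(axis=1, keepdims=True)
        pi = resp.sum(axis=0) + 1e-6
        pi /= pi.sum()
    resp, _, loglik = e_step(pi, p, counts, member)
    return pi, p, resp, loglik

def fit_best(counts, schemas, n_base, n_clusters, n_restarts=20):
    """Keep the restart with the highest log-likelihood."""
    runs = [fit(counts, schemas, n_base, n_clusters, seed=s)
            for s in range(n_restarts)]
    return max(runs, key=lambda run: run[3])

# Toy data (invented): three base categories, four databases, three schemas.
schemas = [
    [[0], [1], [2]],   # fine-grained schema
    [[0, 1], [2]],     # categories 0 and 1 merged
    [[0], [1], [2]],
    [[0], [1, 2]],     # categories 1 and 2 merged
]
counts = [[80, 15, 5], [95, 5], [5, 15, 80], [6, 94]]

pi, p, resp, loglik = fit_best(counts, schemas, n_base=3, n_clusters=2)
labels = resp.argmax(axis=1)
print(labels)
```

EM can converge to a poor local optimum from an unlucky start, which is why the sketch keeps the best of several random restarts. The log-likelihood omits the multinomial coefficients, which are constant across restarts and so do not affect the comparison.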
Acknowledgments
The authors wish to acknowledge support from the EPSRC through the MATCH programme (EP/F063822/1 and EP/G012393/1). The views expressed are those of the authors alone.
Cite this article
Zhang, S., McClean, S.I. & Scotney, B.W. Clustering semantically heterogeneous distributed aggregate databases. Knowl Inf Syst 38, 331–364 (2014). https://doi.org/10.1007/s10115-012-0588-4
DOI: https://doi.org/10.1007/s10115-012-0588-4