Clustering semantically heterogeneous distributed aggregate databases

Zhang, Shuai; McClean, Sally I.; Scotney, Bryan W.

doi:10.1007/s10115-012-0588-4

Regular Paper
Published: 22 December 2012

Volume 38, pages 331–364, (2014)
Knowledge and Information Systems

Shuai Zhang¹,
Sally I. McClean¹ &
Bryan W. Scotney¹

Abstract

Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and the embedded semantics. Managing schema and semantic heterogeneities brings considerable challenges to learning from distributed data and to supporting applications that involve cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. A model-based clustering algorithm is proposed to construct a mixture model in which each component corresponds to a cluster that captures the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation, and integration is carried out within a clustering iteration. Our proposed algorithm resolves the schema heterogeneity as part of the clustering process, thus avoiding transformation of the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.
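
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each database reports counts over its own partition of a shared set of fine-grained categories (a coarser partition stands in for schema heterogeneity), that each cluster is a single multinomial distribution over those categories, and that a tiny pseudo-count is added for numerical safety. The function name `em_cluster`, the data layout and the example counts are all hypothetical.

```python
import math

def em_cluster(databases, n_categories, init_probs, n_iter=50):
    """EM for a mixture of multinomials over aggregate tables (illustrative).

    databases:  list of tables; each table is a list of (cell, count), where
                cell is a tuple of fine-category indices covered by the count.
    init_probs: one strictly positive probability vector per cluster.
    """
    n_clusters = len(init_probs)
    probs = [list(p) for p in init_probs]        # probs[g][i]
    weights = [1.0 / n_clusters] * n_clusters    # mixing proportions
    eps = 1e-9                                   # assumed smoothing constant
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each database belongs to cluster g.
        resp = []
        for table in databases:
            log_post = []
            for g in range(n_clusters):
                ll = math.log(weights[g])
                for cell, count in table:
                    ll += count * math.log(sum(probs[g][i] for i in cell))
                log_post.append(ll)
            top = max(log_post)                  # subtract max to avoid underflow
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: a coarse cell's count is shared among its fine categories in
        # proportion to the current cluster probabilities (incomplete data).
        for g in range(n_clusters):
            expected = [eps] * n_categories
            for table, r in zip(databases, resp):
                for cell, count in table:
                    mass = sum(probs[g][i] for i in cell)
                    for i in cell:
                        expected[i] += r[g] * count * probs[g][i] / mass
            total = sum(expected)
            probs[g] = [e / total for e in expected]
            member = sum(r[g] for r in resp)
            weights[g] = (member + eps) / (len(databases) + n_clusters * eps)
    return weights, probs, resp

# Four hypothetical databases over 4 fine categories, with differing schemas.
databases = [
    [((0,), 40), ((1,), 30), ((2,), 20), ((3,), 10)],   # fine schema
    [((0, 1), 72), ((2, 3), 28)],                       # two merged cells
    [((0,), 10), ((1,), 20), ((2,), 30), ((3,), 40)],   # fine schema
    [((0,), 12), ((1, 2), 48), ((3,), 40)],             # middle cells merged
]
init_probs = [[0.3, 0.3, 0.2, 0.2], [0.2, 0.2, 0.3, 0.3]]
weights, probs, resp = em_cluster(databases, 4, init_probs)
labels = [max(range(len(r)), key=lambda g: r[g]) for r in resp]
print(labels)
```

In this sketch no table is converted to a common schema beforehand: the coarse counts enter the likelihood directly through the summed cell probabilities, which mirrors the abstract's point that schema heterogeneity is treated as incomplete information inside the clustering iteration.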

Acknowledgments

The authors wish to acknowledge support from the EPSRC through the MATCH programme (EP/F063822/1 and EP/G012393/1). The views expressed are those of the authors alone.

Author information

Authors and Affiliations

School of Computing and Information Engineering, University of Ulster, Coleraine, UK
Shuai Zhang, Sally I. McClean & Bryan W. Scotney

Corresponding author

Correspondence to Shuai Zhang.

About this article

Cite this article

Zhang, S., McClean, S.I. & Scotney, B.W. Clustering semantically heterogeneous distributed aggregate databases. Knowl Inf Syst 38, 331–364 (2014). https://doi.org/10.1007/s10115-012-0588-4

Received: 18 April 2011
Revised: 18 April 2011
Accepted: 30 November 2012
Published: 22 December 2012
Issue Date: February 2014
DOI: https://doi.org/10.1007/s10115-012-0588-4
