Abstract
In this paper, we present a data mining approach to the challenges of matching heterogeneous datasets. In particular, we propose solutions to two problems that arise when integrating results from different scientific studies. The first problem, attribute matching, involves discovering correspondences among distinct numeric features (attributes) used to characterize datasets that were collected and analyzed in different research labs. The second problem, cluster matching, involves discovering matchings between patterns (clusters) across datasets. We treat the two problems jointly as a single multi-objective optimization problem. We describe a multi-objective simulated annealing algorithm that searches for optimal solutions and compare it with a genetic algorithm. The utility of this approach is demonstrated in a series of experiments using synthetic and realistic datasets designed to simulate heterogeneous data from different sources.
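To make the idea of multi-objective simulated annealing concrete, the following is a minimal, generic sketch and not the algorithm evaluated in the paper. The abstract does not specify the objective functions, the neighbourhood move, or the acceptance rule, so everything here is an assumption made for illustration: a matching is encoded as a permutation, a neighbour is produced by swapping two positions, worsening moves are accepted with a Metropolis-style probability on the summed worsening, and an archive keeps the non-dominated matchings found so far. The names `dominates`, `update_archive`, `anneal`, `f1` and `f2` are hypothetical.

```python
import math
import random


def dominates(a, b):
    """True if score vector a is no worse than b everywhere and strictly
    better in at least one objective (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, candidate, scores):
    """Return the archive with the candidate added if it is non-dominated,
    dropping any archived entries that the candidate dominates."""
    if any(dominates(s, scores) or s == scores for _, s in archive):
        return archive
    kept = [(sol, s) for sol, s in archive if not dominates(scores, s)]
    kept.append((candidate, scores))
    return kept


def anneal(objectives, n, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Archive-based simulated annealing over permutations of range(n).

    Assumed design (not taken from the paper): swap neighbourhood,
    geometric cooling, acceptance based on the summed worsening.
    """
    rng = random.Random(seed)
    current = list(range(n))
    rng.shuffle(current)
    cur_scores = tuple(f(current) for f in objectives)
    archive = [(tuple(current), cur_scores)]
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        neighbor = current[:]
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        new_scores = tuple(f(neighbor) for f in objectives)
        # Total amount by which the neighbour is worse, over all objectives.
        worsening = sum(max(0.0, ns - cs) for ns, cs in zip(new_scores, cur_scores))
        if worsening == 0.0 or rng.random() < math.exp(-worsening / temp):
            current, cur_scores = neighbor, new_scores
            archive = update_archive(archive, tuple(current), cur_scores)
        temp *= cooling
    return archive


# Toy, hypothetical objectives that pull in opposite directions:
# f1 prefers the identity matching, f2 prefers the reversed matching.
n = 6
f1 = lambda p: sum(abs(p[i] - i) for i in range(n))
f2 = lambda p: sum(abs(p[i] - (n - 1 - i)) for i in range(n))

front = anneal([f1, f2], n)
for solution, scores in sorted(front, key=lambda entry: entry[1]):
    print(solution, scores)
```

The result is a set of trade-off matchings rather than a single answer, which is the usual outcome when two objectives (here standing in for attribute-matching and cluster-matching quality) cannot be optimized independently.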
Cite this article
Liu, H., Dou, D. & Wang, H. Breaking the Deadlock: Simultaneously Discovering Attribute Matching and Cluster Matching with Multi-Objective Metaheuristics. J Data Semant 1, 133–145 (2012). https://doi.org/10.1007/s13740-012-0010-0
DOI: https://doi.org/10.1007/s13740-012-0010-0