Abstract.
Most research on attribute identification in database integration has focused on integrating attributes using schema information and summary information derived from the attribute values. No prior work has fully explored the use of the attribute values themselves to perform attribute identification. We propose an attribute identification method that employs schema and summary instance information as well as properties of attributes derived from their instances. Unlike other methods, which match only single attributes, our method matches attribute groups for integration. Because the method fully explores data instances, it can identify corresponding attributes to be integrated even when schema information is misleading. Three experiments were performed to validate the method. In the first experiment, the heuristic rules derived for attribute classification were evaluated on 119 attributes from nine public-domain data sets. The second was a controlled experiment that tested the robustness of the method by introducing erroneous data. The third experiment evaluated the method on five data sets extracted from online music stores. The results demonstrated the viability of the proposed method.
Additional information
Received: 30 August 2001, Accepted: 31 August 2002, Published online: 31 July 2003
Edited by L. Raschid
About this article
Cite this article
Chua, C.E.H., Chiang, R.H.L. & Lim, E.-P. Instance-based attribute identification in database integration. The VLDB Journal 12, 228–243 (2003). https://doi.org/10.1007/s00778-003-0088-y
DOI: https://doi.org/10.1007/s00778-003-0088-y