Instance-based attribute identification in database integration

The VLDB Journal

Abstract.

Most research on attribute identification in database integration has focused on integrating attributes using schema and summary information derived from the attribute values. No research has attempted to fully explore the use of attribute values to perform attribute identification. We propose an attribute identification method that employs schema and summary instance information as well as properties of attributes derived from their instances. Unlike other attribute identification methods that match only single attributes, our method matches attribute groups for integration. Because our attribute identification method fully explores data instances, it can identify corresponding attributes to be integrated even when schema information is misleading. Three experiments were performed to validate our attribute identification method. In the first experiment, the heuristic rules derived for attribute classification were evaluated on 119 attributes from nine public domain data sets. The second was a controlled experiment validating the robustness of the proposed attribute identification method by introducing erroneous data. The third experiment evaluated the proposed attribute identification method on five data sets extracted from online music stores. The results demonstrated the viability of the proposed method.
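The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea that instance values can reveal corresponding attributes when schema names mislead. It is not the authors' method: the Jaccard overlap measure, the function names, and the sample music-store data are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_attributes(table_a, table_b):
    """For each attribute in table_a, pick the table_b attribute
    whose instance values overlap the most (hypothetical helper)."""
    matches = {}
    for name_a, values_a in table_a.items():
        best = max(table_b, key=lambda name_b: jaccard(values_a, table_b[name_b]))
        matches[name_a] = (best, jaccard(values_a, table_b[best]))
    return matches


# Hypothetical data: the column names in store_b give no reliable hint
# about which column holds artists and which holds album titles.
store_a = {
    "artist": ["Miles Davis", "Nina Simone", "John Coltrane"],
    "title": ["Kind of Blue", "Pastel Blues", "Blue Train"],
}
store_b = {
    "name": ["Kind of Blue", "Blue Train", "Giant Steps"],
    "label": ["John Coltrane", "Miles Davis", "Bill Evans"],
}

result = match_attributes(store_a, store_b)
print(result)
```

In this toy case the value overlap pairs `artist` with `label` and `title` with `name`, which a purely name-based comparison would be unlikely to find. A real method would also need to handle attribute groups, differing value formats, and erroneous data, none of which this sketch attempts.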

Author information

Correspondence to Cecil Eng H. Chua.

Additional information

Received: 30 August 2001, Accepted: 31 August 2002, Published online: 31 July 2003

Edited by L. Raschid

About this article

Cite this article

Chua, C.E.H., Chiang, R.H.L. & Lim, E.-P. Instance-based attribute identification in database integration. The VLDB Journal 12, 228–243 (2003). https://doi.org/10.1007/s00778-003-0088-y

  • DOI: https://doi.org/10.1007/s00778-003-0088-y
