Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques

World Wide Web

Abstract

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. Since the scalability issue of record linkage was addressed in [21], the repertoire of database techniques dealing with multidimensional data sets has significantly increased. Specifically, many effective and efficient approaches for distance-preserving transforms and similarity joins have been developed. Based on these advances, we explore a novel approach to record linkage. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach [16] as an example. Given the merging rule that defines when two records are similar based on their attribute-level similarities, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to find similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and recall.
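The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fastmap`, `similarity_join`), the parameters `k` and `eps`, the choice of edit distance as the attribute-level similarity, the nested-loop join, and the final verification against the original distance are all assumptions made for illustration. The mapping follows the general FastMap idea from [16]: pick two far-apart pivot objects per dimension and project every object onto the line between them using the cosine law.

```python
import math
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fastmap(objects, dist, k, seed=0):
    """Map each object to a k-dimensional point (FastMap-style sketch)."""
    n = len(objects)
    if n == 0:
        return []
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # Squared distance left over after the first `dim` axes are removed.
        base = dist(objects[i], objects[j]) ** 2
        proj = sum((coords[i][h] - coords[j][h]) ** 2 for h in range(dim))
        return max(base - proj, 0.0)  # clamp: non-Euclidean distances can go negative

    for dim in range(k):
        # Heuristic pivots: start anywhere, then jump to the farthest object twice.
        a = rng.randrange(n)
        b = max(range(n), key=lambda j: d2(a, j, dim))
        a = max(range(n), key=lambda j: d2(b, j, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # no residual distance left; remaining axes stay at 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection onto the line through the two pivots.
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords


def similarity_join(points_a, points_b, eps):
    """Naive nested-loop join: index pairs whose mapped points lie within eps."""
    return [(i, j)
            for i, p in enumerate(points_a)
            for j, q in enumerate(points_b)
            if math.dist(p, q) <= eps]


if __name__ == "__main__":
    # Hypothetical attribute values from two record lists.
    list_a = ["John Smith", "Mary Jones"]
    list_b = ["Jon Smith", "Marie Jones", "Peter Brown"]
    # Map both lists together so they share the same pivots and axes.
    mapped = fastmap(list_a + list_b, edit_distance, k=2)
    pts_a, pts_b = mapped[:len(list_a)], mapped[len(list_a):]
    candidates = similarity_join(pts_a, pts_b, eps=2.0)
    # Verify candidates against the original distance to discard false positives.
    matches = [(list_a[i], list_b[j]) for i, j in candidates
               if edit_distance(list_a[i], list_b[j]) <= 2]
    print(matches)
```

Because the mapping only approximately preserves edit distance, a join threshold in the mapped space can both miss true pairs and admit false ones. The sketch handles the second case by rechecking candidates with the original distance; the first case is why recall has to be measured experimentally. A real system would replace the quadratic nested loop with one of the multidimensional similarity-join algorithms cited in the references.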

References

  1. Alsabti, K., Ranka, S., Singh, V.: An efficient parallel algorithm for high dimensional similarity join. In IPPS: 11th International Parallel Processing Symposium. IEEE Computer Society Press, Los Alamitos, CA (1998)

  2. Aoki, P.M.: Algorithms for index-assisted selectivity estimation. In ICDE, p. 258 (1999)

  3. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley, Reading, MA (1999)

  4. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ (1961)

  5. Bilenko, M., Mooney, R.J.: Learning to combine trained distance metrics for duplicate detection in databases. Technical report, Computer Science Department, University of Texas, Austin, TX (2002)

  6. Bourgain, J.: On Lipschitz embedding of finite metric spaces in Hilbert space. Isr. J. Math. 52(1–2), 46–52 (1985)

  7. Brinkhoff, T., Kriegel, H.-P., Seeger, B.: Efficient processing of spatial joins using R-trees. In SIGMOD, pp. 237–246 (1993)

  8. Castelli, V., Bergman, L.D. (eds.): Image Databases: Search and Retrieval of Digital Imagery. Wiley, New York (2001)

  9. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In SIGMOD (2003)

  10. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Proximity searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)

  11. Cohen, W.W., Kautz, H.A., McAllester, D.A.: Hardening soft information sources. In Knowledge Discovery and Data Mining, pp. 255–259 (2000)

  12. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In Workshop on Information Integration on the Web (2003)

  13. Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with moves. Technical report, Rutgers University, Rutgers, NJ (2001)

  14. Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Closest pair queries in spatial databases. In SIGMOD, pp. 189–200 (2000)

  15. Elfeky, M.G., Verykios, V.S., Elmagarmid, A.K.: Tailor: A record linkage toolbox. In ICDE (2002)

  16. Faloutsos, C., Lin, K.-I.: FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Carey, M.J., Schneider, D.A. (eds.) SIGMOD, pp. 163–174 (1995)

  17. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.-A.: Declarative data cleaning: Language, model, and algorithms. In VLDB, pp. 371–380 (2001)

  18. Garey, M.R., Johnson, D.S.: Computers and Intractability, a Guide to the Theory of NP-Completeness. Freeman, San Francisco, CA (1991)

  19. Gower, J.C., Legendre, P.: Metric and Euclidean properties of dissimilarity coefficients. J. Classif. 3, 5–48 (1986)

  20. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In VLDB, pp. 491–500 (2001)

  21. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In Carey, M.J., Schneider, D.A. (eds.) SIGMOD, pp. 127–138 (1995)

  22. Hernández, M.A., Stolfo, S.J.: An incremental merge/purge procedure. Technical report, University of Illinois, Illinois (2000)

  23. Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In Haas, L.M., Tiwary, A. (eds.) SIGMOD, pp. 237–248 (1998)

  24. Hjaltason, G.R., Samet, H.: Contractive embedding methods for similarity searching in metric spaces. Technical report, University of Maryland Computer Science, Maryland (2000)

  25. Hristescu, G., Farach-Colton, M.: Cluster-preserving embedding of proteins. Technical report 99-50, Rutgers University, Rutgers, NJ, 8 (1999)

  26. Huang, Y.-W., Jing, N., Rundensteiner, E.A.: A cost model for estimating the performance of spatial joins using R-trees. In Statistical and Scientific Database Management, pp. 30–38 (1997)

  27. Indyk, P.: Sublinear time algorithms for metric space problems. In STOC, pp. 428–434 (1999)

  28. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In Eighth International Conference on Database Systems for Advanced Applications (DASFAA ’03), Kyoto, Japan (2003)

  29. Kamps, J.: Exploiting keyword structure for domain-specific retrieval. In Cross Language Evaluation Forum (2002)

  30. Koudas, N., Sevcik, K.C.: Size separation spatial join. In SIGMOD, pp. 324–335 (1997)

  31. Kruskal, J.B., Wish, M.: Multidimensional Scaling. Sage, Beverly Hills, CA (1978)

  32. Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)

  33. Lee, M.-L., Ling, T.W., Low, W.L.: Intelliclean: a knowledge-based intelligent data cleaner. In Knowledge Discovery and Data Mining, pp. 290–294 (2000)

  34. Lee, M.-L., Ling, T.W., Lu, H., Ko, Y.T.: Cleansing data for mining and warehousing. In Database and Expert Systems Applications (DEXA), pp. 751–760 (1999)

  35. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 707–710 (1966)

  36. Ley, M.: DBLP bibliography. http://dblp.uni-trier.de/

  37. Loshin, D.: Value added data: merge ahead. Intell. Enterp. 3(3) (2000)

  38. Mamoulis, N., Papadias, D.: Integration of spatial join algorithms for processing multiple inputs. In SIGMOD, pp. 1–12 (1999)

  39. Raman, V., Hellerstein, J.M.: Potter’s wheel: an interactive data cleaning system. In The VLDB Journal, pp. 381–390 (2001)

  40. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In SIGMOD, pp. 71–79 (1995)

  41. Sarawagi, S., Bhamidipaty, A., Kirpal, A., Mouli, C.: Alias: An active learning led interactive deduplication system. In Proc. of the 28th Int’l Conference on Very Large Databases (VLDB) (Demonstration session), Hong Kong, August (2002)

  42. Sevcik, K.C., Koudas, N.: High dimensional similarity joins: algorithms and performance evaluation. TKDE 12(1), 3–18 (2000)

  43. Shim, K., Srikant, R., Agrawal, R.: High-dimensional similarity joins. In ICDE, pp. 301–311 (1997)

  44. Shin, H., Moon, B., Lee, S.: Adaptive multi-stage distance join processing. In SIGMOD, pp. 343–354 (2000)

  45. Shinohara, T., An, J., Ishizaka, H.: Approximate retrieval of high-dimensional data with L1 metric by spatial indexing. J. New Gener. Comput. Sys. 18(1), 39–47 (2000)

  46. Wang, J.T.-L., Wang, X., Lin, K.-I., Shasha, D., Shapiro, B.A., Zhang, K.: Evaluating a class of distance-mapping algorithms for data mining and clustering. In Knowledge Discovery and Data Mining, pp. 307–311 (1999)

  47. Winkler, W.: Advanced methods for record linkage. Technical report, Statistical Research Division, Washington, DC: US Bureau of the Census (1994)

  48. Yianilos, P.N., Kanzelberger, K.G.: The likeit intelligent string comparison facility. Technical report, NEC Research Institute (1997)

  49. Young, F.W., Hamer, R.M.: Multidimensional Scaling: History, Theory, and Applications. Erlbaum, New York (1987)

Author information

Corresponding author

Correspondence to Chen Li.

Additional information

Part of this article was published in [28]. In addition to the prior materials, this article contains more analysis, a complete proof, and more experimental results that were not included in the original paper.

About this article

Cite this article

Li, C., Jin, L. & Mehrotra, S. Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques. World Wide Web 9, 557–584 (2006). https://doi.org/10.1007/s11280-006-0226-8

  • DOI: https://doi.org/10.1007/s11280-006-0226-8
