Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques

Li, Chen; Jin, Liang; Mehrotra, Sharad

doi:10.1007/s11280-006-0226-8

Published: 16 January 2007

World Wide Web, Volume 9, pages 557–584 (2006)

Abstract

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. Since the scalability issue of record linkage was addressed in [21], the repertoire of database techniques dealing with multidimensional data sets has significantly increased. Specifically, many effective and efficient approaches for distance-preserving transforms and similarity joins have been developed. Based on these advances, we explore a novel approach to record linkage. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the Fastmap approach [16] as an example. Given the merging rule that defines when two records are similar based on their attribute-level similarities, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to find similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and recall.
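
As a rough illustration of the pipeline described in the abstract, the following Python sketch maps string values into a low-dimensional Euclidean space with a FastMap-style projection and then joins two lists on the mapped points. It is a minimal sketch under stated assumptions, not the authors' implementation: edit distance stands in for the domain-specific similarity, the function names (`edit_distance`, `fastmap`, `link`) and the thresholds `delta` and `delta_mapped` are hypothetical, negative residual distances are clamped to zero, and a nested-loop comparison stands in for a multidimensional similarity join algorithm.

```python
import math
from itertools import product


def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete from s
                           cur[j - 1] + 1,             # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def fastmap(objects, dist, k):
    """Map each object to a k-dimensional point (FastMap-style projection)."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, h):
        # Squared distance that remains after removing the first h coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        d2 -= sum((coords[i][m] - coords[j][m]) ** 2 for m in range(h))
        return max(d2, 0.0)  # assumption: clamp negatives (non-Euclidean input)

    for h in range(k):
        # Pivot heuristic: start at object 0 and hop to the farthest object twice.
        b = max(range(n), key=lambda j: residual_sq(0, j, h))
        a = max(range(n), key=lambda j: residual_sq(b, j, h))
        dab2 = residual_sq(a, b, h)
        if dab2 == 0.0:
            break  # no residual distance left; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Projection of object i onto the line through pivots a and b.
            coords[i][h] = (residual_sq(a, i, h) + dab2
                            - residual_sq(b, i, h)) / (2 * dab)
    return coords


def link(list_a, list_b, delta, delta_mapped, k=4):
    """Return pairs within edit distance delta, filtered in the mapped space."""
    points = fastmap(list_a + list_b, edit_distance, k)
    pa, pb = points[:len(list_a)], points[len(list_a):]
    pairs = []
    # Nested loop as a stand-in for a multidimensional similarity join.
    for (i, p), (j, q) in product(enumerate(pa), enumerate(pb)):
        if math.dist(p, q) <= delta_mapped:                   # cheap filter
            if edit_distance(list_a[i], list_b[j]) <= delta:  # exact check
                pairs.append((list_a[i], list_b[j]))
    return pairs
```

Because the mapping only approximates the original distances, a pair within `delta` can land farther apart than `delta_mapped` in the mapped space and be missed. Choosing a larger `delta_mapped` trades extra verification work for recall.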

References

Alsabti, K., Ranka, S., Singh, V.: An efficient parallel algorithm for high dimensional similarity join. In IPPS: 11th International Parallel Processing Symposium. IEEE Computer Society Press, Los Alamitos, CA (1998)
Aoki, P.M.: Algorithms for index-assisted selectivity estimation. In ICDE, p. 258 (1999)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley, Reading, MA (1999)
Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ (1961)
Bilenko, M., Mooney, R.J.: Learning to combine trained distance metrics for duplicate detection in databases. Technical report, Computer Science Department, University of Texas, Austin, TX (2002)
Bourgain, J.: On Lipschitz embedding of finite metric spaces in Hilbert space. Isr. J. Math. 52(1–2), 46–52 (1985)
Brinkhoff, T., Kriegel, H.-P., Seeger, B.: Efficient processing of spatial joins using R-trees. In SIGMOD, pp. 237–246 (1993)
Castelli, V., Bergman, L.D. (eds.): Image Databases: Search and Retrieval of Digital Imagery. Wiley, New York (2001)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In SIGMOD (2003)
Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Proximity searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)
Cohen, W.W., Kautz, H.A., McAllester, D.A.: Hardening soft information sources. In Knowledge Discovery and Data Mining, pp. 255–259 (2000)
Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In Workshop on Information Integration on the Web (2003)
Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with moves. Technical report, Rutgers University, Rutgers, NJ (2001)
Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Closest pair queries in spatial databases. In SIGMOD, pp. 189–200 (2000)
Elfeky, M.G., Verykios, V.S., Elmagarmid, A.K.: Tailor: A record linkage toolbox. In ICDE (2002)
Faloutsos, C., Lin, K.-I.: Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Carey, M.J., Schneider, D.A. (eds.) SIGMOD, pp. 163–174 (1995)
Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.-A.: Declarative data cleaning: Language, model, and algorithms. In VLDB, pp. 371–380 (2001)
Garey, M.R., Johnson, D.S.: Computers and Intractability, a Guide to the Theory of NP-Completeness. Freeman, San Francisco, CA (1991)
Gower, J.C., Legendre, P.: Metric and Euclidean properties of dissimilarity coefficients. J. Classif. 3, 5–48 (1986)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In VLDB, pp. 491–500 (2001)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In Carey, M.J., Schneider, D.A. (eds.) SIGMOD, pp. 127–138 (1995)
Hernández, M.A., Stolfo, S.J.: An incremental merge/purge procedure. Technical report, University of Illinois, Illinois (2000)
Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In Haas, L.M., Tiwary, A. (eds.) SIGMOD, pp. 237–248 (1998)
Hjaltason, G.R., Samet, H.: Contractive embedding methods for similarity searching in metric spaces. Technical report, University of Maryland Computer Science, Maryland (2000)
Hristescu, G., Farach-Colton, M.: Cluster-preserving embedding of proteins. Technical report 99-50, Rutgers University, Rutgers, NJ, 8 (1999)
Huang, Y.-W., Jing, N., Rundensteiner, E.A.: A cost model for estimating the performance of spatial joins using R-trees. In Statistical and Scientific Database Management, pp. 30–38 (1997)
Indyk, P.: Sublinear time algorithms for metric space problems. In STOC, pp. 428–434 (1999)
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In Eighth International Conference on Database Systems for Advanced Applications (DASFAA ’03), Kyoto, Japan (2003)
Kamps, J.: Exploiting keyword structure for domain-specific retrieval. In Cross Language Evaluation Forum (2002)
Koudas, N., Sevcik, K.C.: Size separation spatial join. In SIGMOD, pp. 324–335 (1997)
Kruskal, J.B., Wish, M.: Multidimensional Scaling. Sage, Beverly Hills, CA (1978)
Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)
Lee, M.-L., Ling, T.W., Low, W.L.: Intelliclean: a knowledge-based intelligent data cleaner. In Knowledge Discovery and Data Mining, pp. 290–294 (2000)
Lee, M.-L., Ling, T.W., Lu, H., Ko, Y.T.: Cleansing data for mining and warehousing. In Database and Expert Systems Applications (DEXA), pp. 751–760 (1999)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 707–710 (1966)
Ley, M.: DBLP bibliography. http://dblp.uni-trier.de/
Loshin, D.: Value added data: merge ahead. Intell. Enterp. 3(3) (2000)
Mamoulis, N., Papadias, D.: Integration of spatial join algorithms for processing multiple inputs. In SIGMOD, pp. 1–12 (1999)
Raman, V., Hellerstein, J.M.: Potter’s wheel: an interactive data cleaning system. In The VLDB Journal, pp. 381–390 (2001)
Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In SIGMOD, pp. 71–79 (1995)
Sarawagi, S., Bhamidipaty, A., Kirpal, A., Mouli, C.: Alias: An active learning led interactive deduplication system. In Proc. of the 28th Int’l Conference on Very Large Databases (VLDB) (Demonstration session), Hong Kong, August (2002)
Sevcik, K.C., Koudas, N.: High dimensional similarity joins: algorithms and performance evaluation. TKDE 12(1), 3–18 (2000)
Shim, K., Srikant, R., Agrawal, R.: High-dimensional similarity joins. In ICDE, pp. 301–311 (1997)
Shin, H., Moon, B., Lee, S.: Adaptive multi-stage distance join processing. In SIGMOD, pp. 343–354 (2000)
Shinohara, T., An, J., Ishizaka, H.: Approximate retrieval of high-dimensional data with L1 metric by spatial indexing. J. New Gener. Comput. Sys. 18(1), 39–47 (2000)
Wang, J.T.-L., Wang, X., Lin, K.-I., Shasha, D., Shapiro, B.A., Zhang, K.: Evaluating a class of distance-mapping algorithms for data mining and clustering. In Knowledge Discovery and Data Mining, pp. 307–311 (1999)
Winkler, W.: Advanced methods for record linkage. Technical report, Statistical Research Division, Washington, DC: US Bureau of the Census (1994)
Yianilos, P.N., Kanzelberger, K.G.: The likeit intelligent string comparison facility. Technical report, NEC Research Institute (1997)
Young, F.W., Hamer, R.M.: Multidimensional Scaling: History, Theory, and Applications. Erlbaum, New York (1987)

Author information

Authors and Affiliations

ICS 424B, University of California, Irvine, CA, 92697, USA
Chen Li, Liang Jin & Sharad Mehrotra

Corresponding author

Correspondence to Chen Li.

Additional information

Part of this article was published in [28]. In addition to the prior materials, this article contains more analysis, a complete proof, and more experimental results that were not included in the original paper.

About this article

Cite this article

Li, C., Jin, L. & Mehrotra, S. Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques. World Wide Web 9, 557–584 (2006). https://doi.org/10.1007/s11280-006-0226-8

Received: 05 August 2005
Accepted: 06 July 2006
Published: 16 January 2007
Issue Date: December 2006
DOI: https://doi.org/10.1007/s11280-006-0226-8
