Probabilistic Data Generation for Deduplication and Data Linkage

Christen, Peter

doi:10.1007/11508069_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3578)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and pre-processing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
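
The idea of artificial data with controlled error rates and known duplicate status can be illustrated with a minimal sketch. This is not the generator presented in the paper; the lookup lists, the field names, the `rec_id` naming scheme, and the functions `make_original`, `corrupt`, `generate`, and `entity_of` are all hypothetical and chosen only for illustration. The sketch creates clean original records, then derives duplicates by applying a single-character edit to each field with probability `error_prob`.

```python
import random

# Hypothetical lookup lists; a realistic generator would sample from
# frequency tables of real names and localities instead.
GIVEN_NAMES = ["peter", "maria", "james", "sophie"]
SURNAMES = ["smith", "nguyen", "miller", "jones"]
SUBURBS = ["canberra", "sydney", "brisbane"]
FIELDS = ("given_name", "surname", "suburb")
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def make_original(rec_num, rng):
    """Create one clean record whose identifier marks it as an original."""
    return {
        "rec_id": f"rec-{rec_num}-org",
        "given_name": rng.choice(GIVEN_NAMES),
        "surname": rng.choice(SURNAMES),
        "suburb": rng.choice(SUBURBS),
    }


def corrupt(value, rng):
    """Apply one single-character edit that always changes the value."""
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    pos = rng.randrange(len(value))
    # Pick a letter different from the one at pos so a substitution is visible.
    letter = rng.choice([c for c in ALPHABET if c != value[pos]])
    if op == "insert":
        return value[:pos] + letter + value[pos:]
    if op == "delete" and len(value) > 1:
        return value[:pos] + value[pos + 1:]
    if op == "transpose" and pos < len(value) - 1 and value[pos] != value[pos + 1]:
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    # Substitution, also used as the fallback when another edit is not possible.
    return value[:pos] + letter + value[pos + 1:]


def generate(num_originals, num_duplicates, error_prob, seed=42):
    """Return originals followed by duplicates derived from random originals."""
    rng = random.Random(seed)  # fixed seed so experiments can be replicated
    originals = [make_original(i, rng) for i in range(num_originals)]
    records = list(originals)
    for d in range(num_duplicates):
        source = rng.choice(originals)
        dup = dict(source)
        dup["rec_id"] = source["rec_id"].replace("-org", f"-dup-{d}")
        for field in FIELDS:
            if rng.random() < error_prob:
                dup[field] = corrupt(dup[field], rng)
        records.append(dup)
    return records


def entity_of(rec_id):
    """Extract the entity number, which encodes the true match status."""
    return rec_id.split("-")[1]


if __name__ == "__main__":
    for record in generate(num_originals=3, num_duplicates=2, error_prob=0.5):
        print(record)
```

Two records are true matches exactly when `entity_of` returns the same value for both, so a linkage algorithm run on this data can be scored against known ground truth, and fixing `seed` makes the experiment repeatable.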

Author information

Authors and Affiliations

Department of Computer Science, Australian National University, Canberra, ACT 0200, Australia
Peter Christen

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, University of Queensland, 4072, Australia
Marcus Gallagher
, POB 30031, FL 32503-1031, Pensacola
James P. Hogan
Faculty of Information Technology, Queensland University of Technology, Box 2434, Q 4001, Brisbane, Australia
Frederic Maire

About this paper

Cite this paper

Christen, P. (2005). Probabilistic Data Generation for Deduplication and Data Linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_15

DOI: https://doi.org/10.1007/11508069_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26972-4
Online ISBN: 978-3-540-31693-0
eBook Packages: Computer Science, Computer Science (R0)