Probabilistic Data Generation for Deduplication and Data Linkage

  • Conference paper
Intelligent Data Engineering and Automated Learning - IDEAL 2005 (IDEAL 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3578)

Abstract

In many data mining projects the data to be analysed contains personal information, such as names and addresses. Cleaning and pre-processing such data often involves deduplication or linkage with other data, which is frequently hampered by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publishing data that contains personal information is normally impossible because of privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled and that the deduplication or linkage status is known, so controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
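The idea of artificial data with controlled error rates and known linkage status can be illustrated with a minimal Python sketch. This is not the generator presented in the paper: the lookup lists, field names, record identifier format and the functions `corrupt` and `generate` are all hypothetical, chosen only to show the principle. Original records are created first, then duplicates are derived from them by applying random single-character edits with a chosen probability, and every record keeps the identifier of the entity it came from.

```python
import random

# Hypothetical lookup values; a realistic generator would use frequency tables.
GIVEN_NAMES = ["peter", "maria", "james", "sarah", "david"]
SURNAMES = ["smith", "jones", "nguyen", "brown", "wilson"]
SUBURBS = ["canberra", "sydney", "brisbane", "perth"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
FIELDS = ("given_name", "surname", "suburb")


def corrupt(value, rng):
    """Apply one random single-character edit to a non-empty string.

    The edit is an insertion, deletion, substitution or transposition.
    It can occasionally leave the value unchanged (for example when two
    equal neighbouring characters are swapped).
    """
    pos = rng.randrange(len(value))
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    if op == "insert":
        return value[:pos] + rng.choice(ALPHABET) + value[pos:]
    if op == "delete" and len(value) > 1:
        return value[:pos] + value[pos + 1:]
    if op == "transpose" and len(value) > 1:
        pos = min(pos, len(value) - 2)  # keep pos + 1 inside the string
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    # Substitution, also used as the fallback for one-character values.
    return value[:pos] + rng.choice(ALPHABET) + value[pos + 1:]


def generate(num_originals, num_duplicates, error_prob, seed=0):
    """Return original records followed by duplicates derived from them.

    Each field of a duplicate is corrupted with probability error_prob.
    The "entity" value records the true linkage status.
    """
    rng = random.Random(seed)  # fixed seed makes experiments repeatable
    originals = []
    for i in range(num_originals):
        originals.append({
            "rec_id": f"rec-{i}-org",
            "entity": i,
            "given_name": rng.choice(GIVEN_NAMES),
            "surname": rng.choice(SURNAMES),
            "suburb": rng.choice(SUBURBS),
        })
    records = list(originals)
    for d in range(num_duplicates):
        source = rng.choice(originals)
        dup = dict(source)
        dup["rec_id"] = f"rec-{source['entity']}-dup-{d}"
        for field in FIELDS:
            if rng.random() < error_prob:
                dup[field] = corrupt(dup[field], rng)
        records.append(dup)
    return records


if __name__ == "__main__":
    for record in generate(num_originals=5, num_duplicates=3, error_prob=0.5):
        print(record)
```

Because every record carries its `entity` value, the set of true matches is known exactly: two records match if and only if their `entity` values are equal. A linkage algorithm's output can therefore be scored against this ground truth, and rerunning with the same `seed` reproduces the same data set.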

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Christen, P. (2005). Probabilistic Data Generation for Deduplication and Data Linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_15

  • DOI: https://doi.org/10.1007/11508069_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26972-4

  • Online ISBN: 978-3-540-31693-0

  • eBook Packages: Computer Science, Computer Science (R0)
