Probabilistic Data Generation for Deduplication and Data Linkage

  • Peter Christen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3578)

Abstract

In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and pre-processing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
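The idea of artificial data with controlled error rates and a known linkage status can be illustrated with a minimal Python sketch. This is not the generator presented in the paper. The value lists, field names, `rec_id` format, the `corrupt` and `generate_dataset` helpers, and the single-character error model are all assumptions made for illustration.

```python
import random

# Hypothetical value lists; a real generator would draw from frequency tables.
GIVEN_NAMES = ["peter", "mary", "john", "sarah"]
SURNAMES = ["smith", "jones", "miller", "nguyen"]
SUBURBS = ["canberra", "sydney", "melbourne", "perth"]
FIELDS = ["given_name", "surname", "suburb"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt(value, rng):
    """Apply one random single-character edit to a non-empty string."""
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    pos = rng.randrange(len(value))
    if op == "insert":
        return value[:pos] + rng.choice(ALPHABET) + value[pos:]
    if op == "delete" and len(value) > 1:
        return value[:pos] + value[pos + 1:]
    if op == "transpose" and pos < len(value) - 1:
        # Swapping two identical neighbours leaves the value unchanged.
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    # Substitute (also the fallback): pick a letter that differs from the current one.
    letter = rng.choice([c for c in ALPHABET if c != value[pos]])
    return value[:pos] + letter + value[pos + 1:]


def generate_dataset(num_originals, num_duplicates, error_prob, seed=0):
    """Create original records plus duplicates with a known true entity.

    error_prob is the assumed per-field probability that a duplicate's
    field value receives one typographical error.
    """
    rng = random.Random(seed)  # fixed seed so experiments can be replicated
    originals = []
    for i in range(num_originals):
        originals.append({
            "rec_id": f"rec-{i}-org",
            "entity_id": i,  # ground truth: which entity the record describes
            "given_name": rng.choice(GIVEN_NAMES),
            "surname": rng.choice(SURNAMES),
            "suburb": rng.choice(SUBURBS),
        })
    duplicates = []
    for d in range(num_duplicates):
        source = rng.choice(originals)
        dup = dict(source)
        dup["rec_id"] = f"rec-{source['entity_id']}-dup-{d}"
        for field in FIELDS:
            if rng.random() < error_prob:
                dup[field] = corrupt(dup[field], rng)
        duplicates.append(dup)
    return originals + duplicates


if __name__ == "__main__":
    for record in generate_dataset(5, 3, error_prob=0.3, seed=42):
        print(record)
```

Because every duplicate keeps the `entity_id` of the record it was derived from, the true match status of any record pair is known, and the fixed seed makes a run repeatable.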

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Peter Christen
    Department of Computer Science, Australian National University, Canberra, Australia
