Accurate Synthetic Generation of Realistic Personal Information

  • Peter Christen
  • Agus Pudjijono
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5476)

Abstract

A large portion of data collected by many organisations today is about people, and often contains personal identifying information, such as names and addresses. Privacy and confidentiality are of great concern when such data is being shared between organisations or made publicly available. Research in (privacy-preserving) data mining and data linkage is suffering from a lack of publicly available real-world data sets that contain personal information, and therefore experimental evaluations can be difficult to conduct. In order to overcome this problem, we have developed a data generator that allows flexible creation of synthetic data containing personal information with realistic characteristics, such as frequency distributions, attribute dependencies, and error probabilities. Our generator significantly improves earlier approaches, and allows the generation of data for individuals, families and households.

Keywords

Artificial data data matching data linkage privacy data mining pre-processing 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Christen, P.: Privacy-preserving data linkage and geocoding: Current approaches and research directions. In: ICDM PADM workshop, Hong Kong (2006)Google Scholar
  2. 2.
    Hernandez, M., Stolfo, S.: Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery 2(1), 9–37 (1998)CrossRefGoogle Scholar
  3. 3.
    Christen, P., Goiser, K.: Quality and complexity measures for data linkage and deduplication. In: Quality Measures in Data Mining. Studies in Computational Intelligence, vol. 43, pp. 127–151. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  4. 4.
    Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1–16 (2007)CrossRefGoogle Scholar
  5. 5.
    Christen, P.: Probabilistic data generation for deduplication and data linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds.) IDEAL 2005. LNCS, vol. 3578, pp. 109–116. Springer, Heidelberg (2005)CrossRefGoogle Scholar
  6. 6.
    Bertolazzi, P., De Santis, L., Scannapieco, M.: Automated record matching in cooperative information systems. In: DQCIS, Siena, Italy (2003)Google Scholar
  7. 7.
    Pudjijono, A.: Probabilistic data generation. Master of Computing (Honours) thesis, Department of Computer Science, The Australian National University (2008)Google Scholar
  8. 8.
    Pollock, J., Zamora, A.: Automatic spelling correction in scientific and scholarly text. Communications of the ACM 27(4), 358–368 (1984)CrossRefGoogle Scholar
  9. 9.
    Christen, P.: A comparison of personal name matching: Techniques and practical issues. In: ICDM MCD workshop, Hong Kong (2006)Google Scholar
  10. 10.
    Christen, P.: Febrl – An open source data cleaning, deduplication and record linkage system with a graphical user interface. In: ACM KDD, Las Vegas (2008)Google Scholar
  11. 11.
    Phua, C., Lee, V., Smith-Miles, K.: The personal name problem and a recommended data mining solution. In: Encyclopedia of Data Warehousing and Mining, 2nd edn., Information Science Reference (2008)Google Scholar
  12. 12.
    Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)CrossRefGoogle Scholar
  13. 13.
    Hall, P., Dowling, G.: Approximate string matching. ACM Computing Surveys 12(4), 381–402 (1980)MathSciNetCrossRefGoogle Scholar
  14. 14.
    Kukich, K.: Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), 377–439 (1992)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Peter Christen
    • 1
  • Agus Pudjijono
    • 2
  1. 1.School of Computer ScienceThe Australian National UniversityCanberraAustralia
  2. 2.Data CenterMinistry of Public Works of Republic of IndonesiaJakartaIndonesia

Personalised recommendations