Parallelizing Record Linkage for Disclosure Risk Assessment

  • Joan Guisado-Gámez
  • Arnau Prat-Pérez
  • Jordi Nin
  • Victor Muntés-Mulero
  • Josep Ll. Larriba-Pey
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5262)

Abstract

Handling very large volumes of confidential data is becoming common practice in many organizations, such as statistical agencies. This calls for protection methods that must be validated in terms of the quality they provide. Record Linkage (RL) makes it possible to compute the disclosure risk, which gives a measure of the quality of a data protection method. However, the RL methods proposed in the literature are computationally costly, which poses difficulties when RL processes have to be executed frequently on large data sets.

Here, we propose a distributed computing technique to improve the performance of an RL process. We show that our technique not only significantly reduces the computing time of an RL process but is also scalable in a distributed environment. We also show that distributed computation can be complemented with SMP-based parallelization in each node, increasing the final speedup.

Keywords

Record linkage, parallel computing, distributed computing, disclosure risk evaluation

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Joan Guisado-Gámez (1)
  • Arnau Prat-Pérez (1)
  • Jordi Nin (2)
  • Victor Muntés-Mulero (1)
  • Josep Ll. Larriba-Pey (1)

  1. DAMA-UPC, Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
  2. IIIA, Artificial Intelligence Research Institute, CSIC, Spanish National Research Council, Bellaterra, Spain
