Heuristic Supervised Approach for Record Linkage

  • Javier Murillo
  • Daniel Abril
  • Vicenç Torra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7647)

Abstract

Record linkage is a well-known technique for linking records in one database to records in another database that refer to the same individuals. Although it is most often used in database integration, it is also used in the data privacy field to evaluate the disclosure risk of protected datasets. In this paper we compare two supervised algorithms that rely on distance-based record linkage, both of which use the Choquet integral, a fuzzy integral, to compute the distance between records. The first approach solves a linear optimization problem that determines the optimal fuzzy measure for the linkage. The second approach is a gradient-type algorithm with constraints for identifying the fuzzy measure. We show the advantages and drawbacks of both algorithms and the situations in which each works better.
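To make the distance computation mentioned in the abstract concrete, the following is a minimal sketch of a discrete Choquet integral used as a record distance. It is not the authors' implementation: the function names, the dictionary representation of the fuzzy measure, and the choice of squared per-attribute differences as the integrand are all assumptions made for illustration. Neither of the two learning algorithms compared in the paper is reproduced here; the fuzzy measure is simply given.

```python
from itertools import combinations


def choquet_integral(values, mu):
    """Discrete Choquet integral of `values` with respect to `mu`.

    `mu` (hypothetical representation) maps frozensets of attribute
    indices to [0, 1], with mu(empty set) = 0 and mu(all indices) = 1.
    """
    # Visit attributes in order of increasing value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, previous = 0.0, 0.0
    for pos, i in enumerate(order):
        # Attributes whose value is at least values[i].
        upper_set = frozenset(order[pos:])
        total += (values[i] - previous) * mu[upper_set]
        previous = values[i]
    return total


def choquet_distance(a, b, mu):
    """Aggregate squared attribute differences (an assumed integrand)."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return choquet_integral(diffs, mu)


def link(record, candidates, mu):
    """Index of the candidate record closest to `record`."""
    return min(range(len(candidates)),
               key=lambda j: choquet_distance(record, candidates[j], mu))


def additive_measure(weights):
    """Additive fuzzy measure built from weights that sum to 1."""
    n = len(weights)
    return {frozenset(s): sum(weights[i] for i in s)
            for r in range(n + 1)
            for s in combinations(range(n), r)}
```

With an additive measure the Choquet integral reduces to a weighted mean, which gives a simple sanity check. A non-additive measure lets the distance reward or penalize particular groups of attributes, and that measure is what the supervised approaches described in the abstract aim to identify.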

Keywords

Fuzzy measure, Choquet integral, Record linkage, Heuristic Optimization

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Javier Murillo (1)
  • Daniel Abril (2, 3)
  • Vicenç Torra (3)
  1. CIFASIS-CONICET, Universidad Nacional de Rosario, Argentina
  2. Universitat Autònoma de Barcelona (UAB), Barcelona, Spain
  3. Institut d’Investigació en Intel·ligència Artificial (IIIA), Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain
