Annals of Operations Research, Volume 195, Issue 1, pp 97–110

Choquet integral for record linkage

  • Daniel Abril
  • Guillermo Navarro-Arribas
  • Vicenç Torra

Abstract

Record linkage is used in data privacy to evaluate the disclosure risk of protected data. It models potential attacks in which an intruder attempts to link records from the protected data to the original data. In this paper we introduce a novel distance-based record linkage method that uses the Choquet integral to compute the distance between records. A fuzzy measure weights each subset of variables in a record. This improves on standard record linkage and gives insight into the re-identification risk of each variable and of the interactions between variables. To find the optimal fuzzy measure for the linkage, we use a supervised learning approach.
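
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`choquet_integral`, `choquet_distance`, `link`, `additive_measure`) are hypothetical, the use of squared per-variable differences as the integrand is an assumption, and the fuzzy measure is written by hand rather than learned by the supervised approach the abstract mentions.

```python
from itertools import combinations


def choquet_integral(values, mu):
    """Discrete Choquet integral of non-negative values w.r.t. fuzzy measure mu.

    mu maps frozensets of variable indices to weights, with mu(empty set) = 0
    and mu(all variables) = 1 (assumed, not checked here).
    """
    # Sort variable indices by increasing value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, previous = 0.0, 0.0
    for pos, i in enumerate(order):
        # Set of variables whose value is at least values[i].
        upper_set = frozenset(order[pos:])
        total += (values[i] - previous) * mu[upper_set]
        previous = values[i]
    return total


def choquet_distance(a, b, mu):
    """Aggregate squared per-variable differences (an assumed choice)."""
    return choquet_integral([(x - y) ** 2 for x, y in zip(a, b)], mu)


def link(original, protected, mu):
    """For each protected record, return the index of the nearest original."""
    return [
        min(range(len(original)),
            key=lambda j: choquet_distance(p, original[j], mu))
        for p in protected
    ]


def additive_measure(weights):
    """Additive fuzzy measure; the Choquet integral then is a weighted mean."""
    n = len(weights)
    return {
        frozenset(s): sum(weights[i] for i in s)
        for r in range(n + 1)
        for s in combinations(range(n), r)
    }


# Hand-picked non-additive measure on two variables (illustrative only):
# each variable alone carries little weight, both together carry all of it.
mu = {
    frozenset(): 0.0,
    frozenset({0}): 0.1,
    frozenset({1}): 0.1,
    frozenset({0, 1}): 1.0,
}
original = [(0.0, 0.0), (10.0, 10.0), (5.0, 1.0)]
protected = [(0.5, -0.2), (9.0, 11.0), (5.5, 1.2)]
print(link(original, protected, mu))  # → [0, 1, 2]
```

With an additive measure the distance reduces to a weighted sum of squared differences, so a weighted Euclidean-style linkage is a special case. A non-additive measure such as `mu` above lets the weight of a subset of variables differ from the sum of the individual weights, which is how interactions between variables can be expressed.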

Keywords

Data privacy, Record linkage, Choquet integral, Optimization

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Daniel Abril (1)
  • Guillermo Navarro-Arribas (1)
  • Vicenç Torra (1)
  1. IIIA, Institut d’Investigació en Intel·ligència Artificial, CSIC (Consejo Superior de Investigaciones Científicas), Bellaterra, Spain