The Effects of Location Access Behavior on Re-identification Risk in a Distributed Environment
In this paper, we investigate how location access patterns influence the re-identification of seemingly anonymous data. In the real world, individuals visit different locations that gather similar information. For instance, multiple hospitals collect health information on the same patient. For research purposes, hospitals share sensitive data, such as DNA sequences, stripped of explicit identifiers to protect anonymity. Separately, for administrative functions, identified data, stripped of DNA, is made available. On a hospital-by-hospital basis, each pair of DNA and identified databases appears unlinkable; however, links can be established when multiple locations’ databases are studied. This problem, known as trail re-identification, is a generalized phenomenon and occurs because an individual’s location access pattern can be matched across the shared databases.
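The matching idea described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes every visit leaves both a DNA record and an identified record at the same hospital (complete trails), and it links a DNA record to a name only when exactly one of each shares the same set of visited hospitals. The hospital labels, record ids, names, and function names are all hypothetical.

```python
from collections import defaultdict


def build_trails(visits):
    """Map each record id to the set of locations where it appears."""
    trails = defaultdict(set)
    for location, record_id in visits:
        trails[record_id].add(location)
    return {rid: frozenset(locs) for rid, locs in trails.items()}


def group_by_trail(trails):
    """Invert the mapping: trail -> list of record ids that share it."""
    groups = defaultdict(list)
    for rid, trail in trails.items():
        groups[trail].append(rid)
    return groups


def match_trails(dna_visits, identified_visits):
    """Link a DNA record to a name when both have the same, unique trail."""
    dna_groups = group_by_trail(build_trails(dna_visits))
    id_groups = group_by_trail(build_trails(identified_visits))
    links = {}
    for trail, dna_ids in dna_groups.items():
        names = id_groups.get(trail, [])
        if len(dna_ids) == 1 and len(names) == 1:
            links[dna_ids[0]] = names[0]
    return links


# Hypothetical releases: (hospital, record) pairs from three hospitals.
identified = [
    ("H1", "Alice"), ("H2", "Alice"),
    ("H1", "Bob"), ("H3", "Bob"),
    ("H2", "Carol"),
    ("H2", "Dave"),
]
dna = [
    ("H1", "d1"), ("H2", "d1"),
    ("H1", "d2"), ("H3", "d2"),
    ("H2", "d3"),
    ("H2", "d4"),
]

links = match_trails(dna, identified)
print(links)
```

At hospital H1 alone, records d1 and d2 cannot be told apart, since both Alice and Bob were seen there. Across all three hospitals, the trails {H1, H2} and {H1, H3} are each unique, so those two records are linked, while d3 and d4 share the trail {H2} and stay ambiguous.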
Data holders cannot exchange data to find and suppress trails that would be re-identified. Thus, it is important to assess the re-identification risk in a system in order to develop techniques to mitigate it. In this research, we evaluate several real-world datasets and observe that trail re-identification is related to the number of people relative to the number of places. To study this phenomenon in more detail, we develop a generative model for location access patterns that simulates observed behavior. We evaluate trail re-identification risk across a range of simulated patterns, and our findings suggest that the skew of the distribution of people to places is one of the main factors driving trail re-identification.
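As a rough illustration of how such a simulation might be set up, the sketch below draws each person's visits from a power-law popularity distribution over places and reports the fraction of people whose trail is unique. The power-law form, the `skew` parameter, the fixed number of visits per person, and the use of trail uniqueness as a stand-in for re-identification risk are all assumptions made for this example; the abstract does not specify the paper's generative model.

```python
import random
from collections import Counter


def simulate_unique_trail_fraction(n_people, n_places, skew,
                                   visits_per_person, seed=0):
    """Return the fraction of simulated people whose trail is unique.

    Assumed model: place k is chosen with weight 1 / (k + 1) ** skew,
    so skew = 0 gives uniform popularity and larger values concentrate
    visits on a few places.
    """
    rng = random.Random(seed)
    places = list(range(n_places))
    weights = [1.0 / (k + 1) ** skew for k in places]

    trails = []
    for _ in range(n_people):
        # Sample visits with replacement; the trail is the set of places seen.
        visited = rng.choices(places, weights=weights, k=visits_per_person)
        trails.append(frozenset(visited))

    counts = Counter(trails)
    unique = sum(1 for trail in trails if counts[trail] == 1)
    return unique / n_people


# Hypothetical parameter sweep over the skew of place popularity.
for skew in (0.0, 1.0, 2.0):
    fraction = simulate_unique_trail_fraction(
        n_people=1000, n_places=20, skew=skew, visits_per_person=4)
    print(f"skew={skew}: unique-trail fraction={fraction:.3f}")
```

Varying `n_people`, `n_places`, and `skew` in a loop like this gives a simple way to explore how the people-to-places distribution affects trail uniqueness under these assumptions.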
Keywords: Sensitive Data, Record Linkage, Reserved System, Location Access, Biomedical Informatics