A Precautionary Approach to Big Data Privacy

Part of the Law, Governance and Technology Series book series (LGTS, volume 24)


Once released to the public, data cannot be taken back. As time passes, data analytic techniques improve and additional datasets become public that can reveal information about the original data. It follows that released data will become increasingly vulnerable to re-identification unless methods with provable privacy properties are used for the data release. We review and draw lessons from the history of re-identification demonstrations; explain why the privacy risk of data that is protected by ad hoc de-identification is not just unknown, but unknowable; and contrast this situation with provable privacy techniques like differential privacy. We then offer recommendations for practitioners and policymakers. Because ad hoc de-identification methods make the probability of a future privacy violation essentially unknowable, we argue for a weak version of the precautionary approach, in which the idea that the burden of proof falls on data releasers guides policies that incentivize them not to default to full, public releases of datasets using ad hoc de-identification methods. We discuss the levers that policymakers can use to influence data access and the options for narrower releases of data. Finally, we present advice for six of the most common use cases for sharing data. Our thesis is that the problem of “what to do about re-identification” unravels once we stop looking for a one-size-fits-all solution, and for each of the six cases we consider, we propose a solution that is tailored, yet principled.


Re-identification De-identification Data Privacy Precautionary principle 



Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Arvind Narayanan (1)
  • Joanna Huey (1)
  • Edward W. Felten (1)
  1. Center for Information Technology Policy, Princeton University, Princeton, USA
