
A Precautionary Approach to Big Data Privacy

  • Chapter

Part of the Law, Governance and Technology Series book series (ISDP, volume 24)


Once released to the public, data cannot be taken back. As time passes, data analytic techniques improve and additional datasets become public that can reveal information about the original data. It follows that released data will become increasingly vulnerable to re-identification—unless methods with provable privacy properties are used for the data release. We review and draw lessons from the history of re-identification demonstrations; explain why the privacy risk of data that is protected by ad hoc de-identification is not just unknown, but unknowable; and contrast this situation with provable privacy techniques like differential privacy. We then offer recommendations for practitioners and policymakers. Because ad hoc de-identification methods make the probability of a future privacy violation essentially unknowable, we argue for a weak version of the precautionary approach: the idea that the burden of proof falls on data releasers should guide policies that incentivize them not to default to full, public releases of datasets protected only by ad hoc de-identification. We discuss the levers that policymakers can use to influence data access and the options for narrower releases of data. Finally, we present advice for six of the most common use cases for sharing data. Our thesis is that the problem of “what to do about re-identification” unravels once we stop looking for a one-size-fits-all solution; each of the six cases we consider admits a solution that is tailored, yet principled.


  • Re-identification
  • De-identification
  • Data
  • Privacy
  • Precautionary principle



  1.

    Though many of the examples are U.S.-centric, the policy recommendations have widespread applicability.

  2.

    Executive Office of the President, President’s Council of Advisors on Science and Technology, Report to the President: Big Data and Privacy: A Technological Perspective (Washington, DC: 2014): 38–39.

  3.

    Ed Felten, “Are pseudonyms ‘anonymous’?,” Tech@FTC, April 30, 2012,

  4.

    Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization,” UCLA Law Review 57 (2010): 1742–43,

  5.

    Solon Barocas and Helen Nissenbaum, “Big Data’s End Run Around Anonymity and Consent,” in Privacy, Big Data, and the Public Good: Frameworks for Engagement, ed. Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum (New York: Cambridge University Press, 2014), 52–54.

  6.

    Paul Lewis and Dominic Rushe, “Revealed: how Whisper app tracks ‘anonymous’ users,” The Guardian, October 16, 2014,

  7.

    Ibid. A poster self-identified as the CTO of Whisper reiterated this point: “We just don’t have any personally identifiable information. Not name, email, phone number, etc. I can’t tell you who a user is without them posting their actual personal information, and in that case, it would be a violation of our terms of service.” rubyrescue, October 17, 2014, comment on blackRust, “How Whisper app tracks ‘anonymous’ users,” Hacker News, October 17, 2014,

  8.

    This is consistent with the database having a technical property called k-anonymity, with k = 10. Latanya Sweeney, “k-anonymity: A Model for Protecting Privacy,” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10, no. 5 (2002): 557–70. Examples like this show why k-anonymity does not guarantee privacy.
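
    To make the k-anonymity property concrete, here is a minimal Python sketch of the check (on a hypothetical toy dataset); it also illustrates the homogeneity problem just mentioned: a table can satisfy k-anonymity while every record in a group shares the same sensitive value, so membership in the group alone reveals it.

    ```python
    from collections import Counter

    def is_k_anonymous(records, quasi_ids, k):
        """True if every combination of quasi-identifier values occurs
        at least k times, i.e. each record hides in a group of >= k."""
        groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
        return all(count >= k for count in groups.values())

    # Hypothetical toy dataset: 3-anonymous on (zip, age_band), yet every
    # record in the group shares one diagnosis, so anyone known to be in
    # the dataset with these quasi-identifiers is exposed anyway.
    records = [
        {"zip": "02138", "age_band": "30-40", "diagnosis": "flu"},
        {"zip": "02138", "age_band": "30-40", "diagnosis": "flu"},
        {"zip": "02138", "age_band": "30-40", "diagnosis": "flu"},
    ]
    print(is_k_anonymous(records, ["zip", "age_band"], k=3))  # True
    ```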

  9.

    Heuristics such as l-diversity and t-closeness account for such privacy-violating inferences, but they nevertheless fall short of the provable privacy concept we discuss in the next section. Ashwin Machanavajjhala et al., “l-diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data (TKDD) 1, no. 1 (2007): 3; Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in IEEE 23rd International Conference on Data Engineering, 2007 (Piscataway, NJ: IEEE, 2007): 106–15.

  10.

    Latanya Sweeney, “Simple Demographics Often Identify People Uniquely” (Data Privacy Working Paper 3, Carnegie Mellon University, Pittsburgh, Pennsylvania, 2000),

  11.

    Salvador Ochoa et al., “Reidentification of Individuals in Chicago’s Homicide Database: A Technical and Legal Study” (final project, 6.805 Ethics and Law on the Electronic Frontier, Massachusetts Institute of Technology, Cambridge, Massachusetts, May 5, 2001),

  12.

    Latanya Sweeney, “Matching Known Patients to Health Records in Washington State Data” (White Paper 1089-1, Data Privacy Lab, Harvard University, Cambridge, Massachusetts, June 2013),

  13.

    Latanya Sweeney, Akua Abu, and Julia Winn, “Identifying Participants in the Personal Genome Project by Name” (White Paper 1021-1, Data Privacy Lab, Harvard University, Cambridge, Massachusetts, April 24, 2013). Sweeney and her team matched 22% of participants based on voter data and 27% based on a public records website.

  14.

    Ben Adida, “Don’t Hash Secrets,” Benlog, June 19, 2008; Ed Felten, “Does Hashing Make Data ‘Anonymous’?,” Tech@FTC, April 22, 2012; Michael N. Gagnon, “Hashing IMEI numbers does not protect privacy,” Dasient Blog, July 26, 2011,

  15.

    Chris Whong, “FOILing NYC’s Taxi Trip Data,” March 18, 2014,

  16.

    Vijay Pandurangan, “On Taxis and Rainbows: Lessons from NYC’s improperly anonymized taxi logs,” Medium, June 21, 2014,
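
    The attack Pandurangan describes can be sketched in a few lines. The exact medallion format and the sample value below are simplified assumptions, but the point stands: when the space of possible inputs is small and known, an unsalted hash is trivially invertible by enumeration.

    ```python
    import hashlib

    # NYC medallion numbers follow a small, predictable pattern; one common
    # form is digit-letter-digit-digit (e.g. "5J44"). Hashing all ~26,000
    # candidates takes well under a second on a laptop.
    def medallion_space():
        for d1 in "0123456789":
            for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
                for rest in range(100):
                    yield f"{d1}{letter}{rest:02d}"

    # Precompute hash -> medallion, then invert any "anonymized" ID instantly.
    lookup = {hashlib.md5(m.encode()).hexdigest(): m for m in medallion_space()}

    released_id = hashlib.md5(b"5J44").hexdigest()  # what the dataset contained
    print(lookup[released_id])  # -> 5J44
    ```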

  17.

    Michael Barbaro and Tom Zeller, Jr., “A Face Is Exposed for AOL Searcher No. 4417749,” New York Times, August 9, 2006,

  18.

    Ratan Dey, Yuan Ding, and Keith W. Ross, “The High-School Profiling Attack: How Online Privacy Laws Can Actually Increase Minors’ Risk” (paper presented at the 13th Privacy Enhancing Technologies Symposium, Bloomington, IN, July 12, 2013); Arvind Narayanan and Vitaly Shmatikov, “De-anonymizing Social Networks,” in Proceedings of the 2009 30th IEEE Symposium on Security and Privacy (Washington, D.C.: IEEE Computer Society, 2009): 173–87.

  19.

    Melissa Gymrek et al., “Identifying Personal Genomes by Surname Inference,” Science 339, no. 6117 (January 2013): 321–24, doi:10.1126/science.1229566.

  20.

    Philippe Golle and Kurt Partridge, “On the Anonymity of Home/Work Location Pairs,” in Pervasive ’09 Proceedings of the 7th International Conference on Pervasive Computing (Berlin, Heidelberg: Springer-Verlag, 2009): 390–97,

  21.

    Alessandro Acquisti, Ralph Gross, and Fred Stutzman, “Faces of Facebook: Privacy in the Age of Augmented Reality” (presentation at BlackHat Las Vegas, Nevada, August 4, 2011). More information can be found in the FAQ on Acquisti’s website:

  22.

    “In the case of high-dimensional data, additional arrangements [beyond de-identification] may need to be pursued, such as making the data available to researchers only under tightly restricted legal agreements.” Ann Cavoukian and Daniel Castro, Big Data and Innovation, Setting the Record Straight: De-identification Does Work (Toronto, Ontario: Information and Privacy Commissioner, June 16, 2014): 3.

  23.

    The median Facebook user has about a hundred friends. Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow, “The anatomy of the Facebook social graph,” (arXiv Preprint, 2011): 3,

  24.

    There are roughly ten million single nucleotide polymorphisms (SNPs) in the human genome; SNPs are the most common type of human genetic variation. “What are single nucleotide polymorphisms (SNPs)?,” Genetics Home Reference: Your Guide to Understanding Genetic Conditions, published October 20, 2014,

  25.

    DHS Data Privacy and Integrity Advisory Committee FY 2005 Meeting Materials (June 15, 2005) (statement of Latanya Sweeney, Associate Professor of Computer Science, Technology and Policy and Director of the Data Privacy Laboratory, Carnegie Mellon University),

  26.

    Anthony Tockar, “Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset,” Neustar: Research, September 15, 2014,

  27.

    Ibid. Tockar goes on to explain how to apply differential privacy to this dataset.
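
    As a rough illustration of the differential-privacy approach Tockar applies, the standard Laplace mechanism adds noise calibrated to a query's sensitivity; a count changes by at most 1 when one person is added or removed, so noise with scale 1/epsilon suffices. This is a minimal sketch, not the exact mechanism used on the taxi data.

    ```python
    import math
    import random

    def laplace_noise(scale):
        """Sample from Laplace(0, scale) via the inverse-CDF transform."""
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def dp_count(true_count, epsilon):
        """Release a counting query under epsilon-differential privacy:
        sensitivity of a count is 1, so Laplace noise with scale
        1/epsilon provides the guarantee."""
        return true_count + laplace_noise(1.0 / epsilon)

    random.seed(0)
    noisy = dp_count(1000, epsilon=0.5)  # e.g. 1000 trips from some location
    print(noisy)
    ```

    Smaller epsilon means stronger privacy and more noise; the analyst sees only the noisy count, and the noise distribution is public, so uncertainty can be reported honestly.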

  28.

    James Siddle, “I Know Where You Were Last Summer: London’s public bike data is telling everyone where you’ve been,” The Variable Tree, April 10, 2014,

  29.

    Arvind Narayanan and Vitaly Shmatikov, “Robust de-anonymization of large sparse datasets,” in Proceedings 2008 IEEE Symposium on Security and Privacy, Oakland, California, USA, May 18–21, 2008 (Los Alamitos, California: IEEE Computer Society, 2008): 111–25. The Netflix Prize dataset included movies and movie ratings for Netflix users.

  30.

    Yves-Alexandre de Montjoye, et al., “Unique in the Crowd: The privacy bounds of human mobility,” Scientific Reports 3 (March 2013), doi:10.1038/srep01376.

  31.

    Other studies have confirmed that pairs of home and work locations can be used as unique identifiers. Golle and Partridge, “On the anonymity of home/work location pairs;” Hui Zang and Jean Bolot, “Anonymization of location data does not work: A large-scale measurement study,” in Proceedings of the 17th International Conference on Mobile Computing and Networking (New York, New York: ACM, 2011): 145–156.

  32.

    A similar type of chaining in a different context can trace a user’s web browsing history. A network eavesdropper can link the majority of a user’s web page visits to the same pseudonymous ID, which can often be linked to a real-world identity. Steven Englehardt et al., “Cookies that give you away: Evaluating the surveillance implications of web tracking” (paper accepted at 24th International World Wide Web Conference, Florence, May 2015).

  33.

    Sean Hooley and Latanya Sweeney, “Survey of Publicly Available State Health Databases” (White Paper 1075-1, Data Privacy Lab, Harvard University, Cambridge, Massachusetts, June 2013),

  34.

    “Thus, while [Sweeney’s re-identification of Governor Weld] speaks to the inadequacy of certain de-identification methods employed in 1996, to cite it as evidence against current de-identification standards is highly misleading. If anything, it should be cited as evidence for the improvement of de-identification techniques and methods insofar as such attacks are no longer feasible under today’s standards precisely because of this case.” Cavoukian and Castro, De-identification Does Work: 5.

    “Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small. In that systematic review (which is the gold standard methodology for summarizing evidence on a given topic) we found that there were 14 known re-identification attacks. Two of those were conducted on data sets that were de-identified with methods that would be defensible (i.e., they followed existing standards). The success rate of the re-identification for these two was very small.” Khaled El Emam and Luk Arbuckle, “Why de-identification is a key solution for sharing data responsibly,” Future of Privacy Forum, July 24, 2014,

  35.

    Gary McGraw and John Viega, “Introduction to Software Security,” InformIT, November 2, 2001,

  36.

    Anup K. Ghosh, Chuck Howell, and James A. Whittaker, “Building Software Securely from the Ground Up,” IEEE Software (January/February 2002): 14–16.

  37.

    For example, the description for a 2012 conference notes that communication between researchers and practitioners is “currently perceived to be quite weak.” “Is Cryptographic Theory Practically Relevant?,” Isaac Newton Institute for Mathematical Sciences. In addition, “[m]odern crypto protocols are too complex to implement securely in software, at least without major leaps in developer know-how and engineering practices.” Arvind Narayanan, “What Happened to the Crypto Dream?, Part 2,” IEEE Security & Privacy 11, no. 3 (2013): 68–71.

  38.

    El Emam and Arbuckle, “Why de-identification is a key solution.”

  39.

    Philippe Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population,” in Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (New York, New York: ACM, 2006): 77–80.

  40.

    Cavoukian and Castro, De-identification Does Work: 4.

  41.

    See Sect. 2.1.

  42.

    Khaled El Emam et al., “De-identification methods for open health data: the case of the Heritage Health Prize claims dataset,” Journal of Medical Internet Research 14, no. 1 (2012): e33, doi:10.2196/jmir.2001.

  43.

    Cavoukian and Castro, De-identification Does Work: 11.

  44.

    El Emam et al., “Heritage Health”.

  45.

    Arvind Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset” (unpublished manuscript, 2011).

  46.

    The dataset “contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service.” “Medicare Provider Utilization and Payment Data: Physician and Other Supplier,” Centers for Medicare & Medicaid Services, last modified April 23, 2014,

  47.

    “Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule,” U.S. Department of Health & Human Services,

  48.

    The following sources contain introductions to differential privacy. Cynthia Dwork et al., “Differential Privacy—A Primer for the Perplexed” (paper presented at the Joint UNECE/Eurostat work session on statistical data confidentiality, Tarragona, Spain, October 2011); Erica Klarreich, “Privacy by the Numbers: A New Approach to Safeguarding Data,” Quanta Magazine (December 10, 2012); Christine Task, “An Illustrated Primer in Differential Privacy,” XRDS 20, no. 1 (2013): 53–57.

  49.

    Ohm, “Broken Promises of Privacy”: 1752–55.

  50.

    Cass R. Sunstein, “The Paralyzing Principle,” Regulation 25, no. 4 (2002): 33–35.

  51.

    Noah M. Sachs, “Rescuing the Strong Precautionary Principle from Its Critics,” Illinois Law Review 2011, no. 4 (2011): 1313.

  52.

    Alternatively, a data provider could show that the expected benefit outweighs the privacy cost of complete re-identification of the entire dataset. In other words, the data provider would need to show that there still would be a net benefit from releasing the data even if the names of all individuals involved were attached to their records in the dataset. This standard would be, in most cases, significantly more restrictive.

  53.

    “OnTheMap,” U.S. Census Bureau; Klarreich, “Privacy by the Numbers.”

  54.

    Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova, “RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (Scottsdale, Arizona: ACM, 2014): 1054–67.
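
    RAPPOR builds on classic randomized response, which can be sketched as follows (a simplified illustration, not Google's actual encoding): each respondent lies with a known probability, so no single report is incriminating, yet the population rate is recoverable in aggregate.

    ```python
    import random

    def report(truth, p_honest=0.75):
        """Each respondent tells the truth with probability p_honest and
        otherwise answers with a fair coin flip, so any single report is
        plausibly deniable."""
        if random.random() < p_honest:
            return truth
        return random.random() < 0.5

    def estimate_rate(reports, p_honest=0.75):
        # E[report] = p_honest * rate + (1 - p_honest) * 0.5; solve for rate.
        observed = sum(reports) / len(reports)
        return (observed - (1 - p_honest) * 0.5) / p_honest

    random.seed(1)
    truths = [random.random() < 0.3 for _ in range(20000)]  # true rate: ~30%
    noisy_reports = [report(t) for t in truths]
    print(round(estimate_rate(noisy_reports), 2))  # close to 0.3
    ```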

  55.

    Jakob Edler and Luke Georghiou, “Public procurement and innovation—Resurrecting the demand side,” Research Policy 36, no. 7 (September 2007): 949–63.

  56.

    Charles Edquist and Jon Mikel Zabala-Iturriagagoitia, “Public Procurement for Innovation as mission-oriented innovation policy,” Research Policy 41, no. 10 (December 2012): 1757–69.

  57.

    Elvira Uyarra and Kieron Flanagan, “Understanding the Innovation Impacts of Public Procurement,” European Planning Studies 18, no. 1 (2010): 123–43.

  58.

    Solove, among others, has discussed how privacy is traditionally viewed as an individual right but also has social value. Daniel J. Solove, “‘I’ve Got Nothing to Hide’ and Other Misunderstandings of Privacy,” San Diego Law Review 44 (2007): 760–64.

  59.

    Information Commissioner’s Office, Big data and data protection (July 28, 2014): 5–6, 33–37.

  60.

    The White House, Consumer Data Privacy in a Networked World: A Framework for Protecting Privacy and Promoting Innovation in the Global Digital Economy (Washington, D.C.: February 2012): 47.

  61.


  62.

    Of course, simply providing information can be insufficient to protect users. It may not “be information that consumers can use, presented in a way they can use it,” and so it may be ignored or misunderstood. Lawrence Lessig, “Against Transparency,” New Republic, October 9, 2009. Alternatively, a user may be informed effectively but the barriers to opting out may be so high as to render the choice illusory. Janet Vertesi, “My Experiment Opting Out of Big Data Made Me Look Like a Criminal,” Time, May 1, 2014. Still, we believe that concise, clear descriptions of privacy protecting measures and re-identification risks can aid users in many circumstances and should be included in the options considered by policymakers.

  63.

    For example, patients in clinical trials or with rare diseases might wish to have their data included for analysis, even if the risk of re-identification is high or if no privacy protecting measures are taken at all. Kerstin Forsberg, “De-identification and Informed Consent in Clinical Trials,” Linked Data for Enterprises, November 17, 2013,

  64.

    For example, the Network Advertising Initiative’s self-regulatory Code “provides disincentives to the use of PII for Interest-Based Advertising. As a result, NAI member companies generally use only information that is not PII for Interest Based Advertising and do not merge the non-PII they collect for Interest-Based Advertising with users’ PII.” “Understanding Online Advertising: Frequently Asked Questions,” Network Advertising Initiative,

  65.

    Balachander Krishnamurthy and Craig E. Wills, “On the Leakage of Personally Identifiable Information Via Online Social Networks,” in Proceedings of the 2nd ACM Workshop on Online Social Networks (New York, New York: ACM, 2009): 7-12,

  66.

    Data aggregation replaces individual data elements by statistical summaries.
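
    A minimal sketch of what aggregation means in practice (toy data, hypothetical values): the released table contains only group-level summaries, never the individual rows.

    ```python
    from collections import defaultdict
    from statistics import mean

    # Hypothetical individual-level rows: (zip_code, income in thousands).
    rows = [("02138", 50), ("02138", 70), ("02138", 90),
            ("94105", 80), ("94105", 120)]

    by_zip = defaultdict(list)
    for zip_code, income in rows:
        by_zip[zip_code].append(income)

    # Only counts and means are released; individual incomes are not.
    released = {z: {"n": len(v), "mean_income": mean(v)}
                for z, v in by_zip.items()}
    print(released)
    ```

    Note that aggregation alone is not a guarantee: summaries over very small groups can still leak individual values, which is one motivation for adding calibrated noise as differential privacy does.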

  67.

    Cynthia Dwork and Deirdre K. Mulligan, “It's not privacy, and it's not fair,” Stanford Law Review Online 66 (2013): 35.

  68.

    Julia Angwin, “The web’s new gold mine: Your secrets,” Wall Street Journal, July 30, 2010.

  69.

    Aniko Hannak et al., “Measuring Price Discrimination and Steering on E-commerce Web Sites,” in Proceedings of the 2014 Conference on Internet Measurement Conference (Vancouver: ACM, 2014): 305–318.

  70.

    Solon Barocas and Andrew D. Selbst, “Big Data's Disparate Impact,” California Law Review 104 (forthcoming); Ryan Calo, “Digital Market Manipulation,” George Washington Law Review 82 (2014): 995.

  71.

    Directive 95/46/EC, of the European Parliament and of the Council of 24 October 1995 on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data, Art. 2(a), 1995 O.J. (C 93).

  72.

    Proposal for a Regulation of the European Parliament and of the Council on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data, Art. 4(1)-(2) (January 25, 2012).

  73.

    Ohm, “Broken Promises of Privacy”: 1704, 1738–41.

  74.

    Pablo Barberá, “How Social Media Reduces Mass Political Polarization: Evidence from Germany, Spain, and the U.S.” (unpublished manuscript, October 18, 2014); Amaney Jamal et al., “Anti-Americanism and Anti-Interventionism in Arabic Twitter Discourses” (unpublished manuscript, October 20, 2014); Margaret E. Roberts, “Fear or Friction? How Censorship Slows the Spread of Information in the Digital Age” (unpublished manuscript, September 26, 2014),

    Computational social scientists can also generate their own self-reported data online. Matthew J. Salganik and Karen E.C. Levy, “Wiki surveys: Open and quantifiable social data collection” (unpublished manuscript, October 2, 2014),

  75.

    Kevin J. Boudreau, Nicola Lacetera, and Karim R. Lakhani, “Incentives and Problem Uncertainty in Innovation Contests: An Empirical Analysis,” Management Science 57, no. 5 (2014): 843–63, doi: 10.1287/mnsc.1110.1322.

  76.

    Ibid., 860–61.

  77.

    Lars Bo Jeppesen and Karim R. Lakhani, “Marginality and Problem Solving Effectiveness in Broadcast Search,” Organization Science 21, no. 5 (2010): 1016–33.

  78.

    Researchers already have developed methods for creating such synthetic data. Avrim Blum, Katrina Ligett, and Aaron Roth, “A Learning Theory Approach to Non-Interactive Database Privacy,” in Proceedings of the 40th ACM SIGACT Symposium on Theory of Computing (Victoria, British Columbia: ACM, 2008).

  79.

    “If there are privacy concerns I can imagine ensuring we can share the data in a ‘walled garden’ within which other researchers, but not the public, will be able to access the data and verify results.” Victoria Stodden, “Data access going the way of journal article access? Insist on open data,” Victoria’s Blog, December 24, 2012,

  80.

    Genomics researchers have proposed one such system. Bartha Maria Knoppers, et al., “Towards a data sharing Code of Conduct for international genomic research,” Genome Medicine 3 (2011): 46.

  81.

    HCUP, SID/SASD/SEDD Application Kit (October 15, 2014),

  82.

    Federal Trade Commission, Protecting Consumer Privacy in an Era of Rapid Change: Recommendations for Businesses and Policymakers (Washington, DC: March 2012) 21,




Corresponding author

Correspondence to Arvind Narayanan.


Copyright information

© 2016 Springer Science+Business Media Dordrecht

Cite this chapter

Narayanan, A., Huey, J., Felten, E.W. (2016). A Precautionary Approach to Big Data Privacy. In: Gutwirth, S., Leenes, R., De Hert, P. (eds) Data Protection on the Move. Law, Governance and Technology Series, vol 24. Springer, Dordrecht.
