Provable De-anonymization of Large Datasets with Sparse Dimensions

  • Anupam Datta
  • Divya Sharma
  • Arunesh Sinha
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7215)


There is a significant body of empirical work on statistical de-anonymization attacks against databases containing micro-data about individuals, e.g., their preferences, movie ratings, or transaction data. Our goal is to analytically explain why such attacks work. Specifically, we analyze a variant of the Narayanan-Shmatikov algorithm that was used to effectively de-anonymize the Netflix database of movie ratings. We prove theorems characterizing mathematical properties of the database and the auxiliary information available to the adversary that enable two classes of privacy attacks. In the first attack, the adversary successfully identifies the individual about whom she possesses auxiliary information (an isolation attack). In the second attack, the adversary learns additional information about the individual, although she may not be able to uniquely identify him (an information amplification attack). We demonstrate the applicability of the analytical results by empirically verifying that the mathematical properties assumed of the database are actually true for a significant fraction of the records in the Netflix movie ratings database, which contains ratings from about 500,000 users.
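The core of the Narayanan-Shmatikov approach described above is a similarity score between the adversary's auxiliary information and each database record, plus an "eccentricity" test: the isolation attack succeeds only when the best-scoring record stands out from the runner-up. The sketch below is a simplified illustration, not the paper's algorithm: the function names and toy data are invented, the score weights all items equally (the full algorithm weights rare items more heavily), and the eccentricity margin is a fixed constant rather than being scaled by the standard deviation of the scores.

```python
def score(aux, record):
    """Fraction of auxiliary (item, rating) pairs that approximately
    match the candidate record. Simplified: all items weighted equally;
    the real algorithm gives rare items higher weight."""
    if not aux:
        return 0.0
    matches = sum(1 for item, rating in aux.items()
                  if item in record and abs(record[item] - rating) <= 1)
    return matches / len(aux)

def deanonymize(aux, dataset, eccentricity=0.5):
    """Return the id of the best-matching record only if it 'stands out':
    its score must exceed the runner-up's by the eccentricity margin.
    Otherwise return None (the isolation attack fails)."""
    ranked = sorted(((score(aux, rec), rid) for rid, rec in dataset.items()),
                    reverse=True)
    (best, best_id), (second, _) = ranked[0], ranked[1]
    if best - second >= eccentricity:
        return best_id
    return None

# Toy dataset: record id -> {movie: rating}. Hypothetical data.
db = {
    "u1": {"A": 5, "B": 3, "C": 1},
    "u2": {"A": 2, "D": 4},
    "u3": {"B": 3, "E": 5},
}
aux = {"A": 5, "C": 1}       # adversary's auxiliary information about u1
print(deanonymize(aux, db))  # -> u1 (u1 matches perfectly, others do not)
```

Sparsity is what makes the margin test pass: when few users share any given combination of rated items, the true record's score towers over all others. With auxiliary information shared by several records (e.g. `{"B": 3}` above, matched by both u1 and u3), the margin test fails and no identification is made.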


Keywords: Privacy · Database · De-anonymization


References

  1. PACER – Public Access to Court Electronic Records (last accessed December 16, 2011)
  2. Barbaro, M., Zeller, T.: A Face Is Exposed for AOL Searcher No. 4417749. New York Times (August 9, 2006)
  3. Boreale, M., Pampaloni, F., Paolini, M.: Quantitative Information Flow, with a View. In: Atluri, V., Diaz, C. (eds.) ESORICS 2011. LNCS, vol. 6879, pp. 588–606. Springer, Heidelberg (2011)
  4. Dalenius, T.: Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, 429–444 (1977)
  5. Dwork, C.: Differential Privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
  6. Dwork, C.: Differential Privacy: A Survey of Results. In: Agrawal, M., Du, D.-Z., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008)
  7. Frankowski, D., Cosley, D., Sen, S., Terveen, L., Riedl, J.: You Are What You Say: Privacy Risks of Public Mentions. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 565–572. ACM, New York (2006)
  8. Hafner, K.: And if You Liked the Movie, a Netflix Contest May Reward You Handsomely. New York Times (October 2, 2006)
  9. Li, N., Li, T., Venkatasubramanian, S.: t-Closeness: Privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering, ICDE 2007, pp. 106–115 (April 2007)
  10. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-Diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1 (March 2007)
  11. Narayanan, A., Shmatikov, V.: Robust De-anonymization of Large Sparse Datasets. In: Proceedings of the 2008 IEEE Symposium on Security and Privacy, pp. 111–125. IEEE Computer Society, Washington, DC (2008)
  12. Narayanan, A., Shmatikov, V.: Myths and fallacies of personally identifiable information. Communications of the ACM 53, 24–26 (2010)
  13. Samarati, P.: Protecting respondents' identities in microdata release. IEEE Trans. on Knowl. and Data Eng. 13, 1010–1027 (2001)
  14. Schwarz, H.A.: Über ein Flächen kleinsten Flächeninhalts betreffendes Problem der Variationsrechnung. Acta Societatis Scientiarum Fennicae XV, 318 (1888)
  15. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems 10, 571–588 (2002)
  16. Sweeney, L.: k-Anonymity: A Model for Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 557–570 (2002)
  17. Xiao, X., Tao, Y.: m-Invariance: towards privacy preserving re-publication of dynamic datasets. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD 2007, pp. 689–700. ACM, New York (2007)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Anupam Datta (1)
  • Divya Sharma (1)
  • Arunesh Sinha (1)

  1. Carnegie Mellon University, USA
