Processing Text for Privacy: An Information Flow Perspective

  • Natasha Fernandes
  • Mark Dras
  • Annabelle McIver
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10951)


The problem of text document obfuscation is to provide an automated mechanism that makes the content of a text document accessible without revealing the identity of its writer. This is more challenging than it seems, because an adversary equipped with powerful machine learning mechanisms can identify authorship with good accuracy even where, for example, the name of the author has been redacted. Current obfuscation methods are ad hoc and have been shown to provide weak protection against such adversaries. Differential privacy, which provides strong privacy guarantees in some domains, has been thought not to be applicable to text processing.

In this paper we review obfuscation as a quantitative information flow problem and explain how generalised differential privacy can be applied to it to provide strong anonymisation guarantees in a standard model for text processing.
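
To make the notion of a generalised (metric-based) differential privacy guarantee concrete, here is a minimal sketch. It is not the mechanism of the paper: the toy vocabulary, the made-up 2-D embeddings, the Euclidean distance and all function names are assumptions chosen purely for illustration. A mechanism K is taken to satisfy ε·d-privacy when K(x)(z) ≤ exp(ε·d(x, x′))·K(x′)(z) for all inputs x, x′ and outputs z, so inputs that are close under the metric d are hard to tell apart. The sampler below is in the style of the exponential mechanism, and `satisfies_d_privacy` checks the inequality numerically.

```python
import math
import random

# Hypothetical toy vocabulary with made-up 2-D "embeddings" (illustration only).
EMBEDDINGS = {
    "good": (0.9, 0.1),
    "great": (1.0, 0.2),
    "bad": (-0.8, 0.0),
    "awful": (-1.0, -0.1),
}


def distance(u, v):
    """Euclidean distance between the embedding vectors of two words."""
    return math.dist(EMBEDDINGS[u], EMBEDDINGS[v])


def output_distribution(word, epsilon):
    """Probability of releasing each vocabulary word in place of `word`.

    Weights decay as exp(-epsilon * d / 2); together with the triangle
    inequality this gives an epsilon*d-privacy bound for a single word.
    """
    weights = {w: math.exp(-epsilon * distance(word, w) / 2) for w in EMBEDDINGS}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}


def obfuscate(words, epsilon, rng=random):
    """Replace each word independently by a sample from its output distribution."""
    result = []
    for word in words:
        dist = output_distribution(word, epsilon)
        result.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return result


def satisfies_d_privacy(epsilon, tol=1e-12):
    """Check K(x)(z) <= exp(epsilon * d(x, x')) * K(x')(z) for all x, x', z."""
    for x in EMBEDDINGS:
        for x2 in EMBEDDINGS:
            bound = math.exp(epsilon * distance(x, x2))
            px = output_distribution(x, epsilon)
            px2 = output_distribution(x2, epsilon)
            if any(px[z] > bound * px2[z] + tol for z in EMBEDDINGS):
                return False
    return True
```

In this sketch a smaller `epsilon` flattens each output distribution, so released words reveal less about the original at the cost of fidelity, while a larger `epsilon` keeps words closer to the input.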


Refinement; Information flow; Privacy; Probabilistic semantics; Text processing; Author anonymity; Author obfuscation



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Natasha Fernandes (1)
  • Mark Dras (1)
  • Annabelle McIver (1, corresponding author)
  1. Department of Computing, Macquarie University, North Ryde, Australia
