Knowledge and Information Systems, Volume 45, Issue 3, pp 705–730

The Mask of ZoRRo: preventing information leakage from documents

  • Prasad M. Deshpande
  • Salil Joshi
  • Prateek Dewan
  • Karin Murthy
  • Mukesh Mohania
  • Sheshnarayan Agrawal
Regular Paper


In today’s enterprise world, information about business entities, such as a customer’s or patient’s name, address, and social security number, is often present in both relational databases and content repositories. In databases, such information is generally well protected by well-defined, fine-grained access control. However, current document retrieval systems do not provide user-specific, fine-grained redaction of documents to prevent leakage of information about business entities. This leaves companies with only two choices: either provide complete access to a document, risking information leakage, or prohibit access to the document altogether, accepting a potentially negative impact on business processes. In this paper, we present ZoRRo, an add-on for document retrieval systems that dynamically redacts sensitive information about business entities referenced in a document, based on the access control defined for those entities. ZoRRo exploits the fine-grained, label-based access-control mechanism of database systems to identify and redact sensitive information in unstructured text according to the access privileges of the user viewing it. To make on-the-fly redaction feasible, ZoRRo exploits the concept of \(k\)-safety in combination with Lucene-based indexing and scoring. We demonstrate the efficiency and effectiveness of ZoRRo through a detailed experimental study.
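The core idea of user-specific redaction can be illustrated with a minimal sketch. This is not ZoRRo's implementation: the `Attribute` class, the integer `label` levels, the `redact` function, and exact-string matching are all assumptions made for illustration, and the sketch does not model \(k\)-safety or Lucene-based indexing. It only shows how the same document might be rendered differently depending on the viewer's access level.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """An entity attribute value with a hypothetical sensitivity label."""
    value: str  # attribute text as it might be stored in the database
    label: int  # assumed numeric level; higher means more restricted


def redact(text, attributes, user_level, mask="[REDACTED]"):
    """Mask every attribute value whose label exceeds the user's level."""
    hidden = [a.value for a in attributes if a.label > user_level]
    if not hidden:
        return text
    # Try longer values first so overlapping values are masked as a whole.
    hidden.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in hidden), re.IGNORECASE)
    return pattern.sub(mask, text)


if __name__ == "__main__":
    attrs = [Attribute("John Smith", 1), Attribute("123-45-6789", 3)]
    doc = "Patient John Smith, SSN 123-45-6789, was admitted."
    for level in (0, 1, 3):
        print(level, redact(doc, attrs, level))
```

In this sketch, a user at level 1 sees the name but not the social security number, while a user at level 3 sees the full document. A real system would have to identify entity mentions that do not match the stored values exactly, which is where indexing and scoring would come in.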


Keywords: Sanitization; Redaction; Security and protection



Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Prasad M. Deshpande (1)
  • Salil Joshi (1)
  • Prateek Dewan (3)
  • Karin Murthy (1)
  • Mukesh Mohania (2)
  • Sheshnarayan Agrawal (4)

  1. IBM Research, Bangalore, India
  2. IBM Research, Delhi, India
  3. Indraprastha Institute of Information Technology, Delhi, India
  4. IBM, Bangalore, India
