The Mask of ZoRRo: preventing information leakage from documents

  • Regular Paper
Knowledge and Information Systems

Abstract

In today’s enterprise world, information about business entities, such as a customer’s or patient’s name, address, and social security number, is often present in both relational databases and content repositories. In databases, this information is generally well protected by well-defined, fine-grained access control. Current document retrieval systems, however, do not provide user-specific, fine-grained redaction to prevent information about business entities from leaking out of documents. This leaves companies with only two choices: either provide complete access to a document, risking information leakage, or prohibit access to the document altogether, accepting a potentially negative impact on business processes. In this paper, we present ZoRRo, an add-on for document retrieval systems that dynamically redacts sensitive information about the business entities referenced in a document, based on the access control defined for those entities. ZoRRo exploits database systems’ fine-grained, label-based access-control mechanism to identify and redact sensitive information in unstructured text according to the access privileges of the user viewing it. To make on-the-fly redaction feasible, ZoRRo exploits the concept of \(k\)-safety in combination with Lucene-based indexing and scoring. We demonstrate the efficiency and effectiveness of ZoRRo through a detailed experimental study.
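
The core idea of the abstract, redacting entity values in text according to the labels a user holds, can be illustrated with a minimal sketch. This is not ZoRRo's implementation: the `Attribute` class, the `redact` function, and the label names are hypothetical, the entity attributes are assumed to be already linked to the document, and \(k\)-safety and Lucene-based indexing are not modeled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One attribute value of a business entity (hypothetical structure)."""
    value: str  # text as it may appear in a document
    label: str  # hypothetical security label guarding this value


def redact(text, attributes, user_labels, mask="[REDACTED]"):
    """Mask every attribute value whose label the user does not hold."""
    hidden = [a.value for a in attributes if a.label not in user_labels]
    if not hidden:
        return text
    # Try longer values first so a shorter value cannot pre-empt a longer one.
    hidden.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in hidden), re.IGNORECASE)
    return pattern.sub(mask, text)


# Illustrative data only; labels and values are made up for this sketch.
customer = [
    Attribute("Jane Doe", "PUBLIC"),
    Attribute("123-45-6789", "SSN_READ"),
    Attribute("12 Elm Street", "ADDRESS_READ"),
]
doc = "Jane Doe (SSN 123-45-6789) lives at 12 Elm Street."

clerk_view = redact(doc, customer, {"PUBLIC", "ADDRESS_READ"})
auditor_view = redact(doc, customer, {"PUBLIC", "SSN_READ", "ADDRESS_READ"})
```

Here the same document yields different views: the clerk, lacking the assumed `SSN_READ` label, sees the social security number masked, while the auditor sees the full text. Exact string matching is a simplification chosen for brevity.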

Notes

  1. Name withheld due to legal and privacy reasons.

  2. http://www.tpc.org/tpcds/.

  3. The original as well as the redacted documents can be downloaded from https://docs.google.com/file/d/0B_Kz16TTj09IMzVLRGdFMkJwZFU/edit?usp=sharing.

Author information

Corresponding author

Correspondence to Salil Joshi.

About this article

Cite this article

Deshpande, P.M., Joshi, S., Dewan, P. et al. The Mask of ZoRRo: preventing information leakage from documents. Knowl Inf Syst 45, 705–730 (2015). https://doi.org/10.1007/s10115-014-0811-6

  • DOI: https://doi.org/10.1007/s10115-014-0811-6
