The Mask of ZoRRo: preventing information leakage from documents

Deshpande, Prasad M.; Joshi, Salil; Dewan, Prateek; Murthy, Karin; Mohania, Mukesh; Agrawal, Sheshnarayan

doi:10.1007/s10115-014-0811-6

Regular Paper
Published: 24 December 2014

Volume 45, pages 705–730, (2015)
Knowledge and Information Systems

Prasad M. Deshpande¹,
Salil Joshi¹,
Prateek Dewan³,
Karin Murthy¹,
Mukesh Mohania² &
…
Sheshnarayan Agrawal⁴

474 Accesses
5 Citations

Abstract

In today’s enterprise world, information about business entities, such as a customer’s or patient’s name, address, and social security number, is often present in both relational databases and content repositories. In databases, such information is generally well protected by well-defined, fine-grained access control. However, current document retrieval systems do not provide user-specific, fine-grained redaction of documents to prevent leakage of information about business entities. This leaves companies with only two choices: either provide complete access to a document, risking potential information leakage, or prohibit access to the document altogether, accepting a potentially negative impact on business processes. In this paper, we present ZoRRo, an add-on for document retrieval systems that dynamically redacts sensitive information about business entities referenced in a document, based on the access control defined for those entities. ZoRRo exploits database systems’ fine-grained, label-based access-control mechanism to identify and redact sensitive information in unstructured text according to the access privileges of the user viewing it. To make on-the-fly redaction feasible, ZoRRo exploits the concept of \(k\)-safety in combination with Lucene-based indexing and scoring. We demonstrate the efficiency and effectiveness of ZoRRo through a detailed experimental study.
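
The idea of user-specific, label-based redaction can be illustrated with a minimal sketch. This is not ZoRRo's implementation: the `SensitiveValue` type, the integer security levels, and the `redact` function are hypothetical names introduced here for illustration only. The sketch uses plain string matching and leaves out the \(k\)-safety and Lucene-based components mentioned in the abstract.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SensitiveValue:
    """An attribute value of a business entity (hypothetical structure)."""
    text: str    # surface form as it might appear in a document
    label: int   # assumed security level a user needs in order to read it


def redact(document: str, values: Sequence[SensitiveValue],
           user_level: int, mask: str = "[REDACTED]") -> str:
    """Mask every occurrence of a value whose label exceeds the user's level."""
    hidden = [v.text for v in values if v.label > user_level]
    if not hidden:
        return document
    # Try longer values first so a short value cannot split a longer match.
    hidden.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in hidden), re.IGNORECASE)
    return pattern.sub(mask, document)


if __name__ == "__main__":
    values = [SensitiveValue("Jane Doe", 1), SensitiveValue("123-45-6789", 3)]
    doc = "Patient Jane Doe, SSN 123-45-6789, was admitted."
    # A user at level 1 may see the name but not the social security number.
    print(redact(doc, values, user_level=1))
```

The same document yields different views for different users: a user at level 3 would see it unchanged, while a user at level 0 would see both the name and the number masked. In a real deployment the labels would presumably come from the database's access-control layer rather than from a hard-coded list.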

Notes

Name withheld due to legal and privacy reasons.
http://www.tpc.org/tpcds/.
The original as well as the redacted documents can be downloaded from https://docs.google.com/file/d/0B_Kz16TTj09IMzVLRGdFMkJwZFU/edit?usp=sharing.

Author information

Authors and Affiliations

IBM Research, Bangalore, India
Prasad M. Deshpande, Salil Joshi & Karin Murthy
IBM Research, Delhi, India
Mukesh Mohania
Indraprastha Institute of Information Technology, Delhi, India
Prateek Dewan
IBM, Bangalore, India
Sheshnarayan Agrawal

Corresponding author

Correspondence to Salil Joshi.

About this article

Cite this article

Deshpande, P.M., Joshi, S., Dewan, P. et al. The Mask of ZoRRo: preventing information leakage from documents. Knowl Inf Syst 45, 705–730 (2015). https://doi.org/10.1007/s10115-014-0811-6

Received: 07 May 2014
Revised: 19 November 2014
Accepted: 10 December 2014
Published: 24 December 2014
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10115-014-0811-6
