A Semantics-Aware Classification Approach for Data Leakage Prevention

  • Conference paper
Information Security and Privacy (ACISP 2014)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8544)

Abstract

Data leakage prevention (DLP) is an emerging subject in the field of information security. It concerns tools that operate under a central policy and analyze networked environments to detect sensitive data, prevent unauthorized access to it, and block the channels associated with data leaks. This requires special data classification capabilities to distinguish sensitive data from normal data. The task requires not only prior knowledge of the sensitive data but also knowledge of data that may have evolved or is not yet known. Most current DLP systems use content-based analysis to detect sensitive data, mainly regular expressions and data fingerprinting. Although these content analysis techniques are robust in detecting known, unmodified data, they usually become ineffective if the sensitive data is previously unknown or has been heavily modified. In this paper we study the effectiveness of N-gram based statistical analysis, enhanced by the use of stem words, in classifying documents according to their topics. The results are promising, with an overall classification accuracy of 92%. We also discuss how classification deteriorates when the text is exposed to multiple spins that simulate data modification.
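
The idea of topic classification with stemmed word N-grams can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy suffix stripper stands in for a real stemmer such as Porter's, cosine similarity over raw N-gram counts is one plausible scoring choice, and the two categories and their training sentences are invented.

```python
import math
import re
from collections import Counter

# Toy suffix stripper (assumption): a stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def ngrams(text, n):
    """Return the word N-grams of the stemmed, lowercased text."""
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def profile(text, n_max=2):
    """Count all stemmed N-grams for n = 1 .. n_max."""
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(ngrams(text, n))
    return counts

def cosine(a, b):
    """Cosine similarity between two N-gram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(document, category_profiles):
    """Return the category whose profile is most similar to the document."""
    doc_profile = profile(document)
    return max(category_profiles,
               key=lambda c: cosine(doc_profile, category_profiles[c]))

# Hypothetical training data: one short text per category.
category_profiles = {
    "finance": profile("The bank reported quarterly earnings and rising interest rates"),
    "sports": profile("The team scored two goals and won the match"),
}

# The word forms differ from the training text, but the stems still overlap.
label = classify("Banks are reporting earnings while rates rise", category_profiles)
print(label)
```

Because matching happens on stems rather than exact strings, a reworded document can still land in the right category, whereas an exact fingerprint of the original text would no longer match. A fuller version would use a proper stemmer and a much larger training corpus per category.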


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Alneyadi, S., Sithirasenan, E., Muthukkumarasamy, V. (2014). A Semantics-Aware Classification Approach for Data Leakage Prevention. In: Susilo, W., Mu, Y. (eds) Information Security and Privacy. ACISP 2014. Lecture Notes in Computer Science, vol 8544. Springer, Cham. https://doi.org/10.1007/978-3-319-08344-5_27

  • DOI: https://doi.org/10.1007/978-3-319-08344-5_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08343-8

  • Online ISBN: 978-3-319-08344-5

  • eBook Packages: Computer Science, Computer Science (R0)
