Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention

  • Peneti Subhashini
  • B. Padmaja Rani
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 381)

Abstract

Detecting confidential documents is a key activity in data leakage prevention. Once a document is marked as confidential, leakage of data from that document can be prevented. Confidential terms are significant terms that indicate confidential content in a document. This paper presents a method for detecting confidential terms using a language model with Dirichlet prior smoothing. Clusters are generated from the documents of the training dataset (confidential and nonconfidential). Separate language models are created for the confidential and the nonconfidential documents. The nonconfidential language model of a cluster is then expanded using similar clusters, which helps to identify confidential content in nonconfidential documents. Smoothing assigns a nonzero probability to unseen words and improves the accuracy of the language model.
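
The abstract does not give formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the standard form of Dirichlet prior smoothing, p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), where p(w|C) is a background collection model. The function names, the toy token lists, the value of `mu`, the add-one background model, and the log-ratio `confidential_score` are all hypothetical choices made for illustration; clustering and the expansion of the nonconfidential model are omitted.

```python
import math
from collections import Counter


def background_model(tokens, vocab_size):
    """Collection model p(w|C); add-one smoothing keeps every word nonzero.

    Assumption: vocab_size counts the known words plus one bucket for
    words that were never seen in the collection.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return lambda word: (counts[word] + 1.0) / (total + vocab_size)


def dirichlet_lm(tokens, background, mu):
    """Unigram model for `tokens`, smoothed toward `background`.

    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)
    """
    counts = Counter(tokens)
    total = len(tokens)
    return lambda word: (counts[word] + mu * background(word)) / (total + mu)


# Toy training data (hypothetical), already tokenized.
confidential = "merger budget secret merger plan".split()
nonconfidential = "lunch menu plan weekly meeting plan".split()

collection = confidential + nonconfidential
bg = background_model(collection, vocab_size=len(set(collection)) + 1)

# mu is small here only because the toy documents are tiny.
p_conf = dirichlet_lm(confidential, bg, mu=10.0)
p_non = dirichlet_lm(nonconfidential, bg, mu=10.0)


def confidential_score(word):
    """Assumed score: log-ratio of the two smoothed probabilities.

    Positive values mean the word is more likely under the confidential
    model than under the nonconfidential one.
    """
    return math.log(p_conf(word) / p_non(word))


for term in ["merger", "plan", "lunch", "unseen"]:
    print(term, round(confidential_score(term), 3))
```

Because the background model assigns nonzero mass to every word, a term that appears in neither training set still receives a finite score instead of causing a division by zero, which is the practical effect of smoothing that the abstract describes.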

Keywords

Data leakage prevention, Confidential terms, Language model, Smoothing, Confidential score

Copyright information

© Springer India 2016

Authors and Affiliations

  1. Computer Science Engineering, Jawaharlal Nehru Technological University, Hyderabad, India
