Applying text mining methods for data loss prevention

Programming and Computer Software

Abstract

Currently, the greatest information security risks for organizations come from internal rather than external threats. Data loss prevention (DLP) systems are used to minimize the risks associated with internal threats. The main function of a DLP system is to prevent leaks of confidential data; however, DLP systems are currently compared by their ability to analyze captured information and by how convenient they make retrospective investigation of information security incidents. This paper presents a new approach to the retrospective analysis of a user’s text information. The approach performs topic analysis of the text content the user processed in the past and predicts the user’s further behavior with content. User text content can fall into different categories, including confidential ones. Topic analysis of user text content determines the main topics and their weights for given past time intervals. From deviations of the user’s actual operations with content from the forecast, one can identify the time intervals in which work with documents of a particular category differed from normal (historical) work, as well as those in which the user worked with documents of unusual categories. The proposed approach was verified experimentally on real corporate email correspondence drawn from the Enron data set.
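
The deviation-from-forecast idea can be illustrated with a minimal sketch. It assumes that topic weights for each time interval have already been computed, and it substitutes a naive moving-average forecast for the forecasting model, which the abstract does not specify. The function name, the `window` and `threshold` parameters, and the sample weights are all hypothetical.

```python
from statistics import mean


def flag_unusual_intervals(weights, window=3, threshold=0.3):
    """Return indices of intervals whose topic weights deviate from a naive forecast.

    weights: list of per-interval topic-weight vectors (lists of equal length).
    The forecast for interval t is the mean of the previous `window` intervals;
    this is an illustrative stand-in, not the forecasting model from the paper.
    """
    flagged = []
    for t in range(window, len(weights)):
        history = weights[t - window:t]
        # Per-topic forecast: average weight over the history window.
        forecast = [mean(column) for column in zip(*history)]
        # Largest absolute gap between observed and forecast topic weight.
        deviation = max(abs(o - f) for o, f in zip(weights[t], forecast))
        if deviation > threshold:
            flagged.append(t)
    return flagged


# Hypothetical weekly weights for three topics:
# [contracts, scheduling, confidential]
weekly = [
    [0.6, 0.4, 0.0],
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.5, 0.5, 0.0],
    [0.1, 0.2, 0.7],  # sudden shift toward the confidential topic
]

print(flag_unusual_intervals(weekly))  # → [4]
```

Only the last interval is flagged: the weight of the third topic jumps from a forecast of 0.0 to 0.7, which exceeds the threshold, while the earlier intervals stay close to their history.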



Author information

Correspondence to I. V. Mashechkin.

Additional information

Original Russian Text © I.V. Mashechkin, M.I. Petrovskiy, D.S. Popov, D.V. Tsarev, 2015, published in Programmirovanie, 2015, Vol. 41, No. 1.


Cite this article

Mashechkin, I.V., Petrovskiy, M.I., Popov, D.S. et al. Applying text mining methods for data loss prevention. Program Comput Soft 41, 23–30 (2015). https://doi.org/10.1134/S0361768815010041


  • DOI: https://doi.org/10.1134/S0361768815010041
