Data Loss Prevention Using Document Semantic Signature

  • Hanan Alhindi
  • Issa Traore
  • Isaac Woungang
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 27)


Data protection and insider threat detection and prevention are important measures that organizations should take to strengthen their internal security. Data loss prevention (DLP) is an emerging mechanism that organizations use to detect and block unauthorized data transfers. Existing DLP approaches, however, face several practical challenges that limit their effectiveness. In this chapter, we present a new DLP approach that addresses many of these challenges by extracting and analyzing the semantics of document content. We introduce the notion of a document semantic signature as a summarized representation of a document's semantics. Experiments on a public dataset show that the semantic signature can be used to detect data leaks, yielding encouraging results: an average false positive rate (FPR) of 6.71% and an average detection rate (DR) of 84.47%.
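For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how DR and FPR are conventionally computed from confusion-matrix counts. The abstract does not state the exact formulas used in the chapter, so the standard definitions (DR = TP / (TP + FN), FPR = FP / (FP + TN)) are an assumption here, and the function names and sample counts are hypothetical.

```python
def detection_rate(tp: int, fn: int) -> float:
    """DR: fraction of actual leaks that were flagged (assumed standard definition)."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR: fraction of benign transfers wrongly flagged (assumed standard definition)."""
    total = fp + tn
    return fp / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not taken from the chapter's experiments.
    tp, fn, fp, tn = 80, 20, 5, 95
    print(f"DR  = {detection_rate(tp, fn):.2%}")
    print(f"FPR = {false_positive_rate(fp, tn):.2%}")
```

Under these definitions, a high DR together with a low FPR means that most leaks are caught while few legitimate documents are blocked.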


Keywords: Data loss prevention, Document semantic, Document semantic signature, Insider threat detection, Ontology



  • Boyer–Moore algorithm
  • Component-based software development
  • Concept vector file
  • Concept map
  • Cosine similarity
  • Concept tree
  • Document concept tree
  • Ontology description logics
  • Data loss prevention
  • Detection rate
  • Document semantic signature
  • False discovery rate
  • Financial Industry Business Ontology
  • False negative rate
  • False positive rate
  • Inverse document frequency
  • Intrusion detection systems
  • Knowledge base
  • Kernel density estimation
  • Network-based intrusion detection system
  • National Threat Assessment Center
  • Web Ontology Language
  • Resource description framework
  • Relevancy nodes-based concept vector model
  • Sensitive information dissemination detection
  • Support vector machines
  • Smith–Waterman algorithm
  • Term frequency
  • Term frequency inverse document frequency



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Electrical and Computer Engineering Department, University of Victoria, Victoria, Canada
  2. Department of Computer Science, Ryerson University, Toronto, Canada
