SecureBERT: A Domain-Specific Language Model for Cybersecurity

Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 462)

Abstract

Natural Language Processing (NLP) has recently gained wide attention in cybersecurity, particularly in Cyber Threat Intelligence (CTI) and cyber automation. Increased connectivity and automation have revolutionized the world's economic and cultural infrastructures while also introducing the risk of cyber attacks. CTI is information that helps cybersecurity analysts make intelligent security decisions. It is often delivered in the form of natural language text, which must be transformed into a machine-readable format through an automated procedure before it can be used for automated security measures.

This paper proposes SecureBERT, a cybersecurity language model capable of capturing the connotations of cybersecurity text (e.g., CTI) and therefore well suited to automating many critical cybersecurity tasks that would otherwise rely on human expertise and time-consuming manual effort. SecureBERT has been trained on a large corpus of cybersecurity text. To make SecureBERT effective not only at retaining a general understanding of English but also when applied to text with cybersecurity implications, we developed a customized tokenizer as well as a method to alter pre-trained weights. SecureBERT is evaluated using the standard Masked Language Model (MLM) test as well as two additional standard NLP tasks. Our evaluation shows that SecureBERT outperforms existing similar models, confirming its capability for solving crucial NLP tasks in cybersecurity.
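To make the MLM evaluation concrete, the sketch below queries a SecureBERT checkpoint with a masked cybersecurity sentence through the Hugging Face transformers fill-mask pipeline. The checkpoint name ehsanaghaei/SecureBERT and the RoBERTa-style <mask> token are assumptions about the released artifact rather than details given in this abstract; a well-trained cybersecurity model should rank domain-appropriate fillers (e.g., "access") near the top.

    # Hedged sketch: assumes the publicly released "ehsanaghaei/SecureBERT"
    # checkpoint on Hugging Face and RoBERTa-style masking ("<mask>").
    # Requires: pip install transformers torch
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT")

    # Cloze-style MLM query over a cybersecurity sentence; the pipeline
    # returns the top candidate tokens with their probabilities.
    for pred in fill_mask("The attacker gained <mask> to the victim's network."):
        print(f"{pred['token_str']!r}  (score={pred['score']:.3f})")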

Keywords

  • Cyber automation
  • Cyber threat intelligence
  • Language model


Notes

  1. Sample data: https://dropbox.com/sh/jg45zvfl7iek12i/AAB7bFghED9GmkO5YxpPLIuma?dl=0.

  2. https://spacy.io/usage.

  3. https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only.

  4. https://github.com/jackaduma/SecBERT.

  5. https://github.com/kbandla/APTnotes.

  6. https://stucco.github.io/data/.

  7. https://ebiquity.umbc.edu/_file_directory_/papers/943.pdf.


Author information


Corresponding author

Correspondence to Ehsan Aghaei.


Copyright information

© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Aghaei, E., Niu, X., Shadid, W., Al-Shaer, E. (2023). SecureBERT: A Domain-Specific Language Model for Cybersecurity. In: Li, F., Liang, K., Lin, Z., Katsikas, S.K. (eds) Security and Privacy in Communication Networks. SecureComm 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 462. Springer, Cham. https://doi.org/10.1007/978-3-031-25538-0_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-25538-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25537-3

  • Online ISBN: 978-3-031-25538-0

  • eBook Packages: Computer Science, Computer Science (R0)