Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary

  • Conference paper
Intelligent Systems and Applications (IntelliSys 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 296)

Abstract

To assist security analysts in obtaining information pertaining to cybersecurity, methods tailored to the security domain are needed. Because labeled text data is scarce and expensive, Named Entity Recognition (NER) is used to detect relevant domain entities in raw text. Training a new NER model for cybersecurity traditionally requires a training corpus annotated with cybersecurity entities. Our previous work proposed a Human-Machine Interaction method for semi-automatic labeling and corpus generation for cybersecurity entities. This method requires text data and a small dictionary that pairs keywords with their categories. However, the semantic similarity measurement that the method uses to resolve ambiguous keywords requires specific category names, even for categories unrelated to cybersecurity. In this work, we introduce another semantic similarity measurement, based on a text category classifier, that does not require naming the specific non-cybersecurity categories. We compare the two semantic similarity measurements and find that the new one performs better. The experimental evaluation shows that models trained on data annotated with the small dictionary by our method perform almost as well as models trained on fully annotated data.
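The dictionary-based annotation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary entries, the `AMBIGUOUS` set, the `annotate` function, and the `toy_classifier` stand-in for a trained text category classifier are all hypothetical names chosen for illustration, and the sketch handles single-token keywords only.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical keyword dictionary: keyword -> category (illustrative entries only).
KEYWORDS: Dict[str, str] = {
    "ransomware": "MALWARE",
    "trojan": "MALWARE",
    "windows": "SOFTWARE",
}

# Keywords assumed (for this sketch) to be ambiguous outside a security context.
AMBIGUOUS: Set[str] = {"trojan", "windows"}


def annotate(
    tokens: List[str],
    is_security_text: Callable[[List[str]], bool],
) -> List[Tuple[str, str]]:
    """Tag each token as B-<category> or O using dictionary lookup.

    Ambiguous keywords are labeled only when the classifier judges the
    whole text to be security related.
    """
    in_domain = is_security_text(tokens)
    tagged: List[Tuple[str, str]] = []
    for token in tokens:
        key = token.lower()
        category = KEYWORDS.get(key)
        if category is None or (key in AMBIGUOUS and not in_domain):
            tagged.append((token, "O"))
        else:
            tagged.append((token, "B-" + category))
    return tagged


def toy_classifier(tokens: List[str]) -> bool:
    """Stand-in for a trained text category classifier (an assumption)."""
    cues = {"malware", "exploit", "ransomware", "vulnerability", "infected"}
    return any(t.lower() in cues for t in tokens)


if __name__ == "__main__":
    print(annotate("The trojan infected Windows hosts".split(), toy_classifier))
    print(annotate("She cleaned the windows".split(), toy_classifier))
```

In the first sentence the cue word "infected" marks the text as security related, so "trojan" and "Windows" receive entity tags; in the second sentence no cue is present and "windows" stays unlabeled. A real system would replace `toy_classifier` with a learned model and would also need multi-token keyword matching.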



Corresponding author

Correspondence to Kazuaki Kashihara.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kashihara, K., Sandhu, H.S., Shakarian, J. (2022). Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 296. Springer, Cham. https://doi.org/10.1007/978-3-030-82199-9_11
