Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary

Kashihara, Kazuaki; Sandhu, Harshdeep Singh; Shakarian, Jana

doi:10.1007/978-3-030-82199-9_11

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 296)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

To help security analysts obtain cybersecurity-related information, methods tailored to the security domain are needed. Since labeled text data is scarce and expensive, Named Entity Recognition (NER) is used to detect the relevant domain entities in raw text. Training a new NER model for cybersecurity traditionally requires a training corpus annotated with cybersecurity entities. Our previous work proposed a Human-Machine Interaction method for semi-automatic labeling and corpus generation for cybersecurity entities. That method requires only text data and a small dictionary of keyword-category pairs. However, the semantic similarity measurement it uses to resolve ambiguous keywords requires specific category names, even for categories unrelated to cybersecurity. In this work, we introduce another semantic similarity measurement based on a text category classifier, which does not require specific names for non-cybersecurity categories. We compare the two semantic similarity measurements and find that the new one performs better. The experimental evaluation shows that models trained on data annotated with the small dictionary perform almost as well as models trained on fully annotated data.
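As a rough illustration of what annotating text with a small dictionary of keyword-category pairs can look like, the following Python sketch tags tokens with BIO labels by longest-match lookup. It is not the authors' implementation: the dictionary entries, the category names, and the `annotate` function are hypothetical, and the sketch leaves out the semantic similarity step that the abstract describes for resolving ambiguous keywords.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword dictionary: keyword -> entity category.
# The entries and category names are illustrative, not taken from the paper.
KEYWORDS: Dict[str, str] = {
    "windows": "SOFTWARE",
    "buffer overflow": "VULNERABILITY",
    "microsoft": "VENDOR",
}


def annotate(tokens: List[str], keywords: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag tokens with BIO labels by longest-match lookup in the dictionary."""
    # Index each keyword by its lowercase token sequence.
    lookup = {tuple(k.lower().split()): cat for k, cat in keywords.items()}
    max_len = max((len(words) for words in lookup), default=0)

    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word keywords win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cat = lookup.get(tuple(lowered[i:i + n]))
            if cat is not None:
                labels[i] = f"B-{cat}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{cat}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return list(zip(tokens, labels))


if __name__ == "__main__":
    sentence = "A buffer overflow in Microsoft Windows".split()
    for token, label in annotate(sentence, KEYWORDS):
        print(f"{token}\t{label}")
```

A pure lookup like this labels every occurrence of "windows" as software, including uses unrelated to cybersecurity. That ambiguity is the problem the semantic similarity measurement in the abstract is meant to address.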

Author information

Authors and Affiliations

Arizona State University, Tempe, AZ, 85281, USA
Kazuaki Kashihara
Cyber Reconnaissance, Inc., Tempe, AZ, 85281, USA
Harshdeep Singh Sandhu & Jana Shakarian

Corresponding author

Correspondence to Kazuaki Kashihara.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Kashihara, K., Sandhu, H.S., Shakarian, J. (2022). Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 296. Springer, Cham. https://doi.org/10.1007/978-3-030-82199-9_11

DOI: https://doi.org/10.1007/978-3-030-82199-9_11
Published: 07 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82198-2
Online ISBN: 978-3-030-82199-9
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
