Abstract
In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing research-based cyber threat intelligence systems and production systems often utilize keyword filtering as a method to obtain training data for a classification model or as a classifier in itself. This method is known to have concerns with false-positives that affect data quality and thus can produce downstream issues for security analysts that utilize these types of systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec generated associated words list which assists in the classification of ambiguous posts to reduce false-positives while still retrieving large scope data. We then use k-means clustering on positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating the effects of using segmentation based on data characteristics to achieve better classification. Together these methods are able to create a higher quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering methods.
This is a preview of subscription content, access via your institution.



References
(2013) Selenium documentation¶. https://www.selenium.dev/selenium/docs/api/py/api.html
(2018) ”cisco 2018 annual cybersecurity report”. Technical report, ”Cisco Systems”, https://www.cisco.com/c/dam/m/hu_hu/campaigns/security-hub/pdf/acr-2018.pdf
(2019) ”the impact of security alert overload”. Technical report, ”CriticalStart”, https://www.criticalstart.com/wp-content/uploads/CS_MDR_Survey_Report.pdf
(2020) ”the economics of security operations centers: What is the true cost for effective results?”. Technical report, ”Ponemon Institute LLC sponsored by Respond Software”
Alves F, Bettini A, Ferreira PM, Bessani A (2019) Processing tweets for cybersecurity threat awareness. arXiv preprint arXiv:190402072
Baumgartner J, Zannettou S, Keegan B, Squire M, Blackburn J (2020) The pushshift reddit dataset. arXiv preprint arXiv:200108435
Behzadan V, Aguirre C, Bose A, Hsu W (2018) Corpus and deep learning classifier for collection of cyber threat indicators in twitter stream. In: 2018 IEEE international conference on big data (Big Data), IEEE, pp 5002–5007
Botes F, Leenen L, De La Harpe R (2017) Ant colony induced decision trees for intrusion detection. In: 16th European conference on cyber warfare and security, ACPI, pp 53–62
Caragea C, Silvescu A, Tapia AH (2016) Identifying informative messages in disaster events using convolutional neural networks. In: International conference on information systems for crisis response and management, pp 137–147
Concone F, De Paola A, Re GL, Morana M (2017) Twitter analysis for real-time malware discovery. In: 2017 AEIT international annual conference, IEEE, pp 1–6
Dionísio N, Alves F, Ferreira PM, Bessani A (2019) Cyberthreat detection from twitter using deep neural networks. arXiv preprint arXiv:190401127
ESET (2020) Welivesecurity. https://www.welivesecurity.com/
Exchange S (2019) The stack exchange data explorer. Online, http://datastackexchangecom/ Accessed September
Fink GA, North CL, Endert A, Rose S (2009) Visualizing cyber security: Usable workspaces. In: 2009 6th international workshop on visualization for cyber security, IEEE, pp 45–56
Hariharan A, Gupta A, Pal T (2020) Camlpad: Cybersecurity autonomous machine learning platform for anomaly detection. In: Future of information and communication conference, Springer, pp 705–720
Horawalavithana S, Bhattacharjee A, Liu R, Choudhury N, O Hall L, Iamnitchi A (2019) Mentions of security vulnerabilities on reddit, twitter and github. In: IEEE/WIC/ACM international conference on web intelligence, pp 200–207
Kaggle (2019) All the news. Online, https://wwwkagglecom/snapcrack/all-the-news Accessed September
Khatua A, Khatua A, Cambria E (2019) A tale of two epidemics: Contextual word2vec for classifying twitter streams during outbreaks. Inf Process Manag 56(1):247–257
Khurana N, Mittal S, Piplai A, Joshi A (2019) Preventing poisoning attacks on ai based threat intelligence systems. In: 2019 IEEE 29th international workshop on machine learning for signal processing (MLSP), IEEE, pp 1–6
Le BD, Wang G, Nasim M, Babar MA (2019) Gathering cyber threat intelligence from twitter using novelty classification. In: 2019 International conference on cyberworlds (CW), IEEE, pp 316–323
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning, pp 1188–1196
Le Sceller Q, Karbab EB, Debbabi M, Iqbal F (2017) Sonar: Automatic detection of cyber security events over the twitter stream. In: Proceedings of the 12th international conference on availability, Reliability and Security, ACM, p 23
Lee KC, Hsieh CH, Wei LJ, Mao CH, Dai JH, Kuang YT (2017) Sec-buzzer: cyber security emerging topic mining with open threat intelligence retrieval and timeline event annotation. Soft Comput 21(11):2883–2896
Liu Y, Yao X (1999) Ensemble learning via negative correlation. Neural Netw 12(10):1399–1404
Mendsaikhan O, Hasegawa H, Yamaguchi Y, Shimada H (2019) Identification of cybersecurity specific content using the doc2vec language model. In: 2019 IEEE 43rd annual computer software and applications conference (COMPSAC), vol 1, pp 396–401
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 3111–3119. PMID: 903
Miller ST, Busby-Earle C (2017) Multi-perspective machine learning a classifier ensemble method for intrusion detection. In: Proceedings of the 2017 international conference on machine learning and soft computing, pp 7–12
Mittal S, Das PK, Mulwad V, Joshi A, Finin T (2016) Cybertwitter: Using twitter to generate alerts for cybersecurity threats and vulnerabilities. In: Proceedings of the 2016 IEEE/ACM international conference on advances in social networks analysis and Mining, IEEE Press, pp 860–867
Moustafa N, Slay J (2015) Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), IEEE, pp 1–6
News TH (2019) Cybersecurity news and analysis. https://thehackernews.com/
Nunes E, Diab A, Gunn A, Marin E, Mishra V, Paliath V, Robertson J, Shakarian J, Thart A, Shakarian P (2016) Darknet and deepnet mining for proactive cybersecurity threat intelligence. In: 2016 IEEE conference on intelligence and security informatics (ISI), IEEE, pp 7–12
Palshikar GK, Apte M, Pandita D (2017) Weakly supervised classification of tweets for disaster management. In: SMERP@ ECIR, pp 4–13
Rao A, Spasojevic N (2016) Actionable and political text classification using word embeddings and lstm. arXiv preprint arXiv:160702501
Rehurek R, Sojka P (2011) Gensim—-statistical semantics in python. statistical semantics; gensim; Python; LDA; SVD
Rodriguez A, Okamura K (2020) Cybersecurity text data classification and optimization for cti systems. In: Workshops of the international conference on advanced information networking and applications, Springer, pp 410–419
Roesslein J (2009) tweepy documentation. Online] http://tweepy.readthedocs.io/en/v3.5
Samtani S, Chinn R, Chen H, Nunamaker JF Jr (2017) Exploring emerging hacker assets and key hackers for proactive cyber threat intelligence. J Manag Inf Syst 34(4):1023–1053
Shin HS, Kwon HY, Ryu SJ (2020) A new text classification model based on contrastive word embedding for detecting cybersecurity intelligence in twitter. Electronics 9(9):1527
Shrestha Chitrakar A, Petrović S (2019) Efficient k-means using triangle inequality on spark for cyber security analytics. In: Proceedings of the ACM international workshop on security and privacy analytics, pp 37–45
Tripathy B, Thakur S, Chowdhury R (2017) A classification model to analyze the spread and emerging trends of the zika virus in twitter. In: Behera H, Mohapatra D (eds) Advances in intelligent systems and computing, 1st edn, chap 61. Springer Nature Singapore Pte Ltd., pp 643–650
Vasudevan A, Harshini E, Selvakumar S (2011) Ssenet-2011: a network intrusion detection system dataset and its comparison with kdd cup 99 dataset. In: 2011 second asian himalayas international conference on internet (AH-ICI), IEEE, pp 1–5
Zhang F, Stromer-Galley J, Tanupabrungsun S, Hegde Y, McCracken N, Hemsley J (2017) Understanding discourse acts: Political campaign messages classification on facebook and twitter. In: International conference on social computing. Springer, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pp 242–247
Zhang J, Chen X, Xiang Y, Zhou W, Wu J (2015) Robust network traffic classification. IEEE/ACM Trans Netw 23(4):1257–1270
Zhou Y, Wang P (2019) An ensemble learning approach for xss attack detection with domain knowledge and threat intelligence. Comput Secur 82:261–269
Zhou Y, Cheng G, Jiang S, Dai M (2019a) An efficient intrusion detection system based on feature selection and ensemble classifier. arXiv preprint arXiv:190401352
Zong S, Ritter A, Mueller G, Wright E (2019b) Analyzing the perceived severity of cybersecurity threats reported on social media. arXiv preprint arXiv:190210680
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Rodriguez, A., Okamura, K. Enhancing data quality in real-time threat intelligence systems using machine learning. Soc. Netw. Anal. Min. 10, 91 (2020). https://doi.org/10.1007/s13278-020-00707-x
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13278-020-00707-x