Enhancing data quality in real-time threat intelligence systems using machine learning


In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Both research and production cyber threat intelligence systems often use keyword filtering, either to obtain training data for a classification model or as a classifier in itself. Keyword filtering is known to produce false positives, which lower data quality and create downstream problems for the security analysts who rely on these systems. We propose a method that classifies open-source intelligence data into a cybersecurity-related information stream and then raises the quality of that stream with an unsupervised clustering method. Our method extends keyword filtering with a list of associated words generated by word2vec, which helps classify ambiguous posts and so reduces false positives while still retrieving data of broad scope. We then apply k-means clustering to the positively classified entries to identify and remove clusters that are not relevant to threats. We further investigate whether segmenting the data by its characteristics yields better classification. Together, these methods produce a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.
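The two stages described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The keyword list, the associated-words list (generated with word2vec in the paper, hard-coded here), the `min_associated` threshold, the toy posts, the hand-written 2-D vectors that stand in for document embeddings, and the hand-picked irrelevant cluster are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list; ambiguous terms such as "attack" also match
# posts that have nothing to do with security.
KEYWORDS = {"virus", "attack", "exploit"}

# Hypothetical associated-words list. The paper generates this list with
# word2vec; it is hard-coded here so the sketch has no dependencies.
ASSOCIATED = {"malware", "ransomware", "patch", "vulnerability", "server", "phishing"}


def tokenize(text):
    """Lowercase a post and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_security_related(text, min_associated=1):
    """Accept a post only if it has a keyword and enough associated words."""
    tokens = set(tokenize(text))
    if not tokens & KEYWORDS:
        return False
    return len(tokens & ASSOCIATED) >= min_associated


def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm), seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


posts = [
    "New ransomware exploit targets unpatched server",
    "Heart attack symptoms you should not ignore",
    "Phishing attack spreads malware through email",
    "Restaurant server had a panic attack during the rush",
    "Flu virus season starts early this year",
    "Game server patch fixes duplication exploit",
]

# Stage 1: keyword filter refined by the associated-words list.
kept = [p for p in posts if is_security_related(p)]

# Stage 2: cluster the kept posts. These toy 2-D vectors are assumed
# stand-ins for learned document embeddings, one per kept post.
vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels = kmeans(vectors, k=2)

# Clusters judged irrelevant. Chosen by hand here: the cluster that
# contains the restaurant post (third kept post).
irrelevant_clusters = {labels[2]}
stream = [p for p, lab in zip(kept, labels) if lab not in irrelevant_clusters]
```

In this toy run, the associated-words check rejects the health-related posts that a bare keyword match would accept. Two off-topic posts still pass because they happen to contain "server", and the clustering stage is what removes them. How irrelevant clusters are identified in practice is left open in this sketch.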


Author information

Corresponding author

Correspondence to Ariel Rodriguez.

About this article

Cite this article

Rodriguez, A., Okamura, K. Enhancing data quality in real-time threat intelligence systems using machine learning. Soc. Netw. Anal. Min. 10, 91 (2020).
