1 Introduction

As data spread exponentially, the number of anomalies (hardware and software) and vulnerabilities in network systems also increases. Nowadays, organizations invest significant money and resources into their security, but hackers continue to raise new threats, disrupt business-critical activities, and compromise customers’ data [1]. These issues happen not only because many companies struggle to counter outside attacks that may affect the most critical business operations but also due to a severe shortage of Cybersecurity analysts [2]. Companies have been trying to hire qualified professionals to fight the escalation of attacks, yet with little success. In 2020, the number of open positions was expected to reach one million worldwide, while in 2021, an astonishing three and a half million [3]. Many education programs and online courses, such as SANS (Footnote 1) and Cybrary Insider Pro (Footnote 2), have been created so that trainees can improve their Cybersecurity skills [4]. Nonetheless, solving the Cybersecurity problem through education and training is not enough. Investments are made in detection and monitoring systems; however, these contributions alone cannot guarantee the enterprise’s safety.

Recommender Systems (RSs) and deep learning (DL) are two potential areas to tackle these problems. The first field is popular in e-commerce and provides personalized suggestions by learning about the user’s interests. The other domain allows the extraction of meaningful information and flexible operations; however, DL models need more computation power and training time and raise trust problems [5]. For these reasons, we focus on RSs and their utility in detecting and treating the incidents raised by Cybersecurity systems. In this area, they could help reduce the human workload by handling the incidents with low criticality [6]. Section 4 presents several projects focused on how RS models can suggest defensive actions through machine learning (ML) [7], help analysts navigate complex tools [8], and suggest the most critical alerts [9], among other applications.

In this sense, this survey’s goal is to overview the impact of RSs on Cybersecurity. Only a few surveys have been published regarding these concepts (detailed in Sect. 4). For example, Gadepally et al. [10] studied the application of RSs for the Department of Defense (DoD) and Intelligence Community (IC). Kozik et al. [11] provided a systematic study of the existing literature in this field. Husák and Čermák [12] studied the presence of RSs in the different tiers of Cybersecurity incident handling.

This survey provides the following contributions:

  • It is the first survey in this area that studies security information event management (SIEM) systems and security orchestration, automation, and response (SOAR) applications to decrease the impact of the shortage of Cybersecurity experts;

  • It presents an updated view of the efforts made in RSs in Cybersecurity;

  • It discusses several open problems that the research community could explore. Using RSs to increase the explainability of certain Cybersecurity decisions and applying them to SIEMs and SOARs are two possible research lines.

The remainder of this paper is organized as follows. Section 2 describes some general concepts concerning RSs and the projects designed to counter their challenges. Section 3 follows the same pattern as the previous section but for Cybersecurity. Section 4 details the related work about combining RSs with Cybersecurity. The main contributions of this work are discussed in Sect. 5, where an in-depth analysis is provided regarding open problems with RSs and their impact on Cybersecurity. Finally, Sect. 6 presents the paper’s conclusions.

2 Recommender Systems

RSs are data filtering tools that provide suggestions to users after collecting and analyzing their historical data. An RS can predict the best option (e.g., item, course of action, or other decisions) for a given user in a particular situation based on the user’s preferences.

In fields like e-commerce, online advertisement, and other areas, RSs can mitigate information overload, provide a personalized experience, and increase sales, among other benefits [13]. Another great example is their application by service providers such as Netflix [14], Amazon [15], YouTube [16], and others [17]. These companies use RSs to attract more users and generate more income. According to McKinsey, 75% of what users watch on Netflix comes from product suggestions, and 35% of the purchases on Amazon result from the intervention of recommender approaches [18].

2.1 Techniques

RSs rely on various algorithms to predict the most appropriate recommendation for users. Several approaches can be used depending on the application scenario and its constraints.

The main types of recommendation techniques are content-based (CB), collaborative filtering (CF), demographic-based (DB), knowledge-based (KB), and hybrid [19]. Later, Golbeck [20] complemented the previous work with another category named community (CoB), a group that uses the user’s friends’ opinions to make recommendations. At the same time, DL is introduced in RSs to enhance recommendation capabilities.

CF techniques are the most used methods and suggest items to a specific user based on the preferences of other users with similar tastes, transaction histories, and characteristics [21]. CF approaches assume that users who agreed in the past will agree in the future and like similar items as they did in the past. The rating history of two individuals demonstrates how similar their preferences are. Many similarity measures have been used to calculate the similarity degree in RSs, such as Minkowski distance, Euclidean distance, Pearson correlation coefficient, cosine similarity, and others [22]. Isinkaye et al. [23] divided CF into model-based and memory-based. The first set uses several data mining techniques such as clustering, neural networks (NN), and other methods, while the latter relies on similarity measures between items and users. CF approaches can generally achieve accurate recommendations without comprehending the item itself and don’t require content inspection and extraction, but they suffer from sparsity and cold-start problems.
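To make the memory-based variant concrete, the following minimal sketch predicts a missing rating as a similarity-weighted average of other users' ratings, using cosine similarity over co-rated items. It is an illustration only: the toy ratings and the names `ratings`, `cosine`, and `predict` are assumptions introduced here and are not taken from any of the cited works.

```python
from math import sqrt

# Hypothetical toy ratings (1-5 scale): user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 1},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    # None signals a cold-start case: nobody comparable has rated the item.
    return num / den if den else None


print(predict("alice", "D"))
```

In this toy data, the prediction for alice on item D is pulled toward bob's rating because their histories coincide. An item that no one has rated yields `None`, which mirrors the cold-start limitation mentioned above.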

CB approaches make recommendations based on item-to-item similarity and the user’s profile preferences [24]. CB methods assume that items similar to others liked by the user in the past are the most likely to be suggested. The item’s features determine how similar an item is compared to another. Usually, the system builds a CB profile of users based on a weighted vector of item features. The weights correspond to each feature’s importance and can be computed from individually rated content vectors using simple approaches (the rated item vector) or ML methods. Overall, CB approaches provide transparent and detailed information about the features considered, but they are limited in recommendation customization and suffer from over-specialization.
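The profile construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: the feature vectors, the rating-weighted averaging rule, and the names `items`, `build_profile`, and `recommend` are hypothetical choices made here, not a method prescribed by the cited works.

```python
from math import sqrt

# Hypothetical item feature vectors (e.g., genre weights: action, comedy, drama).
items = {
    "m1": [1.0, 0.0, 0.0],
    "m2": [0.9, 0.1, 0.0],
    "m3": [0.0, 1.0, 0.0],
    "m4": [0.0, 0.2, 0.8],
}


def build_profile(rated):
    """User profile as the rating-weighted average of rated items' features."""
    total = sum(rated.values())
    dim = len(next(iter(items.values())))
    return [
        sum(r * items[i][k] for i, r in rated.items()) / total
        for k in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recommend(rated, top_n=2):
    """Rank unrated items by similarity between their features and the profile."""
    profile = build_profile(rated)
    candidates = [i for i in items if i not in rated]
    ranked = sorted(candidates, key=lambda i: cosine(profile, items[i]), reverse=True)
    return ranked[:top_n]


print(recommend({"m1": 5, "m3": 1}))
```

Because the profile is built only from features of items the user already rated, the top suggestions stay close to past choices, which is the over-specialization effect noted above.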

DB algorithms allow suggestions based on demographic attributes [25]. DB models study features such as gender, age, language, religion, health, mobility, and other information and form “people-to-people” correlations to reach conclusive suggestions. DB approaches don’t require a history of user ratings and can boost other RSs’ power. As limitations, they present poor results when working as a stand-alone component and may raise privacy problems.

KB techniques recommend items or actions based on explicit knowledge about the item’s features, user interest, and recommendation criteria [26]. Here, an RS combines user preferences data with domain knowledge through similarity measures to find suggestions that may benefit the users. This breakdown is usually reached when looking into the knowledge base and the user profile. KB systems can get noise-free recommendations and don’t suffer from data sparsity and cold-start issues. Still, they are hard to scale, and knowledge acquisition can be costly and difficult to maintain.

CoB models make suggestions based on the opinions of the user’s friends [27]. They assume that recommendations provided by social networks (such as Facebook, Twitter, and others) are more reliable and trustworthy than the ones obtained with similarity measures [28]. With the continuous growth of social data and the number of relationships between users, this has become a suitable method to counter social media information overload. CoB techniques provide explainable results but struggle to capture all the associations between users and user behavior, as interests constantly change.

With enough resources and computation power, DL could help overcome some of the issues mentioned in Sect. 2.2 (for example, scalability and sparsity) and improve the recommendation quality. DL models, however, suffer from a lack of interpretability and the need for hyperparameter tuning, among other issues. Yet, they can significantly enhance RSs’ performance. For example, DL models can capture nonlinear and non-trivial user-item relationships and handle sequential modeling tasks such as machine translation and chatbots.

Hybrid strategies are the last category and probably the most suitable approach in RSs. These structures combine multiple methods to alleviate their drawbacks while taking advantage of their features. It follows the idea that combining different algorithms, such as CB with CF, will provide more accurate recommendations [29]. Despite optimizing the suggestion accuracy, using hybrid systems requires an analysis of the application scenario and the intended features to minimize costs and complexity.

2.2 Challenges

RSs can speed up searches, make content more accessible, and provide a competitive advantage, among other benefits. Still, they are affected by scalability limits, sparsity, privacy concerns, cold-start, lack of serendipity, over-specialization, dynamic systems, shilling attacks, and gray sheep intrusion [30, 31]. Only a few research projects are included for each challenge to conserve space.

2.2.1 Scalability

One of the questions most targeted by the research community and industry is how to maintain a scalable system while the number of items and features grows exponentially. Usually, as the volume of data increases, the threat of poor scalability increases, requiring more investment and resources. The research community has been exploring ways to improve scalability. For instance, Cacheda et al. [32] studied the tendencies or differences between users and items instead of their similarities. Kumar and Sharma [33] proposed a scheme that combines weighted Slope One with item classification (k-Means). Yu [34] introduced a scalable matrix-completion algorithm in CF systems with trace norm regularization. Georgiou and Georgiou [35] designed a genetic-based clustering method that creates dense clusters (their representatives help minimize the time complexity). Zhang et al. [36] studied several DL models (autoencoders, convolutional NN, and attentional models) to address scalability and sparsity and raise interpretability.
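As background for the Slope One family mentioned above, the sketch below shows the classical weighted Slope One predictor, in which each item-to-item average rating difference is weighted by the number of users who rated both items. It is not the combined scheme of [33] (no k-Means step is included), and the toy data and the name `predict` are assumptions made for illustration.

```python
# Hypothetical toy ratings (1-5 scale): user -> {item: rating}.
ratings = {
    "u1": {"A": 5, "B": 3, "C": 2},
    "u2": {"A": 3, "B": 4},
    "u3": {"B": 2, "C": 5},
}


def predict(user, target):
    """Weighted Slope One prediction of `user`'s rating for `target`."""
    num = den = 0.0
    for item, r_ui in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target - item) over users who rated both items.
        diffs = [
            r[target] - r[item]
            for r in ratings.values()
            if target in r and item in r
        ]
        if not diffs:
            continue
        dev = sum(diffs) / len(diffs)  # average deviation of target from item
        num += (dev + r_ui) * len(diffs)  # weight by the number of co-raters
        den += len(diffs)
    return num / den if den else None


print(predict("u3", "A"))  # (2.5 * 2 + 8 * 1) / 3 = 4.333...
```

The deviations depend only on item pairs, so they can be precomputed and updated incrementally, which is one reason this family is often considered when scalability matters.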

2.2.2 Sparsity

RSs are also confronted with the lack of information about the transactions or feedback data when rating an item, action, or others. The presence of sparsity prevents accurate suggestions with specific interests because there won’t be anyone similar to the case at hand. As in the previous challenge, several ideas tackle the sparsity problem. Lu et al. [17] surveyed RS in several scenarios (e-government, e-commerce, and others) and how sparsity cases could be avoided. Idrissi and Zellou [37] researched the efforts made by RSs under sparse data and the evaluation criteria required. Pan et al. [38] presented a tool based on matrix factorization that benefits from the user and item data from other domains. Devi et al. [39] employed a probabilistic NN in CF systems due to their ability to quickly calculate the trust between the users. Anand and Bharadwaj [40] introduced a platform that uses local and global similarities to enhance the neighborhood. Qiu et al. [41] combined linear regression and multi-dimension similarity to create feedback. Meng et al. [42] introduced a privacy-preserving and sparsity-aware location-based prediction algorithm that applies a data obfuscation solution and region aggregation mechanism to handle privacy and sparsity.

2.2.3 Privacy

Another difficulty confronted by these systems is the loss of privacy. Retrieving user preferences raises many privacy concerns because the data collected may contain sensitive information (age, gender, and other personal data). Although users are, in most cases, shown privacy policies describing how RSs use their data, they usually have no direct control over the data or understanding of how and where this data is used. Many privacy-related surveys have been issued with the rising impact of Cybersecurity threats [1, 2]. Jeckmans et al. [43] researched privacy concerns in RSs in terms of user information (user preferences, history, and others) and research areas (law, awareness, and others). Asela and Guy [44] studied the evaluation metrics about RSs and the properties shared by them. Milano et al. [45] reflected upon the ethical challenges inherited by RSs, such as privacy, opacity, fairness, and other features. Regarding solutions proposed, Zhan et al. [46] studied a commodity-based scalar product method to prevent the leak of user data. Kobsa and Knijnenburg [47] investigated the cognitive process behind user decisions made by context-based RSs regarding information disclosure. Xin and Jaakkola [48] studied the amount of public user data required to reach satisfactory accuracy.

2.2.4 Cold-start

Cold-start scenarios also disturb the possibility of suggesting something to users.

Fig. 1 Cold-start [49]

Figure 1 displays when a cold-start case occurs (an unknown individual or item enters the system). Depending on the recommendation approach, an RS may struggle to predict items or actions or reach accurate recommendations if the user’s tastes are unknown or no one has rated a new item. This topic is discussed in works that tackle previous challenges; however, other articles cover this aspect. Kamps et al. [50] studied the continuous cold-start, also known as CoCoS (users or items that aren’t updated frequently), and its impact on content and context-based e-commerce RSs. Gope and Jain [51] studied several cold-start techniques and divided them into explicit and implicit groups. The research community has proposed many particular methods to cope with cold-start. Shaw et al. [53] studied the usage of association rules (based on the Taxonomy-driven Product Recommender [52]) to expand the user profile. Kim et al. [54] deployed error-reflected models by analyzing the prediction errors to make new predictions. Lika et al. [55] implemented a system that uses classification approaches, semantic similarity metrics, and prediction mechanisms to ease cold-start.

2.2.5 Cyberattacks

Cybercriminals conduct various attacks, such as shilling, data poison, adversarial, and gray sheep, to steal RSs data or manipulate the rating/popularity of an item or action. The most common type is shilling, which works similarly to data poison and adversarial attacks. In a shilling scenario, users provide fake data to exploit ratings and reviews and increase/decrease the item’s rank [56]. The gray sheep attack is a sub-type of shilling attack and is characterized by using opinions of users that do not match any group to reduce the recommendations’ accuracy [57]. Data poison attacks try to tamper with the training of ML models to decrease their performance, increase training time, and induce misclassification [58]. Adversarial attacks are similar to data poison attacks, but instead of manipulating the training data, they can only exploit models at the testing stage [58].
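The effect of a shilling scenario on an item's rank can be illustrated with a small simulation. The sketch below is a deliberately simplified push attack: fake profiles give the target item the maximum rating and a fixed filler rating to every other item. The data, the fixed filler value, and the names `mean_rating` and `inject_push_attack` are assumptions made here for illustration and do not reproduce any specific attack model from the cited works.

```python
# Hypothetical genuine ratings (1-5 scale): user -> {item: rating}.
genuine = {
    "u1": {"X": 2, "Y": 4},
    "u2": {"X": 1, "Y": 5},
    "u3": {"X": 3, "Y": 4},
}


def mean_rating(ratings, item):
    """Average rating of `item`, or None if nobody rated it."""
    values = [r[item] for r in ratings.values() if item in r]
    return sum(values) / len(values) if values else None


def inject_push_attack(ratings, target, n_fake, filler_rating=3, max_rating=5):
    """Return a copy of `ratings` with `n_fake` profiles that push `target`."""
    attacked = {user: dict(r) for user, r in ratings.items()}
    all_items = {i for r in ratings.values() for i in r}
    for k in range(n_fake):
        # Filler ratings make the fake profile look like an ordinary user.
        profile = {i: filler_rating for i in all_items if i != target}
        profile[target] = max_rating
        attacked[f"fake_{k}"] = profile
    return attacked


attacked = inject_push_attack(genuine, "X", n_fake=3)
print(mean_rating(genuine, "X"), mean_rating(attacked, "X"))  # 2.0 3.5
```

With only three injected profiles, the mean rating of the target item rises from 2.0 to 3.5 in this toy data, which shows why small rating populations are particularly exposed to this kind of manipulation.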

It is also important to mention that manipulating particular objects doesn’t necessarily mean there is cybercriminal activity. For instance, Sony Company used fake reviews to engage more clients in seeing their movies [59]. These intrusions are common in CF tools and can break the trust, fairness, performance, and quality of the RSs. With the growth of cyberattacks, some surveys have been presented. For example, Williams et al. [60] extensively analyzed many attack models and their impact on RSs and introduced a classification-based tool for detecting attack profiles. Kaur and Goel [56] simulated the impact of the most common shilling attacks (random, average, bandwagon, and segment) in RSs. Deldjoo et al. [61] surveyed several attack and defense methods in recommendation models with adversarial machine learning (AML) techniques. Himeur et al. [62] researched the security and privacy elements of modern RSs, their applications, and their future.

Concrete approaches in this area usually fall under two main categories: detection and defense against these attacking structures and intrusion enhancement. Regarding the first group, Hofmann et al. [63] implemented a matrix factorization algorithm based on robust M-estimators to deliver valid suggestions even with spam caused, for example, by shilling attacks. Ghazanfar and Prugel-Bennett [64] showed a different version of K-Means++ to find gray sheep users (GSU). Boyer et al. [57] utilized distribution-based methods and information retrieval to detect GSU in CF RSs. Agnani et al. [65] took advantage of the distribution of the user similarities and application of local outlier factors to identify gray sheep cases. On the other hand, some attacking tactics have been explored. For instance, Singh et al. designed an improved data poison attack model for factorization-based CF systems through alternating minimization [66] and nuclear norm minimization [67] approaches to study potential defensive strategies [68]. Fang et al. [69] introduced an optimized poison attack technique to manipulate graph-based RSs and studied ML methods that can detect fake users. Christakopoulou and Banerjee [70] devised a framework that injects adversarial user profiles into oblivious RSs to target the top-K recommendation quality. Fang et al. [71] proposed a data poison attack for matrix factorization-based Top-N RSs by optimizing the ratings of a fake user with a subset of influential users. Xiao et al. [72] introduced a new Augmented Shilling Attack framework built with a generative adversarial network (GAN) capable of generating fake and adaptable profiles to promote items for a group of users. Anelli et al. [73] presented a method that enhances the shilling attack capabilities for CF systems by including semantic-encoded information extracted from knowledge graphs. Sundar et al. [74] studied shilling attacks for online environments.

2.2.6 Serendipity

Serendipity is closely related to diversity and novelty properties. It happens when users receive and like suggestions different from their preferences. These hints may cause changes in users’ tastes and even extend their interests to other domains. A few surveys have gone through serendipity and its relevance in RSs. Kotkov [75, 76] examined the paradigm of serendipity in this area. Kaminskas and Bridge [77] surveyed some projects involving serendipity and the available optimization strategies. Zhao et al. [78] examined serendipity in RSs using user feedback in a movie scenario. Badran et al. [79] inspected the role of serendipity in RSs, and their relevance to user satisfaction. Ziarani and Ravanmehr [80] provided a systematic literature review on serendipity-based RSs, focusing on the definitions, challenges, and other topics. Zuva and Zuva [81] built a recommendation model that ensures serendipity, accuracy, and diversity. Rao et al. [82] created a scheme that builds suggestions for users with serendipity without sacrificing accuracy. Wang et al. [82] designed a greedy reranking approach for RS that enhances serendipity with feature diversification.

2.2.7 Over-specialization

Over-specialized suggestions prevent users from discovering new insights. It usually happens when an RS can only suggest something favorable to a user’s profile (it recommends items like others already rated). The occurrence of these events hurts the delivery of serendipitous recommendations. Besides the related work already shown in previous limitations, only a few projects are included. Yahia et al. [83] used an explanation-based diversity approach to decrease the over-specialization in CB and CF systems. Adamopoulos and Tuzhilin [84] proposed an improved version of the classical neighborhood approach for over-specialization and concentration bias. Adamopoulos [85] introduced a recommendation technique that considers unexpectedness.

2.2.8 Dynamic recommendations

Another challenge in recommender approaches is understanding how to deliver a recommendation without user requests (implicit requests). Recognizing when a user has an implicit request allows RSs to be more proactive and improve the user experience. It is a more restricted subject where few articles are available. Lovelle et al. [86] studied implicit techniques and their correlation with explicit feedback in an electronic books RS scenario. Lee et al. [87] explored implicit temporal information (purchase time and item launch time) for the generation of pseudo rating data. He et al. [88] introduced an element-wise ALS based on view data (purchased, viewed, and non-viewed interactions) to enhance implicit RSs. Later, this method was improved with pointwise regression (PointVALS) and pairwise ranking (PairVALS) [89]. Session-based RSs could also help boost this aspect since they provide dynamic user preferences, allowing more precise recommendations [90].

2.2.9 Explainability

The last problem faced by RSs is the absence of explainability/interpretability in the recommendations. Incorporating this concept into the suggestions could help users/organizations understand why certain items are picked instead of others. Furthermore, it can improve the user experience at different levels (confidence, trust, and satisfaction). A few surveys address the current state of explainable recommendations. Manolopoulos et al. [91] presented a taxonomy for explanation styles (human, item, feature, and hybrid) and studied their impact on social RS. Al-Taie [92] provided an overview of explanations in RSs and the related work in the area. Zhang and Chen [93] explored explainability solutions considering the type of explanations (visual, social, and others) and model (mining, post hoc, and others). Abdollahi and Nasraoui [94] debated the role of explainability in ML and several explanation styles and methods. Wang et al. [95] discussed how important explainability, transparency, and other properties are for trustworthy RSs. In recent years, several proposals have been presented to increase explainability. Chang et al. [96] built a system that uses a multi-modal aspect-aware topic model and aspect-aware latent factor model on text reviews and images. Bellini et al. [97] used a Semantics-Aware Autoencoder to construct explanations for recommendations through knowledge graph data. Adadi et al. [98] investigated the use of knowledge-based extraction approaches (ad hoc model-agnostic) in RSs to retrieve rules relevant to the explanation process.

3 Cybersecurity

Cybersecurity consists of protecting and recovering information assets comprised of data, systems, networks, devices, software programs, and interoperability-enabling services from cyberattacks. Usually, these attacks relate to access, change, or even the destruction of sensitive and private data, extortion of money from the entities exploited, or disruption of business processes. MITRE ATT&CK [99] published many taxonomies with the techniques adopted by cybercriminals to bypass the organization’s security.

Lately, a high number of Cybersecurity problems have been reported by organizations in all sorts of areas. While many of these alerts may feel more like alarmism than actual threats, the reality is that Cybersecurity is facing many issues at an unprecedented rate. With the evolution of cyberattacks (e.g., phishing, malware, and other malicious activity) and the continuous expansion of networks, Cybersecurity has become a significant priority. As more enterprises understand and identify their security framework and vulnerabilities, the industry has become indispensable to day-to-day operations. However, despite increasing security and preventing possible attacks, security is still an issue in many businesses (for example, in the Internet of Things (IoT) [100]). Therefore, Cybersecurity has been trying to be more vigilant by moving into a more adaptive approach instead of a perimeter-based one (the National Institute of Standards and Technology, or NIST, provided some guidelines; see Footnote 3). The research community has been looking for more defensive mechanisms to reduce the breaches’ effect and find vulnerabilities in several critical areas like mobile [101], IoT [102, 103], control systems [104], and others.

The recent effects of the global pandemic put even more pressure on Cybersecurity and its services. People were forced to work at home and use cloud tools. As a consequence, this change created more opportunities for cyberattacks. According to the Cyber Incident Response and Analysis Cybersecurity report, three out of ten organizations felt a spike in the volume of attacks (Footnote 4). Another interesting fact mentioned in the Verizon Data Breach Investigations Report 2020 was that the healthcare industry suffered an increase of 58% in intrusions. Rob Sobers presented several statistics showing the impact of COVID and other Cybersecurity issues (Footnote 5). McKinsey detailed two main security strategies to fight against cyberattacks, namely securing work at home and ensuring the reliability of the network traffic [105].

3.1 Challenges

Cybersecurity is essential to our daily lives because it can discover potential breaches and prevent disruptions in critical systems, such as power plants, hospitals, and financial companies. However, its effectiveness is challenged by ransomware, IoT threats, cloud security, ML attacks, cryptocurrencies and blockchain, privacy policies, and a shortage of talented staff [1, 106, 107]. All these problems are studied, with emphasis on the lack of qualified personnel, since RSs could play a relevant role there. Only a few works are included to save space.

3.1.1 Ransomware

One of the most prevalent issues in Cybersecurity is ransomware. As illustrated in Fig. 2, ransomware, also classified as one type of advanced persistent threat (APT), starts with the attacker stealthily getting inside the victim’s systems to encrypt the data and then sell it or ask for a ransom.

Fig. 2 Ransomware pattern [108]

Cybercriminals use phishing mechanisms, social engineering, and other methodologies to invade and exploit the system’s vulnerabilities. Several surveys discuss ransomware and the existing prevention techniques. Aleem et al. [109] classified ransomware according to type, platform, and other features. Maarof et al. [110] elaborated a ransomware taxonomy based on severity, platform, and target and studied the requirements for a successful attack. Humayun et al. [111] explored the evolution, prevention, and mitigation methods for ransomware in IoT. Sharma et al. [112] focused on Android ransomware detection and future research lines. On the other hand, Oz et al. [113] provided a ransomware taxonomy and reviewed ransomware defense. In terms of approaches, Andronio et al. [114] invented a system that recognizes when ransomware applications are trying to lock the device, encrypt its data, or both. Arshad et al. [115] proposed a framework that creates an artificial environment to expose ransomware. Hou et al. [116] developed an Android detection tool that enhances ransomware discovery through structured heterogeneous information networks (HIN).

3.1.2 IoT vulnerabilities

With the introduction of IoT, devices and tasks can be connected and automated. However, it also opens the doors for hackers. Typically, the lack of security in digital components (for example, smartphones) is due to insecure web interfaces, lack of knowledge about safety, weak passwords, and vulnerable WiFi networks, among other aspects. One of the areas most affected by these intrusions is the Internet of Medical Things because the patient’s data and lives are at stake [117]. The research community has been exploring the potential of IoT in this area. Morchon et al. [118] studied privacy threats in IoT, such as profiling, inventory attacks, and others. Alghamdi et al. [119] dissected the IoT structures (devices, services, and both) and their security threats, attacks, and limitations. Tsai et al. [120] inspected the data mining application for IoT. Miani et al. [121] explored intrusion detection systems (IDS) and proposed a taxonomy based on detection and validation methods, threat level, and IDS location. Jones et al. [122] assessed IoT-based health systems, their risks, security threats, and some corrective measures for them. Lipman et al. [123] presented an in-depth analysis of IoT cyberattacks and the methodology they usually follow. Concerning methods for IoT security and privacy, Khayami et al. [124] created an IoT malware hunting method built with LSTM techniques to evaluate ARM-based IoT applications’ execution operation codes (OpCodes). Doshi et al. [125] investigated DDoS (distributed denial of service) tracking in IoT devices with a packet-level ML algorithm.

3.1.3 Cloud security

Small and big companies rely on cloud storage to secure their data. However, the cloud can’t guarantee the safety of the information stored. Hackers exploit weaknesses like cloud misconfigurations, insecure APIs, meltdowns, data loss due to natural disasters, and even human error to access confidential data. A vast number of surveys and solutions have been presented on this topic. Azeem et al. [126] overviewed the cloud security vulnerabilities, attack types, and the proposed state-of-the-art tactics. Asal et al. [127] reviewed cloud security with a focus on cloud security assurance and its repercussions. Tamrawi [128] investigated the role of the cloud in health services and some concerns like the lack of security, privacy, and other properties. Goyal and Kumar [129] presented a cloud security taxonomy where requirements, risks, weaknesses, and solutions are mapped. Mangla et al. [130] described DDoS attacks and how they can be detected and prevented in cloud environments. As alleviating approaches, Bonti et al. [131] proposed the Cloud Protector, a backpropagation neural network method for tracking HTTP-DoS and XML-DoS attacks. Mohamed et al. [132] presented IDPS, a collaborative intrusion detection and prevention system that uses anomaly detection and pattern matching to find distributed and port scanning attacks. Xing et al. [133] developed SDNIPS, a software-defined networking-based intrusion prevention system capable of detecting malicious activity and reconfiguring its network in real time.

3.1.4 ML attacks

ML brought many benefits to many areas but can also be a double-edged sword. Its usage may improve the performance of systems and enhance decision-making, but hackers can also take advantage of ML to create intelligent and sophisticated attacks. In recent years, ML has received much attention with reviews and projects. Liu et al. [134] studied security in ML from two perspectives (training and test phase) and categorized the defensive solutions into four groups. Loukas et al. [135] established a taxonomy for AML techniques to address the related work, applications, and pending questions. Chang and Al-Rubaie [136] researched multiple ML techniques with special attention to data collection and privacy violation. Xue et al. [137] surveyed the security aspect of ML, attack models, defensive mechanisms, and security evaluation algorithms. Ye et al. [138] also discussed privacy risks within ML models and other features. Gupta et al. [139] inspected the ML and DL tracking approaches. They presented SDA, a DL and ML-based secure data analytics model to distinguish regular and malicious events. Ahmad and Alsmadi [140] composed a systematic literature review about the cooperation of ML, IoT, and information security. Swami et al. [141] crafted a black-box attack scenario where the attacker struck a deep NN hosted without knowledge about the target. Ludwig et al. [142] devised a method to detect and filter poisoning or causative attacks in the training step through data provenance. Hou et al. [143] built SecureDroid, a framework that provides security for Android devices against adversarial attacks through SecCLS (feature selection) and SecENS (ensemble learning) techniques.

3.1.5 Cryptocurrencies and blockchain

With cryptocurrencies and blockchain protection still in the early stages of implementation, hackers manipulate the systems and access private data maliciously. Cybercriminals can steal data through Eclipse, Sybil, and DDoS if security controls aren’t introduced correctly. Sai et al. [144] tested the security and privacy of cryptocurrency applications and suggested DroidSafe, a static code analyzer that identifies threats. Gosh et al. [145] discussed blockchain-related work, its challenges, applications, and other features and categorized blockchain into permission-based and participation-based. Manco et al. [146] used a DL model (encoder–decoder) that receives blockchain network information (time series data) to diagnose breaches. Jantan et al. [147] presented a hybrid algorithm (SVM, KNN, and random forest) to discover cryptocurrency mining malware that invades systems with fileless attack tactics. Yazdinejad et al. [148] created a deep RNN model to track cryptocurrency malware in MS Windows opcodes sequences.

3.1.6 Privacy policies

Data protection laws are another Cybersecurity obstacle. Regulations such as the General Data Protection Regulation (GDPR),Footnote 6 the California Consumer Privacy Act (CCPA),Footnote 7 and others are difficult for companies to comply with and usually end up increasing the complexity of the business. Some works have been published regarding the effects of GDPR on business operations. Tom et al. [149] sketched a GDPR model to help create privacy policies and assessment tools for GDPR compliance. Li et al. [150] discussed the opportunities and challenges posed by GDPR when developing solutions in China and the United States. Kutar and Addis [151] inspected the influence of GDPR on company management and the importance of FAT (Fairness, Accountability, Transparency) in artificial intelligence (AI) methods. Regarding ways to ease GDPR integration, Sun et al. [152] suggested a BC-based personal data management tool so that all of its components (service provider, data processors and controllers, and resource servers) meet the GDPR conditions. Piras et al. [153] presented DEFeND, a data privacy governance platform for GDPR support that benefits from a Privacy by Design approach to help businesses fulfill their requirements and comply with GDPR. Kala et al. [154] studied a GDPR tool based on the AS-IS compliance method that identifies and explains non-compliance problems while recommending possible solutions.

3.1.7 Cybersecurity experts shortage

The last challenge is the lack of skillful people handling cyber threats. Overall, the number of attacks keeps increasing rapidly, but the number of IT (Information Technology) analysts doesn’t, causing work overload and other business problems.

As shown in Fig. 3, this phenomenon poses many business issues. To mitigate the lack of skilled staff, organizations have been updating their security certifications and creating more security training courses, among other approaches. Contrary to the previous concerns, research in this field has focused on user assistance rather than on new strategies to detect or treat cyber events. For instance, two types of technologies that can have a significant impact on successful security operations, even when fewer qualified people are available, are security information and event management systems (SIEMs) and security orchestration, automation, and response (SOAR) [155].

SIEM solutions play a crucial role in enterprises since they provide insight into IT activities and keep records of them. They analyze the customer’s data and provide real-time threat monitoring, event correlation, and incident response. Their introduction allowed the analysis of larger datasets, the unraveling of hidden threats, and the optimization of several other business operations. Bhatt et al. [156] discussed the role of SIEM tools in the security operations center (SOC). These systems may aid SOCs in several tasks (event collection and normalization, cross-event correlation to find hidden patterns, and other tasks). Granadillo et al. [157] proposed two alert correlation techniques (defense-based and metric-based alert correlation) for SIEMs. Sancho et al. [158] suggested Viewnext-UEx, a threat classification and prioritization model based on knowledge collected from SIEMs. Lampe et al. [159] investigated the application of SIEMs as security as a service for cloud environments. Sun et al. [160] developed a cloud forensics framework in a virtual-cloud environment using a SIEM architecture. Vallini et al. [161] proposed an enhanced security information and management system capable of enduring Cybersecurity issues in critical infrastructures. Adam and Ping [162] developed a security event management tool for SLA-driven monitoring and correlating events in 5G scenarios. Mulyadi et al. [163] extended SIEMs with Dockerized Elastic Stack to support data collection and storage of log data.

Fig. 3 SOC shortage impact

SOAR frameworks complement SIEMs in helping SOC teams manage and respond to the incidents collected; however, SOAR goes several steps further. While SIEMs tend to alert security teams, SOAR exists to augment their capabilities. SOAR tools combine security orchestration, automation, and response, leading to less vulnerable structures and more efficient security teams [155]. Some opinions and research projects have been shared to clarify the impact of SOAR tools. Brewer [164] studied how current organizations can benefit from including SOAR infrastructures. Islam et al. [165] published a Multi-Vocal Literature Review on security orchestration where features, challenges, core components, and possible application areas are detailed. Mohammad and Lakshmisri [166] discussed the importance of security automation (one of the pillars of SOAR) in Cybersecurity. Luo and Salem [167] studied Service-Oriented Software-Defined Security, an approach that simplifies Cybersecurity management tasks by separating the management part from the security controls based on Software-Defined Security ideals.

The next section details the related work regarding the impact of RSs in Cybersecurity.

4 Recommender Systems in Cybersecurity

The literature review follows a systematic approach [168] and is based on the analysis of papers collected from five academic databases: Google Scholar, IEEE Xplore, SCOPUS, ACM Digital Library, and ScienceDirect. We ran a keyword search using the terms “recommendations,” “defensive actions,” “Recommender Systems,” and “Cybersecurity.” The papers were screened by title, abstract, and keywords to determine the most relevant articles for further analysis. Since the combination of RSs with Cybersecurity is a recent area, only a handful of works are available. However, the existing ones demonstrate how these concepts could improve Cybersecurity decisions and incident treatment through recommendations. The selection criteria hinged on surveys in the area, prediction and defense models, support tools, and extension to other domains (in this order). Table 1 shows the distribution of the articles collected considering the selection criteria.

Table 1 Related work distribution

4.1 Surveys

Gadepally et al. [10] studied the application of RSs for Department of Defense (DoD) and Intelligence Community (IC) applications. Their research focuses on the combination of RSs with Cybersecurity and on some programs established at MIT Lincoln Laboratory, where these two concepts are mixed and studied. The authors point out that using RS applications in DoD and IC areas differs entirely from commercial operations. For example, while in commercial solutions success is measured by a concrete action (e.g., a product is sold), in DoD usage it depends on probability and speculation. Many of these differences are illustrated in a table in the research paper [10]. They also mention the need to clarify ethical and technical questions regarding user trust, privacy and security preservation, user environment adaptation, multi-level metrics, system extension, and partnerships between academia and industry. Their laboratory had seven projects combining RSs with Cybersecurity at the time of publication. Overall, this study is an all-around paper that shows how these topics can complement each other when applied in the defense domain. Another strong point is the depth and variety of the future work proposed by the authors. This work is very different from ours since it focuses only on projects designed by their laboratory, while our survey covers ideas from several researchers from other groups.

Kozik et al. [11] investigated the advantages, disadvantages, applications, and research work available in the area. Furthermore, the authors try to answer some research questions regarding the state of the art of RSs in Cybersecurity and the approaches used. Despite presenting a broad view of this topic, the research community could benefit from a more elaborated future research section. Like our survey, this study goes over RSs techniques and their related work on Cybersecurity; however, they are very distinct regarding future work.

Husák and Čermák [12] surveyed the impact of RSs on Cybersecurity incident handling and response automation and designed a taxonomy for it. The authors discuss the incident handling task and its different tiers (Triage, Incident Response and Analysis, Intelligence, and Incident Management). Each tier is inspected in terms of automation (whether it can be automated and to what extent), how RSs may help in the decision process, and existing research projects. Furthermore, they identify the research groups and publication venues (journals and conferences) responsible for studying this topic. One interesting conclusion of this work is that, rather than tackling all the tiers mentioned, most existing works prefer using an RS for particular tiers. Unlike our survey, this work organizes the literature in a taxonomy (incident handling tiers) and overviews publication venues and research groups. Still, it provides less analysis of the future work than our survey and doesn’t review the different RS techniques.

4.2 Prediction and defense strategies

Lyons [169] developed a hybrid system capable of predicting future attacks and generating an ordered list of cyber-defense actions. Her thesis studies the possibility of employing an RS that uses the CF algorithm as the attack predictor and a KB approach to create defensive actions and avoid sparsity issues. The solution uses a virtual network, a host server machine, an IDS, and an RS. Performance is measured with the Root Mean Square Error, the recommendation speed, and other variables. In general, the system’s performance was affected by the configuration of the client machines, the configuration of the RS machine, the IDS alerts (whose output is used to create defensive options), the RS’s initial state before executing the algorithm, and the recommendation method applied. Despite obtaining weak prediction values, the proposed tool provides defense actions capable of treating threats (in the desirable order). As future work, Lyons suggests using a CB approach instead of CF with KB to make the recommendation process more accurate. Compared to the previous method, the proposed model uses two recommendation techniques and provides defensive options. As a suggestion, the system could be tested in real environments.

Polatidis et al. [170] studied how RSs could predict future attacks using a parameterized version of the multi-level CF method. At its core, the proposed method identifies attack paths using graphs and uses recommendation processes to predict future cyberattacks. This approach is based on attack path discovery and attack prediction. The attack path discovery method inspects several features from the attacker, such as location, capability and entry, and target points. The attack prediction technique receives information from the first stage (non-circular attack paths) and uses an RS to forecast and classify possible future threats. The proposed approach is evaluated using the maritime supply chain IT infrastructure data and tested within a Cybersecurity maritime supply chain risk management system. After comparing the proposed methods with others available in state of the art, testing its performance in a maritime supply chain infrastructure, and hearing the opinions of five experts, the authors concluded that graph analysis and recommendation strategies are valid options for predicting incoming breaches. The authors discuss the path length recommendation and the cyberattack prediction process. This project is well organized and could take advantage of being evaluated in a real-world risk management system.
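
The attack path discovery stage described above hands non-circular attack paths to the recommender. As a minimal sketch of that first stage only, the snippet below enumerates all non-circular paths between an entry point and a target with a depth-first search. The graph, the node names, and the function `simple_attack_paths` are hypothetical illustrations and are not taken from the paper by Polatidis et al. [170].

```python
from typing import Dict, List


def simple_attack_paths(graph: Dict[str, List[str]], entry: str, target: str) -> List[List[str]]:
    """Enumerate every non-circular path from entry to target (depth-first search)."""
    paths: List[List[str]] = []

    def visit(node: str, path: List[str]) -> None:
        if node == target:
            paths.append(path)
            return
        for nxt in graph.get(node, []):
            # Skip nodes already on the current path so that no path contains a cycle.
            if nxt not in path:
                visit(nxt, path + [nxt])

    visit(entry, [entry])
    return paths


# Hypothetical toy network: an edge A -> B means "an attacker on A can reach B".
network = {
    "internet": ["web", "vpn"],
    "web": ["app", "internet"],
    "vpn": ["app"],
    "app": ["db", "web"],
    "db": [],
}

paths = simple_attack_paths(network, "internet", "db")
for p in paths:
    print(" -> ".join(p))
```

In a pipeline of this kind, the enumerated paths (together with attacker features such as location and capability) would be the input that a CF-style recommender scores; that scoring step is not shown here.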

Carvalho et al. [171] proposed a text mining-based RS called VulIntel (Vulnerability IntelliSensor) to detect insecure code segments and recommend actions for them. The approach comprises two main modules: a Data Analyzer and an RS. The first module gathers intel from the National Vulnerability Database (NVD), Common Vulnerabilities and Exposures (CVE), and open-source programs through MapReduce. Apache Hadoop extracts detection features from this data, which the RS then uses to assess the safety of the code. With IntelliSense technology included, the RS suggests possible fixes based on similar cases. The solution presented is judged in terms of usability and scalability. The usability study is conducted with A/B tests and fourteen participants. When the results are compared with FindBugs (a vulnerability tool), the ANOVA (analysis of variance) supports the application of RSs for secure coding. The scalability component is assessed with ten random Google Code projects scanned for SQL Injection. VulIntel scales as expected when scanning these projects. Many improvements are advocated, like using DL to select the vulnerability features and supporting more languages and IDEs, among other enhancements. This article focuses on helping and training professionals to upgrade their coding skills while building new products. Despite the few participants in the usability study, the report is accurate and presents several research lines.

McDonnell et al. [172] introduced CyberBERT (Cyber Bidirectional Encoder Representations from Transformers), a deep session-based RS that targets the intentions of anonymous users through dynamic state models. These schemes use bidirectional transformers (enhance the user representation) to seize the user intention during each session. This capture allows CyberBERT to perform malware recognition and next-click prediction in critical areas like aviation and aerospace. The tool proposed is evaluated in two scenarios. The first case uses the Windows PE Malware API dataset and compares the ability of CyberBERT dispatching malware classification with LSTM and Transformer.Footnote 8 The second assesses the next-click prediction with YOOCHOOSE (RecSys 2015 Challenge) against item-KNN and other methods. The results show that CyberBERT surpasses all the state-of-the-art approaches explored in both scenarios. Regarding malware recognition, CyberBERT achieves higher F1-score and weighted F1-score values. The P@20 and MRR@20 measures demonstrate higher efficiency for the user intention prediction. Future research plans involve the development of an aerospace cybersecurity dataset, malware detection in CyberBERT, and other extensions. It is an interesting project that researches the prediction of user preferences and intentions without prior knowledge.
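
For readers unfamiliar with the two ranking measures mentioned above, the following is a minimal sketch of how P@k and MRR@k are commonly computed for next-click prediction, where each test case has exactly one correct next item. This single-target definition of P@k (the fraction of cases whose true item appears in the top k) is an assumption based on common practice in session-based recommendation; the function name and the toy data are hypothetical and are not taken from the CyberBERT paper [172].

```python
from typing import Sequence, Tuple


def hit_and_mrr_at_k(
    ranked_lists: Sequence[Sequence[str]], targets: Sequence[str], k: int = 20
) -> Tuple[float, float]:
    """Return (P@k, MRR@k) for next-item prediction with one true item per case."""
    if not targets or len(ranked_lists) != len(targets):
        raise ValueError("ranked_lists and targets must be non-empty and equally long")
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            hits += 1
            # Ranks are 1-based: a target in first position contributes 1.0.
            reciprocal_rank_sum += 1.0 / (top_k.index(target) + 1)
    n = len(targets)
    return hits / n, reciprocal_rank_sum / n


# Hypothetical toy sessions: each inner list is a model's ranking of candidate items.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
true_next = ["b", "d", "z"]
p_at_2, mrr_at_2 = hit_and_mrr_at_k(ranked, true_next, k=2)
print(p_at_2, mrr_at_2)
```

P@k only rewards placing the true item anywhere in the top k, whereas MRR@k also rewards placing it near the top, which is why the two are usually reported together.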

4.3 Support tools

Rasmussen et al. [7] researched methods to boost real-time monitoring and aid Cybersecurity operators in responding more quickly to alerts generated by an IDS. Their approach rests on a graph-based visualization of correlated IDS output and defense recommendations obtained through ML on historical analyst behavior. They created a Cybersecurity environment called Network Intrusion Management Benefiting from Learned Expertise (NIMBLE), a prototype that evaluates the previously mentioned methods. This framework is tested with eighteen professional analysts, leveraging alert data gathered from operational monitoring systems and several metrics (accuracy, response time, confidence, and ratings). Although NIMBLE does not replicate the analysts’ performance, it improves analyst accuracy with defensible recommendations and a visual display. Future research includes multiple linked representations of alert information (with querying and filtering mechanisms), developing an interactive incident diagram to train new staff, and other enhancements. In general, NIMBLE can simplify how threats are displayed, suggest actions, and decrease the time IT analysts need to treat them. Another interesting conclusion is that analysts prefer suggestions accompanied by explanations. Explainable AI (XAI) could contribute in this respect.

Soldo et al. [173] developed a multi-level model to predict future blacklists (list of attack sources). Inspired by Netflix RS, they frame this problem as an implicit RS so that ML approaches can be used to predict malicious activity. Their model uses time series (TN) to assess the temporal dynamics of the attacks and two neighborhood methods: victim neighborhood (kNN) and joint attacker-victim neighborhood. The first approach captures the similarity between victims being attacked by the same sources simultaneously. The second technique adds an extra layer where co-clustering is performed to find a group of intruders that attacks a group of victims at the same time. The testing scenario uses one month of real logs of malicious IP sourcesFootnote 9 and the hit count metric (number of attackers in the blacklist correctly predicted). In this scenario, the proposed model outperformed other state-of-the-art methods such as the local worst offender list, global worst offender list, highly predictive blocklist, and exponential weighted moving average Joint attacker-victim neighborhood. The proposed method is also evaluated against pollution (random false positives) and poison events (prediction is affected by malicious contributors). Results show that the method defended improves the accuracy and robustness against these events. In the future, the authors intend to expand their prediction problem to other dataset features, build a prototype, and make other improvements. Although this paper covers one of the most researched topics in this area (attack prediction), it innovates by analyzing the similarity between victims and attackers.
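
To make the evaluation setup above more concrete, the sketch below scores attack sources with an exponentially weighted moving average of their past daily activity, builds a fixed-size blacklist from the top scores, and computes the hit count against the sources that actually attack next. It illustrates only a temporal baseline of the kind listed among the compared methods together with the hit count metric; it is not the combined neighborhood model of Soldo et al. [173]. The addresses, the counts, and the smoothing factor `alpha` are hypothetical.

```python
from typing import Dict, List, Set


def ewma_scores(history: Dict[str, List[int]], alpha: float = 0.5) -> Dict[str, float]:
    """Score each source by an EWMA of its daily attack counts (oldest day first)."""
    scores: Dict[str, float] = {}
    for source, counts in history.items():
        score = 0.0
        for count in counts:
            # Recent days weigh more: each older contribution decays by (1 - alpha).
            score = alpha * count + (1 - alpha) * score
        scores[source] = score
    return scores


def predict_blacklist(history: Dict[str, List[int]], size: int, alpha: float = 0.5) -> List[str]:
    """Return the `size` highest-scoring sources (ties broken alphabetically)."""
    scores = ewma_scores(history, alpha)
    return sorted(scores, key=lambda s: (-scores[s], s))[:size]


def hit_count(blacklist: List[str], actual_attackers: Set[str]) -> int:
    """Number of blacklisted sources that really attacked in the test window."""
    return sum(1 for source in blacklist if source in actual_attackers)


# Hypothetical three-day history of attack counts per source (documentation IP range).
history = {
    "192.0.2.1": [4, 0, 0],
    "192.0.2.2": [0, 2, 4],
    "192.0.2.3": [1, 1, 1],
}
blacklist = predict_blacklist(history, size=2)
hits = hit_count(blacklist, {"192.0.2.2", "192.0.2.9"})
print(blacklist, hits)
```

A source that was very active long ago ("192.0.2.1") ends up below a source with steady recent activity, which is the temporal effect such a score is meant to capture.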

Nunnally et al. [8] created another visualization module named NAVSEC, an RS prototype designed to simplify the navigation of 3D network security visualization tools. This 3D platform suggests visualizations and interactions to novice users while improving detection, identification, and inspection of the network’s attacks.

Fig. 4 NAVSEC architecture [8]

As illustrated in Fig. 4, the NAVSEC system has four main modules: Active User, Expert User Community, Interaction Database, and Recommender. The first component represents a user navigating the visualization tool, while the Expert User Community is a set of users with expertise in the network security and visualization fields. The Interaction Database is responsible for collecting and storing the interaction sequences. The last module parses the data and computes a set of interactions to recommend to the active user. NAVSEC is evaluated with stealthy port scanning attacks. In this scenario, the authors observe how many interactions are made when NAVSEC is enabled or disabled and whether the active user’s interactions are similar to the experts’ set of actions. The results show that NAVSEC helps novice users find threats that are otherwise only detectable by expert users. Future work includes the extension of NAVSEC to handle IPv6 addresses and other features. Compared to NIMBLE, NAVSEC provides navigation support but doesn’t give any suggestions regarding defense options. Nunnally et al. also go over the difference between NIMBLE and NAVSEC. The paper is generally well organized, with several images that aid comprehension.

Campiolo et al. [9] designed a collaboration model to suggest the most relevant Cybersecurity alerts for network administrators. Based on a hybrid approach (CB and CF), the Cybersecurity alerts are extracted from unstructured external data to gain more insight and to focus on the alerts that require more attention. The model follows the principles established by Ricci et al. [181], that is, “Who are the users?”, “What are the data properties?”, and “What is its application?”. After gathering information about the requirements of the model (responses from administrators), Campiolo et al. defined the data model specification to create their hybrid approach. The proposed model is evaluated with an offline experiment in which both methods (CF and CB) are tested with precision and recall metrics. Since no dataset with Cybersecurity alerts was available, the authors use the MovieLens dataset and a website built to recommend movies based on user ratings. The offline experiment is supported by a web application called Konsilo. They concluded that RSs could improve access to vital information for network administrators. The future work includes using GT-EWS,Footnote 10 a project that enables the suggestion of alerts and filtering of false positives, and the release of Konsilo for better evaluation. This project proposes a novel method for gathering data regarding Cybersecurity alerts. It would be interesting to have more discussion concerning the evaluation.
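
As a brief illustration of the two offline metrics used in this evaluation, the sketch below computes precision and recall for a single recommendation list. The alert identifiers and the function name are hypothetical; the code only restates the standard definitions and does not reproduce the experiment of Campiolo et al. [9].

```python
from typing import Iterable, Tuple


def precision_recall(recommended: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = relevant recommendations / all recommendations;
    recall = relevant recommendations / all relevant items."""
    recommended_set = set(recommended)
    relevant_set = set(relevant)
    true_positives = len(recommended_set & relevant_set)
    # Return 0.0 instead of dividing by zero when a set is empty.
    precision = true_positives / len(recommended_set) if recommended_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: four alerts suggested, three alerts the administrator cared about.
suggested = ["alert-1", "alert-2", "alert-3", "alert-4"]
wanted = ["alert-2", "alert-4", "alert-7"]
precision, recall = precision_recall(suggested, wanted)
print(precision, recall)
```

Here two of the four suggestions are relevant (precision 0.5) and two of the three relevant alerts were found (recall of about 0.67).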

Abuhussein et al. [174] studied the incorporation of recommender features in cloud computing (CC). They proposed CSSR, a Cloud Services Security Recommender, to ease the comprehension of security and privacy (S&P) risks within businesses. The proposed service starts by identifying a list of S&P attributes from the stakeholder perspective that must be used as controls to minimize the cloud’s vulnerabilities. Later, considering the stakeholder needs, the CSSR model recommends an adequate security solution based on the Attack Vector, Operational Impact, Defense, Information Impact, and Target Taxonomy. For the CSSR performance assessment, the authors use a real-world example (code-spaces, git, and others) and test the method with ten graduate students (who specialized in the area). Their feedback was that CSSR is easy to use and promotes transparency but is complicated regarding accountability. Figure 5 shows the components integrated into the CSSR architecture.

Fig. 5 CSSR architecture [174]

In the end, despite the ability to educate cloud consumers about potential S&P issues and other capabilities, the authors identify some problems that might affect the CSSR in the long run, such as the non-cooperation between different cloud service providers. In our opinion, and considering the effects of the current pandemic, the research community and industry should invest more time in approaches of this nature for cloud services, since the cloud is one of the areas most targeted by hackers [182].

Sayan et al. [175] developed an intelligent Cybersecurity assistant called ICSA, a tool capable of providing intelligent assistance to a human security specialist. Their system is designed to detect threats, predict future attacks, and recommend actions to stop them. Inside its architecture are KB and hybrid-based features to enhance the assistant capabilities. The proposed system has many components: cyber situation analysis and recognition, cyberattack trend and impact analysis modules, vulnerability and causality analysis modules, and automated and semi-automated responses. They work as a unit to improve intrusion detection and provide valuable suggestions to IT professionals. From our research, no other published ideas involving intelligent assistants have been presented so far. As suggestions, the authors should justify why ICSA is efficient and capable of assisting Cybersecurity professionals. The difficulty in assessing such an approach could be one reason for not knowing the current state of ICSA (published in 2017).

Franco et al. [176] designed MENTOR, a tool that provides recommendations to end-users and network operators about the adequate protection service in particular scenarios. MENTOR tries to meet user demands such as region, deployment time, and price conditions, among other requirements, to find the proper service to prevent and mitigate cyberattacks through the correlation of information. Four components compose the MENTOR recommender process: Extractor, Classifier, Retriever, and Recommendation Engine. The first element receives data about the infrastructure under attack and the characteristics of the attack. The extraction and classification steps analyze and correlate the information with the type of attack to search for solutions against it. Finally, the Retriever collects a list of possible protection services, and the Recommendation Engine, depending on the customer profile, suggests the most appropriate protection service. This tool’s performance is evaluated with four similarity measures (Euclidean distance, Manhattan distance, Cosine similarity, and Pearson correlation) on a dataset containing 10,000 randomly generated protection services. The results show that MENTOR could suggest the protection service considering the price, geolocation, and other parameters. Instead of choosing the best service in terms of performance, it chooses the cheapest (with distance-based methods). The authors argue that MENTOR is a valid option to recommend services. Yet, it requires more development, such as using ML to combine different similarity metrics, more research on Cybersecurity decision-making to improve actions, and better recognition of cyberattacks, among other aspects. This project differs slightly from previous approaches since the main goal is to suggest an adequate security service through user customization instead of defensive actions.
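
The four measures named above can be sketched in a few lines. The snippet below implements them with the standard library and ranks a handful of services against a customer profile. The feature vectors (imagined here as normalized price, deployment time, and coverage), the service names, and the helper `rank_services` are hypothetical; the snippet illustrates the measures themselves and is not MENTOR's implementation.

```python
import math
from typing import Callable, Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # 0.0 when either vector is all zeros


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    # Pearson correlation equals the cosine similarity of the mean-centered vectors.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def rank_services(
    profile: Sequence[float],
    services: Dict[str, Sequence[float]],
    distance: Callable[[Sequence[float], Sequence[float]], float] = euclidean,
) -> List[str]:
    """Order services from closest to farthest from the customer profile."""
    return sorted(services, key=lambda name: distance(profile, services[name]))


# Hypothetical normalized features: [price, deployment time, coverage].
profile = [0.3, 0.5, 0.9]
services = {
    "svc-a": [0.2, 0.5, 0.9],
    "svc-b": [0.8, 0.1, 0.4],
    "svc-c": [0.3, 0.4, 0.8],
}
print(rank_services(profile, services, euclidean))
print(rank_services(profile, services, manhattan))
```

Distances are sorted ascending (smaller means closer), whereas cosine similarity and Pearson correlation would be sorted descending; mixing the two conventions is an easy mistake when comparing measures side by side.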

Sula tackled DDoS attacks with RSs [177]. The author developed an RS for offsite network protection services called ProtecDDoS so that organizations could have a more accurate decision-making process. It was also intended to provide the user with essential information about the most suitable DDoS protection services based on filters chosen beforehand. The approach follows a hybrid scheme with requirements like attack type protection, service type (reactive or proactive), coverage protection region, deployment time, and others. Sula uses Cosine Similarity, Euclidean Distance, Manhattan Distance, Minkowski Distance, and Pearson Correlation to measure the similarity between two entities and evaluate the overall system’s performance. A blockchain-based marketplace supports the proposed solution for more transparency, enhanced security, and improved traceability. The RS method is tested on over 5000 entries. In short, the proposed technique demonstrates promising recommendation accuracy and blockchain cost results. As future work, the author plans to use ML to enhance the efficiency of the recommendation model and to improve the functionalities provided by the blockchain-based marketplace, among other advances. In general, this thesis can improve decision-making for DDoS cases. The research community could investigate other protection services that use recommender services to counter cyberattacks other than DDoS, or adapt ProtecDDoS.

Ayala et al. [6] built a hybrid RS to help Cybersecurity experts deal with anomalies and vulnerabilities while minimizing the response time and workload. This tool combines CF with a KB to prioritize and handle threats with different criticality. Together with experts and security entities (Symantec, OWASP, NIST, and others), the CF model creates the knowledge base about the worst vulnerabilities and anomalies. The KB component is added to mitigate data scarcity. Regarding the system evaluation, the hybrid solution is assessed with six experts from the academic and industry sectors using the technology acceptance model. The results show that the RS could help all the experts spend less time fixing the issues. From our point of view, this project is the first to study the optimization of the response time provided by human operators. It is a topic of extreme importance that requires more development, especially with the SOC shortage [183].

Ahmed and Nanath [178] reviewed the state of Cybersecurity in Middle Eastern Small and Medium Enterprises (SMEs) and pointed out an RS to mitigate the issues in these organizations. The authors surveyed this area with a questionnaire where several areas were analyzed (network security, endpoint security, and other levels of protection). The feedback shows poor Cybersecurity awareness, limited vendor solutions, and other challenges. Ahmed and Nanath planned a theoretical RS dedicated to SMEs to cope with these issues. The proposed method uses a flowchart recommending the most appropriate solution based on an organization and its Cybersecurity plans. In the future, they intend to implement a web interface to their prototype. Despite suggesting an RS, almost no insights regarding recommendation approaches and evaluation are given.

Huff et al. [179] proposed an RS capable of simplifying the match-up between an organization’s hardware and software and the Common Platform Enumeration (CPE) used by standard vulnerability sources such as NVD. It uses natural language processing (NLP), fuzzy matching (FM), and ML to reduce the effort required by human operators in matching software products to vulnerabilities. In the NLP stage, the software names are converted into word vectors in a standardized format that eases the similarity analysis between them. The FM stage measures the similarity between the CPE and the software names through the cosine between two word vectors. In the last phase, ML orders and filters the FM results for the analysts. The CPEs in the ordered set have an order (Highest, High, Medium, Low, Lowest, and Reject) and a level classification (vendor and product). Based on this output, the recommender suggests the CPEs that should be targeted first. The evaluation compares performance with and without the RS in a scenario where 50 Microsoft Windows software inventory package names and 50 hardware inventory names are studied. For software, without the RS, the average number of manual searches was 119, with 8% of them being conclusive; with the RS, the average number was 2, with 40% being conclusive. For hardware, without the RS, the average number of manual searches was 34, with 28% of them being conclusive; with the RS, the average number was 2, with 48% being conclusive. This analysis shows that the RS can deliver more accurate results with far fewer manual searches. Future work includes further testing on a more extensive and realistic dataset. This is one of the first ideas that uses RSs to tackle the difficulty of matching a company’s hardware and software to CPEs.
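
The fuzzy-matching step described above, cosine similarity between word vectors of an inventory name and candidate product names, can be sketched as follows. This is a simplified bag-of-words illustration under our own assumptions: the tokenizer, the candidate strings (written as plain "vendor product" pairs rather than real CPE identifiers), and the function names are hypothetical, and the ML ordering and filtering stage of Huff et al. [179] is not reproduced.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(name: str) -> Counter:
    """Lower-case the name and count its alphanumeric tokens (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", name.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(inventory_name: str, candidates: List[str]) -> List[Tuple[float, str]]:
    """Return (similarity, candidate) pairs, best match first."""
    query = tokenize(inventory_name)
    scored = [(cosine(query, tokenize(candidate)), candidate) for candidate in candidates]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical inventory entry and hypothetical vendor/product candidates.
inventory_entry = "Mozilla Firefox 91.0 (x64 en-US)"
candidates = ["mozilla firefox", "mozilla thunderbird", "google chrome"]
ranking = rank_candidates(inventory_entry, candidates)
for score, candidate in ranking:
    print(f"{score:.3f}  {candidate}")
```

Version and locale tokens in the inventory name lower every score equally, so the ranking is still useful even though no candidate reaches a similarity of 1; a later stage can then bucket the scores into priority classes.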

Brisse et al. [180] developed KRAKEN, a KB-based RS that suggests exploratory paths within log data to human operators during incident analysis. The idea is to prevent analysts from spending too much time on data collection tasks during an investigation. This tool links knowledge from advanced persistent attack sources (for example, MITRE ATT&CKFootnote 11) into a visual framework (ZeroKit). KRAKEN is evaluated using a subset of the tc3 (Transparent Computing exercise 3) datasetFootnote 12 from the Defense Advanced Research Projects Agency. Seven Cybersecurity experts assess different APT attacks from this dataset with KRAKEN’s assistance. The feedback provided by the human operators shows that KRAKEN provided, most of the time, valuable suggestions without distracting them during incident analysis. In the future, the authors intend to use KRAKEN with other types of RS and fix some issues in this tool (the simple additive weighting decision model tends to overrate objects in specific scenarios). Overall, this paper shows promise in using RSs to speed up and ease Cybersecurity tasks and to decrease the effects of the SOC shortage.
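
The simple additive weighting (SAW) decision model mentioned above can be illustrated with a short sketch: each criterion is normalized by its maximum across the alternatives, multiplied by a weight, and summed. The criteria names, weights, and candidate suggestions below are hypothetical and are not taken from KRAKEN [180]; all criteria are assumed to be benefit criteria (higher is better).

```python
from typing import Dict


def saw_scores(
    alternatives: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Simple additive weighting: sum of weight * (value / column maximum) per alternative."""
    maxima = {
        criterion: max(values[criterion] for values in alternatives.values())
        for criterion in weights
    }
    scores: Dict[str, float] = {}
    for name, values in alternatives.items():
        scores[name] = sum(
            # A criterion whose maximum is 0 contributes nothing to any alternative.
            weights[c] * (values[c] / maxima[c] if maxima[c] else 0.0)
            for c in weights
        )
    return scores


# Hypothetical exploration suggestions scored on two hypothetical criteria.
suggestions = {
    "pivot on host": {"relevance": 0.9, "novelty": 0.2},
    "pivot on user": {"relevance": 0.6, "novelty": 0.8},
}
weights = {"relevance": 0.6, "novelty": 0.4}
scores = saw_scores(suggestions, weights)
best = max(scores, key=scores.get)
print(scores, best)
```

Because every criterion is rescaled by its own maximum, an alternative that leads on a single heavily weighted criterion can dominate the ranking, which gives some intuition for the overrating behavior the authors report.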

As shown in Table 2, since 2010, the research community has been exploring the potential of RSs in Cybersecurity. RSs are used as predictors of future breaches, navigation assistants, and for other purposes that augment security capabilities. Regarding the approaches, most related work combines CF with KB, CB, or another method. With the recent developments in ML and DL, we expect more ideas where the recommendation technique involves intelligent and dynamic procedures (for instance, the paper [175]). Even though the current related work has achieved satisfactory results, many open issues remain when studying the coexistence of both topics. In the next section, we address some of these issues.

Table 2 Recommender Systems in Cybersecurity related work summary

5 Future work

With the growing impact of cyberattacks and the shortage of IT experts in Cybersecurity [183], securing data has become a troublesome task. Nowadays, organizations are overwhelmed by the many alerts raised in their networks and by the tools required to track them [184].

The application of RSs in Cybersecurity could aid security teams in analyzing and processing large quantities of alerts, making predictions, and recommending efficient responses to remove threats. For years, RS tools have been used in e-commerce to suggest items to users based on their preferences and past behaviors. However, they can also be applied in other fields such as Cybersecurity. For instance, they can generate prioritized lists of defense countermeasures against cyberattacks, detect internal vulnerabilities, monitor network security, and execute other operations [10].

Reflecting on the research collected, we discuss several possible lines of future work:

There is a severe lack of datasets focused on the user side and on how individuals interact with incidents; such datasets are either inaccessible or non-existent. The absence of this information hinders the application of RSs to assist human operators in Cybersecurity, since the recommendation tool does not know which actions it should suggest. Most literature reviews and datasets in Cybersecurity focus on detecting attacks and hidden system patterns (intrusion detection) [185]. While better tracking systems are relevant, optimizing the user’s operations should take priority, especially as the shortage of cyber professionals keeps getting worse [186]. From our understanding, synthetic dataset generators accompanied by intelligent user interfaces (IUI) could solve this problem. The first component could help companies minimize costs and privacy concerns and perform better tests. At the same time, IUI tools would provide high and flexible customization, allowing the simulation of several dataset features and enhanced data analysis. Over the years, the research community has been exploring artificial systems for dataset creation. For instance, Kim and Kim [187] developed CTIMiner (Cyber Threat Intelligence Miner), an automated dataset generator that retrieves threat data from security reports and malware databases. The results showed that CTIMiner could generate datasets useful for research projects. Boggs et al. [188] also implemented an artificial dataset generator called Wind Tunnel. This approach builds synthetic datasets with normal and attack data for various security levels and focuses on defending against web application attacks. Wind Tunnel addresses the lack of realistic datasets, enabling the evaluation of security controls in different layers. Regarding IUI, these interfaces ease the connection between humans and computers and automate several operations [189]. This automation releases analysts from tedious tasks, allowing them to focus on other activities. Akinsola et al. [190] studied the adoption of AI for intelligent user interfaces in Cybersecurity threat modeling to deliver more accurate and quicker responses. According to them, IUI structures can improve the understanding of potential threats. Other, more traditional techniques are also used to gather user data. For example, Sayan [184] gathered information about users through questionnaires, and Lyons [169] employed sensors to build her database. With more information available (datasets) regarding the user side and intelligent frameworks, security teams could provide faster and more appropriate procedures against the encountered vulnerabilities.

Another consideration is that most projects use the same hybrid approaches to recommend defensive options or forecast possible attacks [191]. Furthermore, the weighted hybrid represents the majority of strategies presented in the area. Despite the complexity they entail, hybrid architectures allow multiple strategies to cooperate and achieve better results. The concern, however, lies in the methods used to produce the recommendations. Among the ideas presented in Sect. 4, CF is the approach most often combined with other recommendation approaches (CB and KB) when suggesting actions or discovering attack patterns. With the proliferation of social media, CoB methods could have more representation, allowing the usage of more meaningful data [192]. For example, in Cybersecurity solutions, analyzing social media data could provide insights into future attacks. The deployment of new types of RSs, such as conversational-based [193], reinforcement learning-based [194], stream-based [195], and IoT-based (RSIoT) [196], will raise new problems for Cybersecurity since they involve different requirements and dynamics. Such systems would require more sources of data besides user-item interactions. Context-aware, multi-agent, graph, and Social IoT approaches can investigate users’ relationships and preferences. More recently, research has been moving toward applying DL and ML in RSs. Adopting DL models such as NN, autoencoders, and other structures has shown promising results in the recommendation process but has made the output harder to interpret due to their black-box nature [197]. The next topic addresses explainability and its role in suggestion generation.
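
As a reminder of what a weighted hybrid does, the following sketch linearly combines the scores of two recommenders. It is a generic illustration: the recommender labels, the action names, and the weights are hypothetical and are not taken from any of the surveyed systems.

```python
def weighted_hybrid(score_lists: dict, weights: dict) -> list:
    """Combine per-recommender scores (each assumed in [0, 1]) by a fixed linear weighting.

    Items missing from a recommender contribute a score of 0 for that recommender.
    Returns (item, combined_score) pairs sorted by decreasing score.
    """
    total = sum(weights.values())
    combined = {}
    for recommender, scores in score_lists.items():
        w = weights[recommender] / total  # normalize so the weights sum to 1
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores for defensive actions from a CF and a KB recommender.
cf_scores = {"block_ip": 0.8, "isolate_host": 0.4, "reset_password": 0.6}
kb_scores = {"block_ip": 0.5, "isolate_host": 0.9}

ranking = weighted_hybrid(
    {"cf": cf_scores, "kb": kb_scores},
    {"cf": 0.6, "kb": 0.4},
)
```

The fixed weights make the combination easy to implement and inspect, which may partly explain its popularity, but they also mean the hybrid cannot adapt when one recommender becomes unreliable for a given context.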

Explainable recommendations can increase the trustworthiness and reliability of RSs because users can comprehend the reasoning behind suggestions. Delivering accurate recommendations to Cybersecurity experts is essential, but it is only one of many critical factors when handling alerts. Explainability could have a role to play in Cybersecurity since it could bring more transparency and trust, especially in an area where careful analysis is needed to avoid hazards or leaks. Furthermore, explainability coupled with the intervention of SOC analysts could boost confidence in these systems. In a real scenario, explainability may lead to a better-planned and faster response than a non-interpretable system would allow. Among the related work, only one study addressed explainable recommendations in Cybersecurity [7]. Nonetheless, some recent articles review the security robustness of explainability approaches in cyberspace [198, 199]. With the exploration of DL and XAI, we can expect more developments in the explainability field.

RSs could also ease the information overload in SIEMs and the comprehension of complex structures in SOARs. SIEMs and SOARs are used for security monitoring, threat detection (zero-days and other anomalies), forensics and incident response, automation, orchestration, and other activities, but they lack intelligence [155, 200]. We believe their integration with RSs could improve incident treatment and data visualization, reduce alert fatigue, and support other operations [201]. For example, SIEMs rely on IT analysts to make decisions when dealing with a ticket. If a system provided enhanced incident analysis, the operator would take less time to close it. Another case that would help SIEMs and SOARs is an RS capable of suggesting the most skilled user considering the ticket type and user characteristics. An RS could also guide the user through these systems to decrease complexity. Despite these potential features, the related work does not mention any integration with SIEMs or SOARs. More research is expected on the impact of RSs on these applications.

As shown in Sect. 2.2, traditional CF systems are prone to several injection attacks and lack privacy. Intruders/shillers build fake users to manipulate item ratings and access personal data, affecting the fairness and performance of an RS. Since it is impossible to prevent all these attacks, researchers and companies put effort into shilling detection strategies [202, 203, 204]. In the Cybersecurity context, attackers could build fake incidents to gather knowledge about the victim’s capacity to handle false alarms (time, countermeasures, and other features) or to divert the attention of IT experts. Data privacy also raises concerns due to the necessity of collecting large amounts of data and the capability of inferring user interests. The effects of poor privacy practices are more severe in Cybersecurity than in e-commerce. In e-commerce, a leak may expose user data (name, credit card, and other data). In Cybersecurity, disclosing data from an RS may give cybercriminals free access to create more significant problems (for example, affecting the economy, health, energy infrastructures, and other areas). GDPR and privacy-preserving techniques limit this abuse by third-party systems and criminals. Moreno et al. provided a set of guidelines to build GDPR-compliant RSs [205]. These suggestions allow the adaptation of information systems to current privacy policies. Section 3.1 has other examples showing the impact of GDPR on RSs. For other types of RS, we found no studies that test their robustness.
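
The effect of a profile injection (shilling) attack can be illustrated with a toy push attack on an item’s average rating. The numbers below are invented for illustration, and the function name push_attack_effect is hypothetical; real attacks and detectors operate on full user profiles rather than a single item.

```python
from statistics import mean


def push_attack_effect(genuine_ratings, n_fake, fake_rating=5):
    """Return the item's mean rating before and after injecting fake profiles.

    Each fake profile gives the target item the same (maximum) rating,
    which is the simplest form of a push attack.
    """
    before = mean(genuine_ratings)
    after = mean(list(genuine_ratings) + [fake_rating] * n_fake)
    return before, after


# Hypothetical ratings on a 1-5 scale from five genuine users (assumed data).
genuine = [2, 3, 2, 1, 3]
before, after = push_attack_effect(genuine, n_fake=5)
```

Even this small example shows that a handful of injected profiles can move an unpopular item well above its genuine average, which is why detection strategies such as those in [202, 203, 204] look for groups of unusually similar profiles.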

More research should be directed at the influence of serendipity, over-specialization, and dynamism in Cybersecurity. The RS attributes that have received the most attention are scalability, intrusion attacks, privacy, and explainability, among others. These properties account for the majority of the Cybersecurity issues studied so far. Nonetheless, other unexplored attributes like serendipity and over-specialization could impact the system’s security. From our perspective, addressing these two concepts could provide novel and less repetitive countermeasures against breaches. Furthermore, it could slow attackers’ understanding of the detection and defensive structures since distinct strategies are employed. For example, Cybersecurity systems that depend on predefined and static actions to handle breaches are easy targets for intrusions. ML is being used for both protection and hacking purposes, which raises the demand for more “clever” interventions [206]. However, data scarcity and acquisition (information retrieval techniques need to respect privacy policies) influence the performance of these approaches. Dynamic systems also deserve more research. While cybercriminals have evolved, defensive structures have remained primarily static (based on reactive approaches instead of proactive methods) [207]. No matter the number of IT professionals available, cyberattacks will continue to succeed unless adaptability and agility techniques are implemented in Cybersecurity. The COVID pandemic accelerated the need for more adaptable solutions [208].

Among the stated considerations, the most challenging topic is the vulnerability of RSs and the lack of confidence they may cause. The idea of using these tools in Cybersecurity is not to create doubt in the decision process but rather to ease the workflow.

6 Conclusions

Cybersecurity and RSs are two well-known areas with many research projects and tools. DL models could be explored further; yet, due to hardware constraints and lack of transparency, RSs are a more suitable choice for integration with Cybersecurity. In the absence of proper surveys on combined approaches to Cybersecurity RSs, our strategy for this survey was first to review the main concerns and methodologies in each topic. Then, we presented the state of the art of their union together with an in-depth discussion.

This paper shows there is little work on recommender properties in Cybersecurity tools. Some research projects focus on predicting future attacks to decrease the time between discovering and handling threats. Others concentrate on the defensive side so that users can respond faster while applying a more appropriate solution. With the increasing number of cyberattacks and the lack of people capable of handling these situations, we can expect more developments in this area that may turn the scenario around [106, 107]. One of the main areas to be tackled within the next few years is more regulation regarding these systems’ ethical principles, such as transparency, fairness, and trust. The increasing popularity of DL and its advantages may also draw more attention to these subjects.