Big Data for Cybersecurity
Cybersecurity deals with the protection of computer systems; information and communication technology (ICT) networks, such as enterprise networks, cyber-physical systems (CPS), and the Internet of Things (IoT); and their components, against threats that aim at harming, disabling, and destroying their software and hardware.
At the beginning of the Digital Revolution, the primary goal of computer hackers attacking ICT infrastructures and networks was simply to gain recognition from like-minded people. The consequences were mostly downtimes and the often costly need to recover and clean up compromised systems, and were therefore manageable with rather little effort. Since then, the relevance of ICT networks has increased massively, and they have become the vital backbone of economic as well as daily life. As their importance and complexity continuously evolve, cyberattacks against them also become more complex, diverse, and consequently difficult to detect. Today’s attacks usually target stealing intellectual property, sabotaging systems, and extorting ransom money. Sophisticated and tailored attacks are called advanced persistent threats (APT) (Tankard 2011) and usually result in high financial loss and a tarnished reputation.
Since protection against sophisticated cyberattacks is difficult and highly challenging, two measures are of primary importance: (i) timely detection, to limit the caused damage to a minimum, which can be achieved by applying smart intrusion detection systems (IDS), and (ii) preventive actions, such as raising situational awareness by analyzing and sharing cyber threat intelligence (CTI). Because of the continuously growing interconnection of digital devices, the amount of generated data and the number of new malware samples per day grow at an exorbitant speed. Moreover, particular types of malware, such as worms, rapidly evolve over time and spread over the network in unpredictable ways (Dainotti et al. 2007; Kim et al. 2004). Thus, it becomes more and more difficult to analyze all generated data that might include hints of cyberattacks and malicious system behavior. This is the reason why cybersecurity has become a big data problem. According to the four Vs of big data (Chen et al. 2014), big data is defined by (i) volume (large amounts of data), (ii) variety (heterogeneous and manifold data), (iii) velocity (rapidly generated data that has to be analyzed online, i.e., when the data is generated), and (iv) value (extraction of high value from a large amount of data).
This section deals with the two main research areas of cybersecurity that are related to big data: (i) intrusion detection, which deals with revealing cyberattacks from network traces and log data, and (ii) CTI analysis and sharing, which supports incident response and decision-making processes. The section therefore splits into two main parts, which are structured the same way: both first answer the question why the considered topic is a big data problem and then summarize the state of the art. Finally, future directions for research on big data for cybersecurity are presented, including a discussion of how intrusion detection and CTI analysis and sharing can benefit from each other.
Countermeasures and defense mechanisms against cyberattacks split into tools that aim at timely detection of attack traces and tools that try to prevent attacks or deploy countermeasures in time. Technical preventive security solutions are, for example, firewalls, antivirus scanners, digital rights management, account management, and network segmentation. Tools that aim at detecting cyberattacks and therefore only monitor a network are called intrusion detection systems (IDS). Currently applied IDS mostly implement signature-based black-listing approaches. Black-listing approaches prohibit system behavior and activities that are defined as malicious. Hence, such solutions can only detect already known attack patterns. Nevertheless, these established tools build the vital backbone of an effective security architecture and ensure prevention and detection of the majority of threats. However, the wide range of application possibilities of modern devices and the fast-changing threat landscape demand smart, flexible, and adaptive white-listing-based intrusion detection approaches, such as self-learning anomaly-based IDS. In contrast to black-listing approaches, white-listing approaches permit a baseline of normal system behavior and consider every deviation from this ground truth as malicious activity. Thus, it is possible to detect previously unknown attacks.
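The contrast between the two paradigms can be sketched as follows. This is a minimal, illustrative sketch, not a real IDS: the signature strings, the whitelisted event types, and the event format are assumptions made for the example.

```python
# Minimal sketch contrasting black-listing and white-listing detection.
# Signatures and the whitelist baseline are illustrative, not real rules.

BLACKLIST_SIGNATURES = ["/etc/passwd", "DROP TABLE", "cmd.exe"]

# White-list baseline: event types observed during attack-free operation.
WHITELIST_BASELINE = {"login_ok", "file_read", "http_get"}

def blacklist_detect(event: str) -> bool:
    """Flag the event if it matches any known-malicious signature."""
    return any(sig in event for sig in BLACKLIST_SIGNATURES)

def whitelist_detect(event_type: str) -> bool:
    """Flag the event if its type deviates from the learned baseline."""
    return event_type not in WHITELIST_BASELINE

# A novel attack without a known signature escapes the blacklist ...
print(blacklist_detect("unseen_exploit payload"))   # False (missed)
# ... but an unusual event type deviates from the whitelist baseline.
print(whitelist_detect("registry_modify"))          # True (flagged)
```

The sketch mirrors the trade-off described above: the blacklist misses anything outside its signature set, while the whitelist flags every deviation from the baseline, including benign but previously unseen behavior (a source of false positives).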
Intrusion Detection: A Big Data Problem
Because of the rapidly growing number of interconnected digital devices, the amount of generated data increases as well. Thus, the volume of data that can potentially be analyzed for intrusion detection is enormous, and approaches are required that can analyze and also correlate a huge number of data points. Furthermore, there is a wide variety of data types available for intrusion detection; the spectrum ranges from binary files over network traffic to log data, just to name a few. Additionally, the representation of this data differs depending on the applied technologies, implementations, and used services, as well as operating systems, which is why the analysis of data created in an ICT network comprising various different types of devices is not straightforward. Furthermore, the data is generated quickly, and its velocity increases with the number of connected devices. Thus, to provide online intrusion detection, i.e., to analyze the data when it is generated, which ensures timely detection of attacks and intruders, big data approaches are required that can rapidly process large amounts of data. Since most of the generated data relates to normal system behavior and network communication, it is important to efficiently filter the noise, so that only relevant data points that include information of high value for intrusion detection remain.
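The noise-filtering step described above can be illustrated with a small streaming sketch. The event format and the set of "benign" event types are hypothetical assumptions for the example; a generator is used so that events are processed online, one at a time, rather than collected in bulk.

```python
# Sketch of online noise filtering on a log stream: routine events are
# dropped, and only potentially relevant events are passed on for deeper
# analysis. Event format and the benign set are illustrative assumptions.

from typing import Iterable, Iterator

BENIGN_EVENTS = {"heartbeat", "http_200", "login_ok"}

def filter_noise(stream: Iterable[dict]) -> Iterator[dict]:
    """Generator: drop routine events, yield potentially relevant ones."""
    for event in stream:
        if event["type"] not in BENIGN_EVENTS:
            yield event  # candidate for intrusion detection

log_stream = [
    {"type": "heartbeat", "host": "srv1"},
    {"type": "login_fail", "host": "srv1"},
    {"type": "http_200", "host": "web1"},
    {"type": "port_scan", "host": "web1"},
]
relevant = list(filter_noise(log_stream))
print([e["type"] for e in relevant])  # ['login_fail', 'port_scan']
```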
Intrusion Detection: Methods and Techniques
Like other security tools, IDS aim to achieve a higher level of security in ICT networks. Their primary goal is to timely detect cyberattacks, so that it is possible to react quickly and thereby reduce the attacker’s window of opportunity to cause damage. IDS are passive security mechanisms, which only detect attacks without automatically setting any countermeasures (Whitman and Mattord 2012).
In 1980, James Anderson was one of the first researchers to indicate the need for IDS and to contradict the common assumption of software developers and administrators that their computer systems run in “friendly,” i.e., secure, environments (James P Anderson 1980). Since the 1980s, the amount of data generated in ICT networks has increased tremendously, and intrusion detection has therefore become a big data problem. Furthermore, since Anderson published his report on IDS, a lot of research has been done in this area (Axelsson 2000; Sabahi and Movaghar 2008; Liao et al. 2013; Garcia-Teodoro et al. 2009).
SD (knowledge based) uses predefined signatures and patterns to detect attackers. This method is simple and effective for detecting known attacks. Its drawback is that it is ineffective against unknown attacks or unknown variants of an attack, which allows attackers to evade IDS based on SD. Furthermore, since the attack landscape is rapidly changing, it is difficult to keep the signatures up to date, and maintenance is thus time-consuming (Whitman and Mattord 2012).
SPA (specification based) uses predetermined profiles that define benign protocol activity. Occurring events are compared against these profiles to decide whether protocols are used in the correct way. IDS based on SPA track the state of network, transport, and application protocols. They use vendor-developed profiles and therefore rely on vendor support (Scarfone and Mell 2007).
AD (behavior based) approaches learn a baseline of normal system behavior, a so-called ground truth, against which all occurring events are compared to detect anomalous system behavior. In contrast to SD- and SPA-based IDS, AD-based approaches allow the detection of previously unknown attacks. A drawback of AD-based IDS is the usually high false-positive rate (Garcia-Teodoro et al. 2009; Chandola et al. 2009).
Modern sophisticated and tailored attacks, such as APTs, where attackers try to hide their presence as long as possible and therefore circumvent detection by signatures and blacklists and respect benign protocol activity, require flexible approaches that provide broader security. A solution is offered by white-listing approaches that make use of a known baseline of normal system/network behavior. Moreover, it is impossible to define a complete set of signatures, because during normal usage a network reaches only a small number of possible states, while the larger number remains unknown. Thus, while in the case of white-listing an incomplete baseline of normal system/network behavior causes a larger number of false alarms, i.e., more alarms that have to be further analyzed, an incomplete blacklist leads to false negatives, i.e., undetected attacks. Hence, flexible AD approaches are required that are able to detect novel and previously unknown attacks. While SD and SPA solutions are usually easier to deploy than AD, they depend on the support of vendors. Hence, they mostly cannot be applied to legacy systems and systems with small market shares, which are often poorly documented and not supported by vendors, but frequently used in industry networks, such as industrial control systems and the IoT.
Unsupervised: This method does not require any labeled data and learns to distinguish normal from malicious system behavior while deployed, without any prior training phase. It classifies any monitored data as normal or anomalous.
Semi-supervised: This method is applied when the training set only contains anomaly-free data and is therefore also called “one-class” classification.
Supervised: This method requires a fully labeled training set containing both normal and malicious data.
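The semi-supervised ("one-class") setting can be illustrated with a minimal sketch: a detector is fitted on anomaly-free observations only and later flags values that deviate strongly from the learned profile. The feature (requests per minute), the sample values, and the three-sigma threshold are illustrative assumptions, not a recommended configuration.

```python
# Minimal semi-supervised ("one-class") anomaly detector: the model is
# trained on anomaly-free data only and flags strong deviations from it.
# Feature, sample values, and threshold are illustrative assumptions.

from statistics import mean, stdev

class OneClassDetector:
    def fit(self, normal_samples):
        """Learn the profile of normal behavior from anomaly-free data."""
        self.mu = mean(normal_samples)
        self.sigma = stdev(normal_samples)
        return self

    def is_anomalous(self, value, k=3.0):
        """Flag values more than k standard deviations from the mean."""
        return abs(value - self.mu) > k * self.sigma

# Fit on requests-per-minute observed during attack-free operation.
detector = OneClassDetector().fit([98, 102, 100, 97, 103, 99, 101])
print(detector.is_anomalous(100))   # False: within the normal range
print(detector.is_anomalous(500))   # True: e.g., a flooding attack
```

The same structure underlies more capable one-class models: the training set defines the ground truth, and every sufficiently large deviation raises an alarm, which is also where the typically high false-positive rate of AD originates.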
Artificial neural networks (ANN): Input data activates neurons (nodes) of an artificial network, inspired by the human brain. The nodes of the first layer pass their output to the nodes of the next layer, until the output of the last layer of the artificial network classifies the monitored ICT networks’ current state (Cannady 1998).
Bayesian networks: Bayesian networks define graphical models that encode the probabilistic relationships between variables of interest and can predict consequences of actions (Heckerman et al. 1998).
Clustering: Clustering enables grouping of unlabeled data and is often applied to detect outliers (Xu and Wunsch 2005).
Decision trees: Decision trees have a treelike structure, which comprises paths that lead to a classification based on the values of different features (Safavian and Landgrebe 1991).
Hidden Markov models (HMM): A Markov chain connects states through transition probabilities. HMM aim at determining hidden (unobservable) parameters from observed parameters (Baum and Eagon 1967).
Support vector machines (SVM): SVM construct hyperplanes in a high- or infinite-dimensional space, which then can be used for classification and regression. Thus, similar to clustering, SVM can, for example, be applied for outlier detection (Steinwart and Christmann 2008).
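As a small illustration of the clustering-based outlier detection mentioned above, the following sketch assumes that cluster centroids have already been computed from benign traffic and flags points that lie far from every centroid. The feature vectors, centroid values, and radius are hypothetical.

```python
# Sketch of clustering-based outlier detection: points far away from
# every cluster centroid of benign behavior are treated as anomalies.
# Features (e.g., [packets/s, distinct ports contacted]), centroids,
# and the radius are illustrative assumptions.

import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to its closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def is_outlier(point, centroids, radius=15.0):
    """Flag points whose distance to every centroid exceeds the radius."""
    return nearest_centroid_distance(point, centroids) > radius

# Centroids as they might result from clustering benign traffic.
centroids = [(100.0, 3.0), (20.0, 1.0)]

print(is_outlier((105.0, 4.0), centroids))   # False: near a cluster
print(is_outlier((100.0, 60.0), centroids))  # True: e.g., a port scan
```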
Cyber Threat Intelligence
The concepts of data, information, and intelligence differ from one another in terms of volume and usability. While data is typically available in large volumes, and describes individual and unarguable facts, information is produced when a series of data points are combined to answer a simple question. Although information is a far more useful output than the raw data, it still does not directly inform a specific action. Intelligence takes this process a stage further by interrogating data and information to tell a story (e.g., a forecast) that can be used to inform decision-making. Crucially, intelligence never answers a simple question; rather, it paints a picture that can be used to help people answer much more complicated questions. Progressing along the path from data to information to intelligence, the quantity of outputs drops off dramatically, while the value of those outputs rises exponentially (Chairman of the Joint Chiefs of Staff 2013).
Modern cybersecurity incidents manifest themselves in different forms depending on the attack perpetrators’ goal, on the sophistication of the employed attack vectors, and on the information system the attack targets. Large amounts of data have to be analyzed while handling such incidents in order to derive meaningful information on the relations existing among the collected data units and eventually obtain cyber threat intelligence (CTI), with the purpose of designing suitable reaction strategies and mitigating the incident’s effects in a timely manner.
CTI: A Big Data Problem
Hundreds of thousands of new malware samples are created every day. Considering the large number of vulnerabilities and exploits revealed at a similar rate, there exists a large volume of data comprising information on cyber threats and attacks that can be analyzed to improve incident response, to facilitate the decision-making process, and to allow an effective and efficient configuration of cybersecurity mechanisms. Furthermore, the variety of the data is manifold: there exist, for example, many different standards describing vulnerabilities, threats, and incidents that follow different structures, as well as threat reports and fixes written in free-text form. Because of the large amount of data on cyberattacks generated daily, the data that can be analyzed to derive intelligence is created at a high velocity and should be processed as fast as possible. Since large parts of the data are only available as free text and computer networks are highly diverse, it is of prime importance to select and extract the information with the highest value.
From Raw Data to Cyber Threat Intelligence
Threat information may originate from a wide variety of internal and external data sources. Internal sources include security sensors (e.g., intrusion detection systems, antivirus scanners, malware scanners), logging data (from hosts, servers, and network equipment such as firewalls), tools (e.g., network diagnostics, forensics toolkits, vulnerability scanners), security management solutions (security information and event management systems (SIEMs), incident management ticketing systems), and additionally also personnel, who report suspicious behavior, social engineering attempts, and the like. Typical external sources (meaning external to an organization) may include sharing communities (open public or closed ones), governmental sources (such as national CERTs or national cybersecurity centers), sector peers and business partners (for instance, via sector-specific information sharing and analysis centers (ISACs)), vendor alerts and advisories, and commercial threat intelligence services.
Operating system, service, and application logs provide insights into deviations from normal operations within the organizational boundaries.
Router, Wi-Fi, and remote services logs provide insights into failed login attempts and potentially malicious scanning actions.
System and application configuration settings and states, often at least partly reflected by configuration management databases (CMDBs), help to identify weak spots due to not required but running services, weak account credentials, or wrong patch levels.
Firewall, IDS, and antivirus logs and alerts point to probable causes, however, often with high false-positive rates that need to be verified.
Web browser histories, cookies, and caches are viable resources for forensic actions after something happened, to discover the root cause of a problem (e.g., the initial drive-by download).
Security information and event management systems (SIEMs) already provide correlated insights across machines and systems.
E-mail histories are essential sources to learn about and eventually counter (spear) phishing attempts and followed links to malicious sites.
Help desk ticketing systems, incident management/tracking systems, and people provide insights into any suspicious events and actions reported by humans rather than software sensors.
Forensic toolkits and sandboxing are vital means to safely analyze the behavior of untrusted programs without exposing a real corporate environment to any threat.
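The SIEM-style correlation across sources mentioned above can be sketched minimally: repeated failed logins (e.g., from router or remote-service logs) followed by a successful login for the same account hint at a brute-force compromise. The event format, field names, and threshold are hypothetical assumptions for the example.

```python
# Sketch of SIEM-style correlation across internal log sources: repeated
# failed logins followed by a success for the same account suggest a
# brute-force compromise. Event format and threshold are illustrative.

from collections import defaultdict

def correlate_bruteforce(events, threshold=3):
    """Return accounts with >= threshold failures before a success."""
    failures = defaultdict(int)
    suspicious = []
    for event in events:  # events are assumed to be ordered by time
        if event["action"] == "login_fail":
            failures[event["user"]] += 1
        elif event["action"] == "login_ok":
            if failures[event["user"]] >= threshold:
                suspicious.append(event["user"])
            failures[event["user"]] = 0
    return suspicious

events = [
    {"source": "vpn",    "user": "alice", "action": "login_fail"},
    {"source": "vpn",    "user": "alice", "action": "login_fail"},
    {"source": "router", "user": "alice", "action": "login_fail"},
    {"source": "vpn",    "user": "alice", "action": "login_ok"},
    {"source": "vpn",    "user": "bob",   "action": "login_ok"},
]
print(correlate_bruteforce(events))  # ['alice']
```

Note that the signal only emerges when events from several sources (here, VPN and router logs) are merged into one ordered stream, which is precisely the value a SIEM adds over per-device logs.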
Indicators: are technical artifacts or observables that suggest an attack is imminent or is currently underway or that a compromise may have already occurred. Examples are IP addresses, domain names, file names and sizes, process names, hashes of file contents and process memory dumps, service names, and altered configuration parameters.
Tactics, techniques, and procedures (TTPs): characterize the behavior of an actor. A tactic is the highest-level description of this behavior, while techniques give a more detailed description of behavior in the context of a tactic, and procedures are an even lower-level, highly detailed description in the context of a technique. Some typical examples include the usage of spear phishing emails, social engineering techniques, websites for drive-by attacks, exploitation of operating systems and/or application vulnerabilities, the intentional distribution of manipulated USB sticks, and various obfuscation techniques. From these TTPs, organizations are able to learn how malicious attackers work and derive higher-level and generally valid detection, as well as remediation techniques, compared to quite specific measures based on just, often temporarily valid, indicators.
Threat actors: contain information regarding the individual or group posing a threat. For example, information may include the affiliation (such as a hacker collective or a nation-state’s secret service), identity, motivation, relationships to other threat actors, and even their capabilities (via links to TTPs). This information is used to better understand why a system might be attacked and to work out more targeted and effective countermeasures. Furthermore, this type of information can be applied to collect evidence of an attack to be used in court.
Vulnerabilities: are software flaws that can be used by a threat actor to gain access to a system or network. Vulnerability information may include its potential impact, technical details, its exploitability, the availability of an exploit, affected systems, platforms, and versions, as well as mitigation strategies. A common schema to rate the seriousness of a vulnerability is the common vulnerability scoring system (CVSS) (Scarfone and Mell 2009), which considers the enumerated details to derive a comparable metric. There are numerous web platforms that maintain lists of vulnerabilities, such as the common vulnerabilities and exposures (CVE) database from MITRE (https://cve.mitre.org/) and the national vulnerability database (NVD) (https://nvd.nist.gov/). It is important to note that the impact of vulnerabilities usually needs to be interpreted for each organization (and even each system) individually, depending on the criticality of the affected systems for the main business processes.
Cybersecurity best practices: include commonly used cybersecurity methods that have demonstrated effectiveness in addressing classes of cyber threats. Some examples are response actions (e.g., patch, configuration change), recovery operations, detection strategies, and protective measures. National authorities, CERTs, and large industries frequently publish best practices to help organizations build up an effective cyber defense and rely on proven plans and measures.
Courses of action (CoAs): are recommended actions that help to reduce the impact of a threat. In contrast to best practices, CoAs are very specific and shaped to a particular cyber issue. Usually CoAs span the whole incident response cycle, starting with detection (e.g., add or modify an IDS signature), followed by containment (e.g., block network traffic to the command and control server), recovery (e.g., restore the base system image), and protection from similar events in the future (e.g., implement multifactor authentication).
Tools and analysis techniques: this category is closely related to best practices but focuses more on tools than on procedures. Within a community, it is desirable to align the tools used with each other to increase compatibility, which makes it easier to import/export certain types of data (e.g., IDS rules). Usually there are sets of recommended tools (e.g., log extraction/parsing/analysis, editors), useful tool configurations (e.g., capture filters for a network protocol analyzer), signatures (e.g., custom or tuned signatures), extensions (e.g., connectors or modules), code (e.g., algorithms, analysis libraries), and visualization techniques.
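As a minimal illustration of how indicator-type CTI from the categories above can be made actionable, the following sketch matches a shared indicator list against locally observed artifacts. The indicator format is a hypothetical simplification inspired by structured CTI formats such as STIX; it is not the actual OASIS STIX schema, and all values are made up (the IP addresses come from documentation ranges).

```python
# Matching shared indicators of compromise (IoCs) against local
# observations. The indicator format is a hypothetical simplification
# of structured CTI formats such as STIX, not the real schema.

indicators = [
    {"type": "ipv4", "value": "203.0.113.66", "label": "C2 server"},
    {"type": "file_hash", "value": "9f86d081884c7d65", "label": "dropper"},
]

observations = [
    {"type": "ipv4", "value": "198.51.100.7"},
    {"type": "ipv4", "value": "203.0.113.66"},
    {"type": "file_hash", "value": "deadbeefdeadbeef"},
]

def match_indicators(observations, indicators):
    """Return (observed value, indicator label) pairs for every IoC hit."""
    index = {(i["type"], i["value"]): i["label"] for i in indicators}
    hits = []
    for obs in observations:
        label = index.get((obs["type"], obs["value"]))
        if label is not None:
            hits.append((obs["value"], label))
    return hits

print(match_indicators(observations, indicators))
# [('203.0.113.66', 'C2 server')]
```

Such indicator matching is fast but fragile, since attackers can cheaply change IP addresses and file hashes; TTP-level intelligence, as discussed above, is harder for attackers to evade.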
Interpreting, contextualizing, and correlating information from different categories allow the comprehension of the current security situation and therefore provide threat intelligence (Skopik 2017). Independently of the type, there are commonly desired characteristics that make cyber threat intelligence applicable. Specifically, CTI should be obtained timely, to allow security operation centers to act promptly; it should be relevant in order to be applicable to the pertaining operational environment; it should be accurate (i.e., correct, complete, and unambiguous) and specific, providing a sufficient level of detail and context; finally, CTI is required to be actionable, i.e., to provide or suggest effective courses of action.
Future Directions for Research
The following section discusses future directions for research in the fields of intrusion detection and CTI, as well as how both might profit from each other. In the field of intrusion detection, a lot of research and development has already been done on signature-based IDS with a focus on black-listing. However, today’s evolving interconnection and the associated dependence on digital devices give rise to sophisticated and tailored attacks, carried out by smart attackers who flexibly adjust their behavior to the circumstances they face and who therefore cannot be revealed and stopped by signature-based IDS and black-listing approaches, which are only capable of revealing known attack vectors. Thus, future research in cybersecurity related to big data tends toward anomaly-based intrusion detection and white-listing. The main challenge here will be reducing the usually high number of false positives, which create great effort for security analysts. Additionally, because networks grow fast and hardware devices and software versions change quickly, smart self-learning algorithms are required so that IDS can adapt themselves to new situations without learning malicious system/network behavior, while keeping the configuration effort to a minimum.
Finally, to reduce the effort of dealing with the large and quickly evolving amounts of data concerning cybersecurity, novel smart big data approaches will be required that make use of collected CTI to support and steer the configuration and deployment of cybersecurity tools and to automate as many tasks as possible, since the amount of generated data and the number of cyber threats and attacks keep growing continuously.
- Axelsson S (2000) Intrusion detection systems: a survey and taxonomy. Technical report
- Bianco D (2014) The pyramid of pain. http://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html
- Cannady J (1998) Artificial neural networks for misuse detection. In: National information systems security conference, pp 368–381
- Chairman of the Joint Chiefs of Staff (2013) Joint publication 2-0: joint intelligence. Technical report. http://www.dtic.mil/doctrine/new_pubs/jp2_0.pdf
- Dainotti A, Pescapé A, Ventre G (2007) Worm traffic analysis and characterization. In: IEEE international conference on communications, ICC’07. IEEE, pp 1435–1442
- James P Anderson (1980) Computer security threat monitoring and surveillance. Technical report 17. James P. Anderson Company, Fort Washington
- Kim J, Radhakrishnan S, Dhall SK (2004) Measurement and analysis of worm propagation on Internet network topology. In: Proceedings of the 13th international conference on computer communications and networks, ICCCN 2004. IEEE, pp 495–500
- NIST (2016) Guide to cyber threat information sharing. NIST special publication 800-150. Technical report
- OASIS (2017) Structured threat information expression v2.0. https://oasis-open.github.io/cti-documentation/
- Sabahi F, Movaghar A (2008) Intrusion detection: a survey. In: 3rd international conference on systems and networks communications, ICSNC’08. IEEE, pp 23–26
- Scarfone K, Mell P (2007) Guide to intrusion detection and prevention systems (IDPS). NIST special publication 800-94. Department of Commerce, National Institute of Standards and Technology, Gaithersburg
- Scarfone K, Mell P (2009) An analysis of CVSS version 2 vulnerability scoring. In: Proceedings of the 2009 3rd international symposium on empirical software engineering and measurement. IEEE Computer Society, pp 516–525
- Skopik F (2017) Collaborative cyber threat intelligence: detecting and responding to advanced cyber attacks at the national level. CRC Press, Boca Raton
- Vacca JR (2013) Managing information security. Elsevier, Amsterdam/Boston/Heidelberg
- Whitman ME, Mattord HJ (2012) Principles of information security, 4th edn. Cengage Learning, Stamford
- Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Cambridge