In the search for research fields that can shed light on our issue of checking a piece of equipment for unwanted functionality, static malware detection stands out as the most obvious candidate. Malware detection is as old as malware itself and its main goal is to discover if maliciously behaving code has been introduced into an otherwise clean system by a third party. In this chapter, we consider techniques that are static, in the sense that they are based on investigating the code rather than a running system. We will return to dynamic methods in a later chapter.

Our point of view is somewhat different from that of classic malware detection, in that we do not assume that we have a clean system to begin with. Our intention is to shed light on the implications of this difference to understand the extent to which the successes of static malware detection can transfer to our case.

7.1 Malware Classes

The most frequently cited definition of malware is software that fulfils the deliberately harmful intent of an attacker. This definition was presented by Moser et al. [11]. The term malware is usually understood as a contraction of the two words malicious software. As we shall see in a later section, this definition is overly restrictive, since some types of malware threaten hardware development. A more general definition, and one more relevant to our topic, is that malware is malicious code, regardless of whether this code defines hardware or is to be run as software. We therefore define malware as ‘code that deliberately fulfils the harmful intent of an attacker’.

An abundance of taxonomies of malware are to be found in the literature, but they generally agree on the most important terms. We do not intend to give a complete overview here; rather, we concentrate on the notions that are most relevant to our topic. A more complete set of definitions can be found in [17]. Here, we proceed with the three notions that define different ways that malware can spread and reside on a machine.

 

Virus:

A computer virus shares the property of its biological counterpart that it cannot live on its own. Rather, it is a piece of code that inserts itself into an existing program and is executed whenever the host program is executed. Computer viruses spread by inserting themselves into other executables. The initial infection of the system can be accomplished through a program that only needs to be run once. The infecting program could, for example, reside on a memory stick.

Worm:

A worm is a complete program in its own right and can execute independently of any other program. Its primary distinction from a virus is that it does not need a host program. This also means that its strategies for spreading will be different, since it does not need to alter existing executables to spread. It can spread through a network by exploiting vulnerabilities in operating systems.

Trojan:

While viruses and worms spread in stealth mode, a Trojan horse is malware embedded into a seemingly innocent application that is explicitly and knowingly downloaded and run by the user. This application can be a screensaver, a small widget that displays the local weather, or a file received as a seemingly harmless e-mail attachment. Infections embedded in malicious webpages are also categorized as Trojans.

 

Although the above categorization gives the impression that an attack falls into exactly one of these categories, this is not generally true. A sophisticated operation could take advantage of all three strategies above.

Orthogonal to the infection methods above is a set of notions related to what the malware is trying to achieve.

 

Spyware:

The task of spyware is to collect sensitive information from the system it resides on and transfer this information to the attacker. The information can be gathered by logging keystrokes on a keyboard, analysing the contents of documents on the system, or analysing the system itself in preparation for future attacks.

Ransomware:

As the name suggests, this is malware that puts the attacker in a position to require a ransom from the owner of the system. The most frequent way to do this is by rendering the system useless through encrypting vital information and requiring compensation for making it available again.

Bot:

A bot is a piece of software that gives the attacker—or botmaster—the ability to remotely control a system. Usually a botmaster has infected a large number of systems and has a set of machines—a botnet—under his or her control. Botnets are typically used to perform attacks on other computers or to send out spam emails.

Rootkit:

A rootkit is a set of techniques used to mask the presence of malware on a computer, usually through privileged access (root or administrator access) to the system. Rootkits are not bad per se, but they are central parts of most sophisticated attacks. They are also typically hard to detect and remove, since they can subvert any anti-malware program trying to detect them.

 

This list of actions that could be performed by malware covers the most frequent motivations for infecting a system. Still, we emphasize that the list is not exhaustive. Other motivations not only are conceivable but also have inspired some of the most spectacular digital attacks known to date. The most widely known of these is Stuxnet, whose prime motivation was to cause physical harm to centrifuges used in the enrichment of uranium in Iran [10]. Another example is Flame [3], which can misuse the microphone and camera of an infected device to record audio and video from the room where the infected system is physically located.

7.2 Signatures and Static Code Analysis

Checking for malicious intent in program code is usually done through signatures. In its simplest and earliest form, a signature is a sequence of assembly instructions that is known to perform a malicious act. Two decades of arms race between makers and detectors of malware have led to the development of malware that is hard to detect and advanced static signatures with complex structures. The utilization of such signatures is, in principle, quite straightforward: we need a repository of known sequences of instructions sampled from all known malware. Checking code against this repository, a malware detection system would be able to raise the alarm when a matching sequence is found.
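
The matching step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any real product: the repository `SIGNATURES`, the signature names, and the byte sequences are all invented for the example and match no known malware.

```python
from typing import Dict, List, Tuple

# Hypothetical signature repository: name -> byte sequence.
# The byte strings are made up for illustration only.
SIGNATURES: Dict[str, bytes] = {
    "toy-sample-a": bytes.fromhex("deadbeef0102"),
    "toy-sample-b": bytes.fromhex("cafebabe9090"),
}


def scan(code: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[Tuple[str, int]]:
    """Return (signature name, offset) for every signature occurrence in code."""
    hits: List[Tuple[str, int]] = []
    for name, pattern in signatures.items():
        offset = code.find(pattern)
        while offset != -1:
            hits.append((name, offset))
            # Continue searching after the current match position.
            offset = code.find(pattern, offset + 1)
    return hits


# A clean buffer and one that embeds the first toy signature at offset 4.
clean = bytes(16)
infected = bytes(4) + bytes.fromhex("deadbeef0102") + bytes(4)

print(scan(clean))
print(scan(infected))
```

Real scanners use far more efficient multi-pattern matching and richer signature formats, but the principle of comparing code against a repository of known sequences is the same.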

There are basically three challenges to finding malware this way. First, the signature has to be generated and this is usually done manually [8]. Second, before the signature can be generated, the malware must have been analysed. This will not happen until its existence is known. There are examples of malware that were active for several years before they were found [19]. Finally, the repository of signatures is ever growing and new signatures have to be distributed continuously.

These challenges notwithstanding, the detection of malware through static signatures has historically been one of the most successful countermeasures against malware infections. The arms race between detectors and developers of malware is, however, still ongoing and, in the upcoming sections, we give an overview of how the race has played out.

7.3 Encrypted and Oligomorphic Malware

The response of malware developers to signatures was quite predictable. The developers needed to make malware that had the same functionality as malware for which a signature existed but where the signature itself would not produce a match. This was important for them for two reasons. First, in writing new malware, it is important that it is not caught by existing signatures. Second, as malware spreads and infects more and more machines, its developers would like it to automatically develop into different strands. This way, whenever new signatures can fight some instances of the malware, there are others that are immune.

An early attempt at making a virus develop into different versions as it spread involved encrypting the part of the code that performed the malicious actions. Using different encryption keys, the virus could morph into seemingly unrelated versions with each new generation. For this to work, the virus had to consist of two parts, one part being a decryptor that decrypts the active parts of the malware and the other the malicious code itself. Although this made the static analysis of the actions of the malware somewhat harder, finding the malware by using signatures was not made any more difficult. The decryption loop itself could not be encrypted and it turned out that finding a signature that matched a known decryption loop was no more difficult than finding a signature for a non-evolving virus.
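
The following toy sketch illustrates why encryption alone did not help against signatures. It assumes a single-byte XOR cipher and uses a harmless placeholder string in place of any real code; `DECRYPTOR_STUB`, `BODY`, and the layout of a generation are all invented for the example.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key (encryption and decryption are identical)."""
    return bytes(b ^ key for b in data)


# Harmless placeholder standing in for the encrypted part of the program.
BODY = b"HARMLESS-PLACEHOLDER-BODY"
# Stand-in for the bytes of the decryption loop, which must stay in the clear.
DECRYPTOR_STUB = b"\x10\x20\x30\x40"


def make_generation(key: int) -> bytes:
    """Assumed layout: decryptor stub, then the key, then the encrypted body."""
    return DECRYPTOR_STUB + bytes([key]) + xor_bytes(BODY, key)


def decrypt(generation: bytes) -> bytes:
    """Recover the body using the key stored right after the stub."""
    key = generation[len(DECRYPTOR_STUB)]
    return xor_bytes(generation[len(DECRYPTOR_STUB) + 1:], key)


gen1 = make_generation(0x21)
gen2 = make_generation(0x5A)

# A signature taken from the plain body no longer matches either generation...
body_signature = BODY[:8]
print(body_signature in gen1, body_signature in gen2)
# ...but a signature for the unencrypted decryptor matches both.
print(DECRYPTOR_STUB in gen1, DECRYPTOR_STUB in gen2)
```

The two generations differ in their encrypted parts, yet the constant stub gives the detector a stable target, which is the weakness that oligomorphic and polymorphic designs later tried to remove.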

A second approach was to embed several versions of the decryption loop into the encrypted part of the malware. For each new generation of the virus, an arbitrary decryption loop is chosen so that one single signature will not be able to detect all generations of the malware. Viruses that use this concealment strategy are called oligomorphic [15] and they present a somewhat greater challenge for virus analysers, which will have to develop signatures for each version of the decryption loop. Still, for virus detection software, only the analysis time is increased. Oligomorphic viruses are therefore currently considered tractable.

7.4 Obfuscation Techniques

From the point of view of a malware developer, one would want to overcome oligomorphic viruses’ weakness of using only a limited number of different decryption loops. The natural next step in the evolution of viruses was to find ways to make the code develop into an unlimited number of different versions.

In searching for ways to do this, malware developers had strong allies. Parts of the software industry had for some time already been developing ways to make code hard to reverse engineer, so that they could better protect their intellectual property. Rewriting code to have the same functionality but with a vastly different appearance was therefore researched in full openness. Some of the methods developed naturally found their way into malware development. Many techniques could be mentioned [20], but here we only consider the most common ones.

The most obvious thing to do when a signature contains a sequence of instructions to be performed one after the other is to insert extra insignificant code. This obfuscation method is called dead code insertion and consists of arbitrarily introducing instructions that do not alter the result of the program’s execution. There are several ways of doing this. One can, for instance, insert instructions that do nothing at all (so-called no-operations), which are present in the instruction sets of most processors. Another method is to insert two or more operations that cancel each other out. An example of the latter is two instructions that push and pop the same variable on a stack. Another obfuscation technique is to exchange the usage of variables or registers between instances of the same malware. The semantics of the malware would be the same, but a signature that detects one instance will not necessarily detect the other.

More advanced methods will make more profound changes to the code. A key observation is that, in many situations, multiple instructions will have the same effect. An example is when you want to initialize a register to zero: you could do so by explicitly assigning the value zero to it or by XOR-ing it with itself. In addition, one can also alter the malware by scattering code around and maintaining the control flow through jump instructions.
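
To make the simplest of these transformations concrete, the sketch below normalizes a toy assembly listing so that two differently obfuscated variants compare as equal. The instruction syntax and the function `normalise` are assumptions made for the example. The sketch also ignores details that a real tool could not, such as the fact that on x86 an XOR changes processor flags whereas a plain assignment does not.

```python
import re
from typing import List


def normalise(instructions: List[str]) -> List[str]:
    """Undo three simple obfuscations in a toy instruction list."""
    out: List[str] = []
    for raw in instructions:
        ins = " ".join(raw.lower().split())  # canonical case and spacing
        if ins == "nop":
            continue  # dead code: no-operation
        # Dead code: "pop X" directly after "push X" cancels the pair.
        if ins.startswith("pop ") and out and out[-1] == "push " + ins[4:]:
            out.pop()
            continue
        # Equivalent instructions: "xor r, r" is rewritten as "mov r, 0".
        match = re.fullmatch(r"xor (\w+), (\w+)", ins)
        if match and match.group(1) == match.group(2):
            ins = f"mov {match.group(1)}, 0"
        out.append(ins)
    return out


variant_a = ["mov eax, 0", "add eax, 5", "ret"]
variant_b = ["nop", "xor eax, eax", "push ebx", "pop ebx",
             "add eax, 5", "nop", "ret"]

print(normalise(variant_a) == normalise(variant_b))
```

Register renaming and code scattering would need further normalization passes, and each new obfuscation trick requires a matching rule, which hints at why this approach scales poorly.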

The most advanced obfuscation techniques are the so-called virtualization obfuscators [16]. Malware using this technique expresses its malicious actions in a randomly chosen programming language. The malware contains an interpreter for this language and thus performs the malicious acts through the interpreter.
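
The idea can be illustrated with a deliberately tiny and harmless interpreter. Everything in the sketch is an assumption made for illustration: the four-instruction stack language, the function `run`, and the two opcode numberings. The point is that the same computation, here (2 + 3) * 4, is encoded as two unrelated number sequences simply because the mapping from numbers to operations differs.

```python
from typing import Dict, List


def run(bytecode: List[int], opcode_map: Dict[int, str]) -> int:
    """Interpret a toy stack language; opcode_map gives this variant's encoding."""
    stack: List[int] = []
    pc = 0
    while pc < len(bytecode):
        op = opcode_map[bytecode[pc]]
        pc += 1
        if op == "push":
            stack.append(bytecode[pc])  # the operand follows the opcode
            pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "halt":
            break
    return stack[-1]


# Two randomly chosen encodings of the same language.
map_a = {0: "push", 1: "add", 2: "mul", 3: "halt"}
map_b = {7: "push", 3: "add", 9: "mul", 0: "halt"}

# The same program, (2 + 3) * 4, under each encoding.
program_a = [0, 2, 0, 3, 1, 0, 4, 2, 3]
program_b = [7, 2, 7, 3, 3, 7, 4, 9, 0]

print(run(program_a, map_a), run(program_b, map_b))
```

An analyst who sees only `program_b` learns nothing until the interpreter and its encoding have been reverse engineered, and a real virtualization obfuscator can vary far more than the opcode numbers.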

In parallel with the development of obfuscation techniques, we have seen an abundance of suggestions for deobfuscators. These are tasked with transforming the obfuscated code into a representation that is recognizable to either humans or a malware detector equipped with a signature. For some of the obfuscation techniques above, deobfuscators are easy to create and efficient to use. The successes of these techniques unfortunately diminish when obfuscators replace instructions with semantically identical instructions where the semantic identity is dependent on the actual program state or when the control flow of the program is manipulated with conditional branches that are also dependent on the program state. This should, however, not come as a surprise. We learned in Chap. 5 that whether two programs are behaviourally identical is undecidable. Perfect deobfuscators are therefore impossible to design.

The hardest challenge in deobfuscation is to extract the meaning of code that has been through virtualization obfuscation. The first step in doing this would have to be to reverse engineer the virtual machine, to get hold of the programming language that was used in the writing of the malicious code. The complexity of this task becomes clear when we consider the following two facts. First, the virtual machine may itself have been obfuscated through any or all of the mechanisms mentioned above. Second, many different programming paradigms have strength of expression equal to that of a Turing machine. Logic programming, functional programming, and imperative programming are all considered in Sect. 9.3—but, in addition, we have algebraic programming [6] and Petri nets [13], to mention two of the more important. All of these paradigms can be implemented in a programming language in many different ways. Analysing the virtual machine itself is a task that can be made arbitrarily complex and the analysis must be completed before one can start analysing the operational part of the malware. This is a clear indication that we have a long way to go before the static analysis of programming code can help us against a malicious equipment vendor.

7.5 Polymorphic and Metamorphic Malware

Given the weakness of oligomorphic malware and the obfuscation techniques described above, the next step in the development of advanced viruses should be obvious. A polymorphic virus is an encrypted virus that uses obfuscation techniques to generate an unlimited number of versions of its decryption loop. A well-designed polymorphic virus can thus not be fought by finding signatures for the decryptor. These viruses are therefore fought through deep analysis of one version of the decryptor so that the decryption key can be extracted. Thereafter, the body of the virus is decrypted and matched with an ordinary signature. Although polymorphic viruses require a great deal of human effort in their analysis, their automatic detection need not be too computationally heavy once analysed.

Metamorphic viruses are the most challenging. They are not necessarily based on encryption and, instead, use obfuscation techniques throughout the entire body of the virus. This means that each new copy of the virus may have a different code sequence, structure, and length and may use a different part of the instruction set. Since obfuscation techniques have to be executed automatically from one generation of the virus to the next, a metamorphic virus must carry out the following sequence of operations to mutate successfully:

  1. Identify its own location in storage media.
  2. Disassemble itself to prepare for analysis of the code.
  3. Analyse its own code, with little generic information passed along, since this information could be used in signature matching.
  4. Use obfuscation techniques to transform its own code based on the analysis above.
  5. Assemble the transformed code to create an executable for the new generation.

Efficient static methods for fighting metamorphic viruses have yet to be developed [14]. The fact that no two versions of them need share any syntactic similarities makes the task hard and it is made even harder by the fact that some of the viruses morph into different versions every time they run, even on the same computer.

7.6 Heuristic Approaches

Looking for malicious code through signatures has the obvious drawback that, for a signature to exist, the malicious code has to be analysed in advance [2]. This also means that the malware has to be known in advance. In the problem we are studying, this is rarely the case. If the malware were already known, we would know it had been inserted; thus, we would already know that the vendor in question was not to be trusted. We need to search for unknown code with malicious functionality and we therefore need to approach malware detection differently. Heuristic malware detection tries to do so by identifying features of the code whose occurrence can be expected to differ depending on whether the code is malicious or benign. The code under investigation is analysed for these features and a classification algorithm is used to classify it as either malicious or benign.

The first classes of features that were considered were N-grams [1]. An N-gram is a code sequence of length N, where N is a given number. Although an N-gram, at first glance, looks exactly like a very simple signature, there are crucial differences. First, N is often a very low number, so the N-gram is very short in comparison with a signature. Second, unlike for signatures, we are not interested in the mere question of whether there is a match or not; rather, we are interested in how many matches there are. Heuristic methods based on N-grams extract a profile of how a set of N-grams occurs in the code under investigation. This profile is classified as either benign or malicious by a classifier. The complexity of classifiers varies greatly, from simple counts of the occurrence of features to advanced machine learning techniques.
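
Extracting such a profile is straightforward, as the following sketch shows. It assumes byte-level N-grams and a hand-picked feature list; the function names `ngram_profile` and `relative_frequencies` are our own, and the classifier that would consume the resulting feature vector is left out.

```python
from collections import Counter
from typing import List


def ngram_profile(code: bytes, n: int = 2) -> Counter:
    """Count every overlapping byte sequence of length n in code."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def relative_frequencies(profile: Counter, features: List[bytes]) -> List[float]:
    """Turn a profile into a fixed-length feature vector for a classifier."""
    total = sum(profile.values())
    if total == 0:
        return [0.0 for _ in features]
    return [profile[f] / total for f in features]


profile = ngram_profile(b"abab", n=2)
print(profile)
print(relative_frequencies(profile, [b"ab", b"zz"]))
```

Unlike a signature match, the output is not a yes or no answer but a vector of frequencies, and it is this vector that is classified as benign or malicious.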

Other heuristic approaches use so-called opcodes instead of N-grams [4]. An opcode is the part of an assembly instruction that identifies the operation itself but without the part that identifies the data on which it operates. The techniques that can be used for classifiers are more or less the same as those used for N-grams.
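
As a small illustration, the sketch below pulls the opcodes out of a toy assembly listing and counts them. The listing syntax (labels ending in a colon, comments starting with a semicolon) and the function `opcodes` are assumptions made for the example; real disassembler output would need a more careful parser.

```python
from collections import Counter
from typing import List


def opcodes(listing: str) -> List[str]:
    """Return the operation part of each instruction, dropping the operands."""
    ops: List[str] = []
    for line in listing.splitlines():
        line = line.split(";", 1)[0].strip()  # remove comments
        if not line or line.endswith(":"):    # skip blank lines and labels
            continue
        ops.append(line.split()[0].lower())
    return ops


LISTING = """
start:
    mov eax, 1      ; load a constant
    add eax, ebx
    MOV ecx, eax
    ret
"""

sequence = opcodes(LISTING)
print(sequence)
print(Counter(sequence))
```

Because the operands are discarded, renaming registers between instances leaves the opcode sequence unchanged, which is one reason this feature is attractive.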

A final class of features worth mentioning is that based on control flow graphs [5]. A control flow graph in its simplest form is a directed graph whose nodes represent the statements of the program and the edges the flow of program control. From this graph, several features can be extracted, such as nodes, edges, subgraphs, and simplified subgraphs with collapsed nodes.
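
A minimal sketch of the construction follows, using one node per statement as in the simple form described above. The toy instruction set (`jmp` for an unconditional jump, `jz` for a conditional one, `ret` to end execution) and the function `build_cfg` are assumptions made for the example.

```python
from typing import List, Set, Tuple


def build_cfg(program: List[str]) -> Set[Tuple[int, int]]:
    """Return the edges (from index, to index) of a statement-level flow graph."""
    # Map each label name to the index of the line that defines it.
    labels = {ins[:-1]: i for i, ins in enumerate(program) if ins.endswith(":")}
    edges: Set[Tuple[int, int]] = set()
    for i, ins in enumerate(program):
        parts = ins.split()
        op = parts[0]
        if op == "ret":
            continue  # no successor
        if op in ("jmp", "jz"):
            edges.add((i, labels[parts[1]]))  # edge to the jump target
        if op != "jmp" and i + 1 < len(program):
            edges.add((i, i + 1))  # fall through to the next statement
    return edges


PROGRAM = [
    "mov ecx, 3",  # 0
    "loop:",       # 1
    "dec ecx",     # 2
    "jz done",     # 3
    "jmp loop",    # 4
    "done:",       # 5
    "ret",         # 6
]

cfg = build_cfg(PROGRAM)
# Two simple features: edge count and number of backward (loop) edges.
back_edges = {(src, dst) for (src, dst) in cfg if dst <= src}
print(len(cfg), back_edges)
```

Features such as the number of nodes, edges, and loops can then be fed to the same kinds of classifiers as the N-gram profiles.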

Heuristic approaches have had some significant success. Still, static versions of these have one major limitation when applied to our problem: since we can assume that a dishonest equipment vendor is well aware of the state of the art in heuristic analysis, we can also assume that the vendor has made an effort to develop code that will be wrongly classified. Given the flexibility of the code obfuscation techniques described above, this is unfortunately not very difficult to do [12]. For this reason, present research and commercial anti-malware products favour dynamic heuristics [7]. We return to this topic in the next chapter.

7.7 Malicious Hardware

The intense study of malicious software that has taken place over several decades has been mirrored in the hardware domain only to a limited extent. For a long time, this situation was a reasonable reflection of the state of threats. The development and manufacture of hardware components were assumed to be completely controlled by one company and it was not suspected that any development team would deliberately insert unwanted functionality in the chips.

Both of these assumptions have now become irrelevant. Indeed, one topic of this book is exactly that of hardware vendors inserting unwanted functionality. Furthermore, the process of developing integrated circuits now involves many development teams from different companies. Putting together a reasonably advanced application-specific integrated circuit (ASIC) now largely consists of exactly that: putting together blocks of logic from different suppliers. These blocks can be simple microprocessors, microcontrollers, digital signal processors, or network processors. Furthermore, as we saw in Chap. 4, Trojans can be inserted through the design tools and in the fabrication as well [18].

Static analysis of ASICs is conducted in industry for a variety of reasons. The state of the art in the field is discussed in Sect. 6.7 to the extent that it is relevant to our discussions. In addition to full static analysis of the chip, several approaches require the execution of hardware functionality. For these methods, we refer the reader to Sect. 8.7.

7.8 Specification-Based Techniques

The most intuitively appealing approach to detecting malware inserted by an equipment vendor is to start with a specification of what the system should do. Thereafter, one analyses whether the system does only this or if it does something else in addition. This is very close to the approach taken by specification-based malware detection. Specification-based malware detection comprises a learning phase, where a set of rules defining valid behaviour is obtained. The code is then examined to assess if it does only what is specified.
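
Reduced to its core, the final check is a comparison between what the code can do and what the specification allows. The sketch below assumes that some analysis has already extracted the actions a piece of code may perform as a list of strings; the action names, `SPECIFICATION`, and the function `unspecified_behaviour` are all invented for the example.

```python
from typing import Iterable, List, Set

# Hypothetical specification: the complete set of actions the component may perform.
SPECIFICATION: Set[str] = {"read:config", "write:log", "connect:update-server"}


def unspecified_behaviour(extracted: Iterable[str],
                          specification: Set[str] = SPECIFICATION) -> List[str]:
    """Return every extracted action that the specification does not allow."""
    return sorted(set(extracted) - specification)


# Assumed output of an analysis of the code under investigation.
extracted_actions = [
    "read:config",
    "write:log",
    "connect:update-server",
    "connect:unknown-host",
]

print(unspecified_behaviour(extracted_actions))
```

The comparison itself is trivial. The difficulty lies in the two inputs: obtaining a complete specification and reliably extracting everything the code is able to do.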

The main limitation of specification-based techniques is that a complete and accurate specification of all valid behaviours of a system is extremely work intensive to develop, even for moderately complex systems [9]. The number of results in this area is therefore limited.

7.9 Discussion

The static detection of malware has had many success stories. In particular, early virus detection software was based almost exclusively on static detection. As the arms race between malware writers and malware detectors has progressed, we have unfortunately reached a situation in which static detection is no longer effective on its own. Obfuscation techniques have significantly reduced the value of signatures and static heuristic approaches have not been able to close this gap.

The problem becomes even worse when we focus on dishonest equipment vendors rather than third-party attackers. All static methods require a baseline of non-infected systems for comparison. The whole idea behind signature-based malware detection is that it detects a previously known and analysed piece of malware and this malware is not present in non-infected systems. If you want to check whether a vendor inserted malware into a system before you buy it, the malware will not be known and analysed, and there will not be a non-infected system for comparison. This means that the analysis will have to encompass the entire system. We return to a discussion of the tractability of this task in Sect. 10.10. Heuristic methods will suffer from the same shortcoming: there is no malware-free baseline with which heuristic methods can train their classifier.

Even after having painted this bleak picture, there is still hope in the further development of static approaches. We have argued that full deobfuscation is a very hard and often impossible task. Still, it is possible to detect the existence of obfuscated code to some extent. One approach is therefore to agree with the vendor that the code used in your equipment will never be obfuscated. The problem with this is that obfuscation is used for many benign purposes as well. In particular, it is used for the protection of intellectual property. The balance between the benefits and drawbacks of obfuscation in the formation of trust between customers and vendors of ICT equipment needs further investigation before one can conclude whether banning obfuscation is a feasible way forward.

What appears to be the most promising way forward for static approaches is a combination of specification-based techniques and proof-carrying code, which we will elaborate upon further in Sect. 9.8. Specification-based techniques have not been subject to the same amount of attention as the other techniques. Still, for our problem, they have one big advantage over the other methods: they do not require the existence of a clean system and they do not require the malware to have been identified and analysed beforehand. Proof-carrying code has the drawback of being costly to produce. Still, efforts in this area so far have been to provide proof that the code is correct. Our purpose will be somewhat different, in that we want to make sure that the code does not contain unwanted security-related functionality. Although this is not likely to make all problems go away, the combination of controlling the use of obfuscation, applying specification-based techniques, and requiring proof-carrying code on critical components has the potential to reduce the degrees of freedom for a supposedly dishonest equipment vendor.

In recent years, malware detection has been based on a combination of static methods such as those discussed in this chapter and dynamic methods based on observing the actions of executing code. Such dynamic methods are discussed in the next chapter.