The ability to reverse engineer a product has been important for as long as technology has existed. A vital activity in most branches of industrial design and production has been to acquire samples of the products sold by competing companies and pick them apart. Understanding the engineering done by your competitors can give insight into the strengths and weaknesses of their products, reveal the engineering ideas behind their products’ features, and stimulate the innovation that goes on in one’s own company.

Within information and communications technology (ICT), reverse engineering has played a lesser role than it has in traditional industries. The main reason for this is that reverse engineering a product for the sake of copying it as your own is often considered too costly and time-consuming to be worth the effort. Still, reverse engineering is used and studied for a handful of select purposes in ICT. In this chapter, we provide an overview of the state of the art in this area and draw conclusions on how it relates to our ability to verify the contents of a product from an untrusted vendor.

6.1 Application of Reverse Engineering in ICT

The most common form of reverse engineering in computer programming and engineering is deeply rooted in the everyday tasks of the engineer. The making of a program or a piece of electronic equipment largely consists of interactions with libraries and components that one has not made oneself. Everybody who has worked as an engineer in ICT will know that the interfaces of such libraries and components are hard to understand and often lack the necessary documentation. Reverse engineering the interfaces of the components you need is therefore something that every ICT engineer will have spent time on [6].

Reverse engineering for the purpose of understanding interfaces is largely uncontroversial. The reasons for doing it are clear and they are generally compatible with the interests of society. An exception to this is when a company intentionally keeps an interface secret for commercial reasons or for reasons related to security. Whether reverse engineering that interface is acceptable then becomes a legal as well as a moral question. In 1990, Sega Enterprises released a gaming console called Genesis. Its strategy was to let Sega and its licensed affiliates be the only developers of games for it. A California-based company called Accolade reverse engineered the interface of the Genesis gaming platform and successfully developed and sold games for it. In 1991, Accolade was sued by Sega for copyright infringement. The court ruled in Accolade’s favour because it had not copied any of Sega’s code and because of the public benefit of the additional competition in the market that Accolade represented [2]. Today, several decades later, interfaces are still kept secret and that secrecy is still challenged through reverse engineering. The jailbreaking of mobile phones bears witness to this phenomenon [12].

Another area where reverse engineering is applied is in the search for security holes. The reason for wanting to do this is much the same as for needing to reverse the interface of a component you require. When you include a component in your product, you also expose yourself to the component’s security vulnerabilities. Reverse engineering the component will therefore be necessary to understand the extent to which such vulnerabilities exist and to assess the degree to which they are inherited in your own product [4]. This is particularly important for components that implement cryptographic security protocols. These protocols are usually mathematically sound but the security they provide is very sensitive to subtle mistakes or shortcuts in implementation.

The emergence of software reverse engineering as a field in its own right largely came about as the result of the need to analyse malware. Whenever a system is infected, there is a need to identify the malware involved, how it infects the system, how it spreads, and what damage it can do or has already done. There is therefore a high demand for expertise in reverse engineering different types of malware and conducting forensic analyses of infected systems [10].

The most controversial use of reverse engineering is related to digital rights management (DRM). The age of digital equipment has turned copyrighted material such as recorded music, books, and movies into digital information. On one hand, it has become trivially easy to copy, distribute, and share such material and, on the other hand, it has become increasingly hard for copyright owners to protect their property. The term DRM is used to denote technologies invented for the purpose of protecting copyright holders while still making copyrighted material easily accessible to those consumers who paid for the right to enjoy it [8]. Reverse engineering of DRM protection schemes will allow the cracker to gain unprotected access to the material. It is therefore regularly done by software pirates, who, in many cases, will make cracked material available on file-sharing sites. The material then becomes downloadable for free for anyone who chooses to do so and the value of the copyright diminishes accordingly.

Even though reverse engineering is more complex and therefore less used in ICT than in other fields of engineering, it has developed into an area in which a computer scientist can specialize. In the upcoming sections, we discuss some of the tools that have been developed in this area. Understanding the tools of the trade will help us understand the state of the art. Ultimately, this will help us assess the extent to which it is possible to fully investigate a product bought from an untrusted vendor.

6.2 Static Code Analysis

In Chap. 3 we discussed the different layers of technology that constitute an ICT system. These layers span from the user interface of a system all the way down to the physical phenomena that allow us to build such systems in the first place. Then, in Chap. 4 we illustrated that many of these layers are hidden from the system developers themselves. Tools for hardware synthesis, compilers, and assemblers make the development process more efficient by relieving the engineers from having to relate to many of the technology layers.

A reverse engineer looking for malicious functionality inserted by an untrusted vendor will have to study all of the technology layers, as well as the interaction between them. The software part of this technology stack starts from machine code at the bottom. Here, none of the high-level notions known to programmers are present: there are no variables, arrays, structures, objects, sets, lists, trees, graphs, methods, or procedures. Rather, there are memory locations with a fixed number of bits in them, a clear distinction between registers close to the CPU and memory locations in caches and off-chip, and no obvious distinction between data, pointers to memory locations, and instructions.
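The missing distinction between data, pointers, and instructions can be made concrete with a minimal Python sketch. The four byte values are an arbitrary choice made for this illustration. The same bits are read as text, as an integer, and as a floating-point number, and nothing in the bytes themselves says which reading was intended.

```python
import struct

# Four arbitrary bytes, chosen only for this illustration.
raw = bytes([0x48, 0x65, 0x6C, 0x6C])

as_text = raw.decode("ascii")            # read as four ASCII characters
as_uint = struct.unpack("<I", raw)[0]    # read as a little-endian 32-bit unsigned integer
as_float = struct.unpack("<f", raw)[0]   # read as a little-endian 32-bit float

print(repr(as_text), hex(as_uint), as_float)
```

A reverse engineer faces the same ambiguity for every memory location and has to infer the intended reading from how the surrounding code uses it.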

Since reverse engineering of software largely consists of recreating the intentions and thoughts of the initial programmer, the reverse engineer will have to backtrack from the machine code and towards the high-level program code that was originally written. This is a huge challenge. In its pure form, machine code loaded into memory can easily consist of millions of memory locations, all containing 32 or 64 bits of information. This has to be translated back – first into assembly code and then, possibly, into an interpreter of byte-level code – before the high-level concepts used by the original programmer can be recreated.

To most programmers inexperienced in reverse engineering, this sounds undoable. All software engineers have experienced not understanding high-level code written by their colleagues and most will admit to cases in which they did not understand code they wrote themselves two months ago. Understanding code starting from machine-level instructions seems like an impossible task. Still, reverse engineering has celebrated significant successes in deciphering programmer interfaces, malware analysis, as well as cracking DRM schemes. This is largely due to the tool sets available. In the upcoming sections, we review the most important classes of reverse engineering tools.

6.3 Disassemblers

A disassembler is a relatively simple piece of software. It takes a sequence of machine code instructions encoded in binary format readable by the machine and translates the instructions one by one into the more humanly readable textual form of assembly code. Since the instruction set differs from platform to platform, a disassembler is generally platform specific [5].

Although disassemblers vary in strength, it is generally agreed that automatically recreating readable assembly code from machine code is doable [6]. In our case, however, it should be noted that we do not necessarily trust the platform either. This means that the hardware and firmware could implement undocumented side effects and undocumented instructions that will not be correctly interpreted by the disassembler.
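The one-by-one translation described above can be sketched in a few lines of Python. The instruction set here is hypothetical and invented for this illustration: every instruction is two bytes, an opcode followed by one operand, and the mnemonics are assumptions rather than those of any real platform.

```python
# Hypothetical fixed-width instruction set, invented for this sketch:
# each instruction is two bytes, an opcode followed by one operand.
MNEMONICS = {0x01: "LOAD", 0x02: "ADD", 0x03: "STORE", 0x04: "JUMP"}


def disassemble(code: bytes) -> list[str]:
    """Translate machine code into assembly text, one instruction at a time."""
    lines = []
    for addr in range(0, len(code) - 1, 2):
        opcode, operand = code[addr], code[addr + 1]
        name = MNEMONICS.get(opcode)
        if name is None:
            # Unknown opcodes are emitted as raw bytes rather than guessed at.
            lines.append(f"{addr:04x}: .byte 0x{opcode:02x}, 0x{operand:02x}")
        else:
            lines.append(f"{addr:04x}: {name} 0x{operand:02x}")
    return lines


# The last instruction uses an opcode that the table does not document.
program = bytes([0x01, 0x10, 0x02, 0x11, 0x03, 0x12, 0x7F, 0x00])
for line in disassemble(program):
    print(line)
```

The final instruction uses an opcode missing from the table. The sketch can only report it as raw bytes, which is the position a real disassembler is in when the hardware implements instructions that its documentation does not mention.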

6.4 Decompilers

A decompiler is a piece of software that does the opposite of what a compiler does. This means that it tries to recreate the original source code by analysing the executable binary file. This is a very difficult problem. On all but a very few platforms, actual recovery of the full original source code with comments and variable names is impossible. Most decompilers are complete in the sense that they construct source code that, if recompiled, will be functionally equivalent to the original program. Still, the results they produce may be extremely hard to understand and that is where the limitation of decompilers lies with respect to reverse engineering.

There are several reasons why understanding decompiled code is far harder than understanding the original source code. First, in the compilation process, much of the information that the programmer writes into the code to make it readable is removed. Clearly, all comments in the code are removed. Furthermore, no variable names will survive, since they are translated into memory locations in machine code. High-level concepts such as classes, objects, arrays, lists, and sets will not be readily recreated. The programmer’s structuring of the code into methods and procedures may be removed by the compiler and calls to these procedures may have been replaced by copying the code in-line. Furthermore, the compiler may create new procedures through the observation of repeated code and the flow graph of the machine code may be very different from that of the original source code. In addition, assignments containing arithmetical expressions will be replaced by the compiler with a highly optimized sequence of operations that renders the original assignment statement impossible to recreate [1].
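A small-scale version of this information loss can be observed with the bytecode compiler built into CPython. The sketch below is an illustration under that assumption, not a statement about native compilers: the source text and the variable name are made up for the example. After compilation, the comment is gone and the arithmetic expression has been folded into a single constant, so the original form of the assignment cannot be recovered from the compiled object.

```python
source = '''
# Seconds in one day, written out for readability.
seconds_per_day = 60 * 60 * 24
'''

code = compile(source, "<example>", "exec")

# The compiler has folded the arithmetic: only the result is stored.
print(code.co_consts)

# The compiled object still behaves as the source did.
namespace = {}
exec(code, namespace)
print(namespace["seconds_per_day"])
```

Python bytecode is far more forgiving than machine code, since it retains the variable name. A compiler that targets a real processor typically discards names as well, unless debug information is deliberately kept.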

These difficulties are such that some compare decompilation with the process of trying to bring back eggs from an omelette or the cow from a hamburger. This is true to the extent that the output from the decompiler will, in most cases, be far less readable than the original source code. On the other hand, it is important to note that all the information that makes the program do what it does will be present in the decompiled code as well. Decompilers are therefore valuable tools in most reverse engineering processes [6].

6.5 Debuggers

Whereas disassemblers and decompilers are tools that work on static program code, debuggers operate on code that is actively running. As the name suggests, the first debuggers were not intended for reverse engineering. Rather, they were tools intended to help programmers find programming mistakes.

A debugger allows a programmer to observe all actions of a program while it is running. Most of the actions of programs running outside of a debugger will be unobservable to the human eye. Usually, only user interactions are actually visible. The task of a debugger is to make all internal states and all internal actions of a program observable. A debugger can be instructed to stop the execution of a program at a given code line. When the specified code line is reached, the content of specific memory locations can be probed. The internal state of a program can thus be revealed at any point of the execution. Another useful feature of a debugger is that it will allow stepwise execution of the machine code. It is therefore possible to follow the flow of a program at a speed compatible with the speed of the human brain.
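The core mechanism, stopping at each line and probing the internal state, can be sketched with the tracing hook `sys.settrace` from the Python standard library. The helper `trace_locals` and the traced function `checksum` are hypothetical names introduced for this illustration. A real debugger offers far more, but the principle of observing state between steps is the same.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args), recording the local variables before each executed line."""
    snapshots = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace any other function
        if event == "line":
            offset = frame.f_lineno - target.co_firstlineno
            snapshots.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace function
    return result, snapshots


def checksum(values):
    total = 0
    for v in values:
        total = (total + v) % 251
    return total


result, snapshots = trace_locals(checksum, [200, 100])
for offset, local_vars in snapshots:
    print(offset, local_vars)
```

Each recorded entry pairs a line offset inside the traced function with a copy of its local variables at that moment, which is the kind of view a debugger gives when single-stepping.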

Debuggers are indispensable tools in most reverse engineering processes. They allow the reverse engineer to understand the control flow of a program, as well as how complex data structures are actually built and used. A debugger can therefore help fill the semantic void left by disassemblers and decompilers.

6.6 Anti-reversing

All instances of reverse engineering for the purposes of analysing malware, understanding undocumented interfaces, and cracking DRM schemes have one important thing in common: to reveal something that the original programmer intended to keep secret. Therefore, the development of reverse engineering schemes has run in parallel with the development of schemes trying to prevent reverse engineering.

Such anti-reversing schemes generally come in two flavours, where one is known as code obfuscation. The purpose of obfuscation techniques is to change the code into a representation that is semantically equivalent but where the structure of the code and data of the program are difficult to reconstruct. This can be achieved through the rearrangement of instructions, the insertion of irrelevant code, or the encryption of parts of the code. The arms race between the reverse engineers and obfuscators of code follows the same lines as that between malware makers and malware detectors and is largely being fought by the same people. Rather than giving a separate account for the race here, we refer to Chap. 7 and particularly to Sect. 7.4.
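As a hand-made illustration of semantic equivalence under obfuscation, the two Python functions below compute the same result, the sum of the even numbers in a list. The second version was written by hand for this sketch and is not the output of any real obfuscator. It uses a condition that is always true, a branch that is never taken, and a computation that has no effect on the result.

```python
def clear(values):
    """Readable version: the sum of the even numbers."""
    return sum(v for v in values if v % 2 == 0)


def obscured(a):
    """Hand-obfuscated version with the same input/output behaviour."""
    r, i, k = 0, 0, 7
    while i < len(a):
        x = a[i]
        if (x * x + x) % 2 == 0:      # always true: x * (x + 1) is even
            r += x * (1 - (x & 1))    # adds x only when its lowest bit is 0
        else:
            r -= k                    # dead code: this branch is never reached
        k = (k * 31 + i) % 97         # irrelevant computation, never used in r
        i += 1
    return r


for sample in ([1, 2, 3, 4, -6], [10, 7, 8], []):
    print(clear(sample), obscured(sample))
```

Even at this scale, seeing that the second function is the first one in disguise requires reasoning about each line. Automated obfuscation applies such transformations throughout an entire program.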

The second type of anti-reversing scheme consists of those that intend to render the tools of the reverse engineer useless. For all three tools discussed above (disassemblers, decompilers, and debuggers), there exist ways to confuse them. Disassemblers can be confused into interpreting data as instructions and instructions as data. This is particularly useful for architectures with variable-length instructions. Decompilers can be confused by machine code that cannot be reconstructed in the high-level language. One very simple example of this is the insertion of arbitrary unconditional jump statements into Java bytecode. Since Java does not have a goto statement, arbitrary unconditional jump statements are hard to decompile [3]. A more challenging way to confuse a decompiler is to exploit the lack of division between instructions and data that exists on most hardware platforms. In machine code, sequences of binary values can be computed in an arbitrarily complex way and, after they have been computed, they can be used as instructions. A decompiler will not be able to handle such situations because of the strong division between code and data that is assumed in most high-level languages.
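The confusion of a disassembler on a variable-length instruction set can be sketched as follows. The instruction set is hypothetical and invented for this example: `PUSH` and `JMP` occupy two bytes, `ADD` and `HALT` one. A single junk byte is placed directly after an unconditional jump, where it is never executed. A disassembler that decodes byte after byte reads the junk byte as the start of an instruction and produces a listing that differs from what the processor would actually run.

```python
# Hypothetical variable-length instruction set, invented for this sketch.
# opcode -> (mnemonic, total instruction length in bytes)
ISA = {0x00: ("HALT", 1), 0x01: ("PUSH", 2), 0x02: ("ADD", 1), 0x03: ("JMP", 2)}


def decode(code, addr):
    """Decode the instruction at addr; return (text, length in bytes)."""
    name, length = ISA.get(code[addr], (None, 1))
    if name is None or (length == 2 and addr + 1 >= len(code)):
        return f".byte 0x{code[addr]:02x}", 1
    if length == 2:
        return f"{name} {code[addr + 1]}", 2
    return name, 1


def linear_sweep(code):
    """Decode byte after byte from address 0, as a simple disassembler would."""
    out, addr = [], 0
    while addr < len(code):
        text, length = decode(code, addr)
        out.append(text)
        addr += length
    return out


def follow_control_flow(code):
    """Decode in execution order, honouring the forward offset of each JMP."""
    out, addr = [], 0
    while addr < len(code):
        text, length = decode(code, addr)
        out.append(text)
        if text == "HALT":
            break
        addr += length
        if text.startswith("JMP"):
            addr += code[addr - 1]  # skip as many bytes as the operand says
    return out


# JMP over one junk byte (0x01), then PUSH 2, PUSH 3, ADD, HALT.
program = bytes([0x03, 0x01, 0x01, 0x01, 0x02, 0x01, 0x03, 0x02, 0x00])
print(linear_sweep(program))
print(follow_control_flow(program))
```

In this sketch the linear listing shows a `PUSH 1` followed by an `ADD` where the executed program contains `PUSH 2`. Following the control flow repairs this particular case, but it does not help when jump targets are computed at run time.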

Debuggers can be beaten by having the program code detect that it is being run in debug mode and simply terminate if it finds that it is. The debugger will thus not be able to observe the dynamic behaviour of the code. The key to this approach is that it is generally impossible for a debugger to completely hide its presence. It will most often be visible in the set of processes running on the machine. Furthermore, to stop the execution of a program at arbitrary points, the debugger will have to change the program by inserting an interrupt instruction. Such changes can be detected by the program itself by calculating checksums on portions of its code. The arms race between debuggers and anti-debuggers has a counterpart in the dynamic detection of malware. This topic is discussed in Chap. 8 and the ways in which malware can detect that it is being observed are discussed in Sect. 8.5.
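Both detection ideas can be sketched in Python. The first check asks the interpreter whether a trace function, such as one installed through `sys.settrace`, is active. The second compares a checksum of the code bytes with a value recorded beforehand. The byte sequence standing in for the program's code is arbitrary and the function names are hypothetical. The value 0xCC is the one-byte breakpoint instruction (INT3) on x86.

```python
import sys
import zlib

BREAKPOINT = 0xCC  # the one-byte INT3 breakpoint opcode on x86


def tracer_attached() -> bool:
    """Report whether a Python-level trace function is currently installed."""
    return sys.gettrace() is not None


def code_is_intact(code: bytes, expected_crc: int) -> bool:
    """Compare a checksum of the code bytes with the value recorded earlier."""
    return zlib.crc32(code) == expected_crc


# Stand-in for a program's machine code; the byte values are arbitrary.
original = bytes([0x55, 0x48, 0x89, 0xE5, 0x90, 0x5D, 0xC3])
recorded_crc = zlib.crc32(original)

# A debugger setting a software breakpoint overwrites one byte of the code.
patched = bytearray(original)
patched[4] = BREAKPOINT

print(tracer_attached())
print(code_is_intact(original, recorded_crc))
print(code_is_intact(bytes(patched), recorded_crc))
```

A program protected in this way would terminate or change its behaviour when either check fails. Hardware breakpoints and emulation can sidestep the checksum test, which is one reason the arms race continues.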

A combination of anti-reversing techniques can make the reverse engineering of code arbitrarily complex, even for the simplest program functionality. On the positive side, these anti-reversing techniques all come at a cost: they either make the code less efficient, longer, or both. Unfortunately for us, short and efficient code may not be important criteria for a dishonest vendor of digital equipment. The cost of implementing anti-reversing techniques is therefore not likely to help us.

6.7 Hardware

Reverse engineering a computer chip is, in many ways, an art form. A plethora of techniques are available for removing thin layers of material and then identifying the structure of logic gates that constitute the chip. An overview of some of the techniques is given by Torrance and James [11]. Based on these techniques, it is often stated that any integrated circuit can be reverse engineered, given sufficient resources. The key to understanding this statement lies in quantifying what is meant by the term sufficient resources.

Nowadays a chip can consist of hundreds of millions of logic gates spread over a number of metal layers that runs into double digits. Each logic gate performs an extremely simple operation; thus, the complex operation of a chip is a product of the interactions between these gates. There is currently no mature methodology that can produce high-level concepts from such a set of gate-level designs. Actually, finding a word-level structure from bit-level gates is still considered a difficult problem and, even when that problem is solved, we are very far from having understood a complete chip [7]. Fully reverse engineering a modern complex chip to the extent that all details of its operation are understood, down to the impact of every bit-level gate, is practically impossible, given the amount of effort it would require.
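The gap between bit-level gates and word-level meaning can be illustrated with a very small circuit. The Python sketch below simulates a chain of one-bit full adders built only from XOR, AND, and OR operations. The function names and the four-bit width are choices made for this example. The reverse engineer sees only the gates. Recognizing that together they compute the word-level operation x + y is the structure recovery problem referred to above.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built from XOR, AND, and OR gates."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out


def gate_level_circuit(x_bits, y_bits):
    """Chain of full adders over little-endian bit lists; returns the output bits."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]


def to_bits(n, width):
    """Little-endian list of the lowest `width` bits of n."""
    return [(n >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


WIDTH = 4
# Exhaustive comparison against word-level addition, feasible only at this size.
matches = all(
    from_bits(gate_level_circuit(to_bits(x, WIDTH), to_bits(y, WIDTH))) == x + y
    for x in range(2 ** WIDTH)
    for y in range(2 ** WIDTH)
)
print(matches)
```

With four-bit inputs the hypothesis can be confirmed by trying all 256 input pairs. For a chip with hundreds of millions of gates, neither forming the hypothesis nor checking it exhaustively is within reach.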

However, this may change in the future. As discussed above, strong tool sets are readily available for building high-level structures from low-level software and it is reasonable to assume that similar advances can be achieved for hardware. On the other hand, the development of such tools for hardware will clearly lead to the development of hardware obfuscation techniques as well. The possible future balance of power between hardware reverse engineers and hardware obfuscators is hard to predict. Still, a reasonable guess is that it will converge to a state similar to the balance of power found in the software domain. If this happens, hardware obfuscation techniques will reach a state where reverse engineering can be made arbitrarily complex but not theoretically impossible.

6.8 Discussion

Reverse engineering has played vital roles in most areas of engineering. It is used to understand the technical ideas behind competing products and the ease with which a product can be reverse engineered has been a driver behind such legal institutions as patents. In most fields, it is considered nearly impossible to include an idea in a product without revealing that idea to anyone who picks the product apart. Stating that ships, cars, buildings, and bridges have not been built according to specifications has also been the basis of legal claims.

In ICT, however, things are different. Implementing an idea into a product can often be done without disclosing the idea itself and, to support this, obfuscation techniques have been developed that make it even more difficult to extract engineering ideas from analysing a product. One consequence is that patents have played a lesser role in ICT equipment than could be expected from the technical complexity involved. Finding out whether a patented idea has been copied into a competing product is often a highly non-trivial task in itself.

The importance of reverse engineering in ICT is nevertheless evident. Major successes have been celebrated by reverse engineering teams in all important application areas. Programmer interfaces have been reversed to allow for the correct use and inclusion of components, the reverse engineering of malware has allowed us to better protect ourselves, the reverse engineering of DRM schemes has changed the course of entire industries, and the reverse engineering of cryptographic protocols has revealed weaknesses to be exploited or removed.

All of these successes have one thing in common: either they relate to relatively small pieces of code, or the reverse engineering team was able to narrow the focus of the effort down to a sufficiently limited code area to make it tractable. For our case, this will not suffice. Depending on the intent of the dishonest vendor, the unwanted functionality can be placed anywhere in the product. Kill switches can be placed anywhere in the hundreds of millions of transistors on a given chip. They can be placed anywhere in the firmware of a CPU so that it is rendered useless when a given combination of machine instructions is executed. As argued in Chap. 3, a kill switch can be placed anywhere in the operating system (in the device drivers, the hypervisors, the bytecode interpreters, the dynamic link libraries, or in an application itself) and, as explained in Chap. 4, can be introduced by any development tool used by the developers.

Morrison and colleagues estimated that a full analysis of the Windows code base should take between 35 and 350 person-years, even if the source code is available to the reverse engineers [9]. Knowing that the Windows operating system is only a small part of the total technology stack and that the expected lifetime of this code base is only a handful of years, it becomes evident that the state of the art in reverse engineering falls far short of being a satisfactory answer to the problem of untrusted vendors. It is, however, likely that reverse engineering will play a central role in the future of this problem. Reverse engineering is and will remain the field that most directly addresses the core of our problem.