1 Introduction

Symbolic execution [1] has become an increasingly important technique for automated software analysis, e.g., generating test cases, finding bugs, and detecting security vulnerabilities [2,3,4,5,6,7,8,9,10,11]. There have been many recent approaches to symbolic execution [12,13,14,15,16,17,18,19,20,21,22]. Generally speaking, these approaches can be classified into two categories: online symbolic execution (e.g., BitBlaze [4], klee [5], and \(\textsc {s}^2\textsc {e}\) [6]) and concolic execution (a.k.a. offline symbolic execution, e.g., CUTE [2], DART [3], and SAGE [7]). Online symbolic execution closely couples the Symbolic Execution Engine (see) with the System Under Test (sut) and explores all possible execution paths of the sut online at once. Concolic execution, on the other hand, decouples the see from the sut through traces: it concretely runs a single execution path of a sut and then symbolically executes it.

Both online and offline symbolic execution face new challenges, as computer software is experiencing explosive growth in both complexity and diversity, ushered in by the proliferation of cloud computing, mobile computing, and the Internet of Things. Two major challenges are: (1) the sut involves many types of software for different hardware platforms, and (2) the sut involves many components distributed on different machines, so that the sut as a whole cannot fit in any see. In this paper, we focus on how to extend concolic execution to satisfy the needs of analyzing emerging software systems. Two major observations are behind our efforts to extend concolic execution:

  • The decoupled architecture of concolic execution provides the flexibility to integrate new trace-capturing frontends for emerging platforms.

  • The trace-based nature of concolic testing offers opportunities for selectively capturing and synthesizing reduced system-level traces for scalable analysis.

We present crete, a versatile binary-level concolic testing framework, which features an open and highly extensible architecture allowing easy integration of concrete execution frontends and symbolic execution backends. crete’s extensibility is rooted in its modular design, where concrete and symbolic execution are loosely coupled only through standardized execution traces and test cases. The standardized execution traces are llvm-based, self-contained, and composable, providing succinct and sufficient information for the see to reproduce the concrete executions. The crete framework is composed of:

  • A tracing plugin, which is embedded in the concrete execution environment, captures binary-level execution traces of the sut, and stores the traces in a standardized trace format.

  • A manager, which archives the captured execution traces and test cases, schedules concrete and symbolic execution, and implements policies for selecting the traces and test cases to be analyzed and explored next.

  • A replayer, which is embedded in the symbolic execution environment, performs concolic execution on captured traces for test case generation.

We have implemented the crete framework on top of qemu [23] and klee, particularly the tracing plugin for qemu, the replayer for klee, and the manager that coordinates qemu and klee to exchange runtime traces and test cases and manages the policies for prioritizing runtime traces and test cases. To validate crete’s extensibility, we have also implemented a tracing plugin for the 8051 emulator [24]. The trace-based architecture of crete has enabled us to integrate such tracing frontends seamlessly. To demonstrate its effectiveness and capability, we evaluated crete on GNU Coreutils programs and TianoCore utility programs for UEFI BIOS, and compared with klee and angr, two state-of-the-art open-source symbolic executors for automated program analysis at the source level and binary level, respectively.

The crete framework makes several key contributions:

  • Versatile concolic testing. crete provides an open and highly extensible architecture allowing easy integration of different concrete and symbolic execution environments, which communicate with each other only by exchanging standardized traces and test cases. This significantly improves the applicability and flexibility of concolic execution on emerging platforms and is amenable to leveraging new advancements in symbolic execution.

  • Standardizing runtime traces. crete defines a standard binary-level trace format, which is llvm-based, self-contained, and composable. Such a trace is captured during concrete execution and represents an execution path of a sut. It contains succinct and sufficient information for reproducing the execution path in other program analysis environments, such as for symbolic execution. Having standardized traces minimizes the need to convert traces for different analysis environments and provides a basis for common trace-related optimizations.

  • A prototype implementation. We have implemented crete with klee as the see backend and multiple concrete execution frontends such as qemu and the 8051 emulator. crete achieved code coverage on Coreutils binaries comparable to klee analyzing directly at the source level, and generally outperformed angr. crete also found 84 distinct and previously unreported crashes in widely used and extensively tested utility programs for UEFI BIOS development. We also make the crete implementation publicly available to the community at github.com/SVL-PSU/crete-dev.

2 Related Work

DART [3] and CUTE [2] are early representative works on concolic testing. They operate at the source code level. crete further extends concolic testing and targets closed-source binary programs. SAGE [7] is a Microsoft internal concolic testing tool that particularly targets x86 binaries on Windows. crete is platform agnostic: as long as a trace from concrete execution can be converted into the llvm-based trace format, it can be analyzed to generate test cases.

klee [5] is a source-level symbolic executor built on the llvm infrastructure [25] and is capable of generating high-coverage test cases for C programs. crete adopts klee as its see and extends it to perform concolic execution on standardized binary-level traces. \(\textsc {s}^2\textsc {e}\) [6] provides a framework for developing tools for analyzing closed-source software programs. It augments a Virtual Machine (vm) with a see and path analyzers, and features a tight coupling of concrete and symbolic execution. crete takes a loosely coupled approach to the interaction of concrete and symbolic execution: it captures complete execution traces of the sut online and conducts whole-trace symbolic analysis offline.

BitBlaze [4] is an early representative work on binary analysis for computer security. It and its follow-up works Mayhem [8] and MergePoint [12] focus on optimizing the tight coupling of concrete and symbolic execution to improve the effectiveness of detecting exploitable software bugs. crete has a different focus: providing an open architecture for binary-level concolic testing that enables flexible integration of various concrete and symbolic execution environments.

angr [14] is an extensible Python framework for binary analysis that uses VEX [26] as its intermediate representation (IR). It implements a number of existing analysis techniques and enables the comparison of different techniques within a single platform. angr needs to load a sut into its own virtual environment for analysis, so it has to model the real execution environment of the sut, such as system calls and common library functions. crete, in contrast, performs in-vivo binary analysis by analyzing binary-level traces captured from the unmodified execution environment of a sut. Also, angr needs to maintain execution states for all paths being explored at once, while crete reduces memory usage dramatically by analyzing a sut path by path and separating symbolic execution from tracing.

Our work is also related to fuzz testing [27]. A popular representative fuzzing tool is AFL [28]. Fuzzing is fast and quite effective for bug detection; however, it can easily get stuck when a specific input, such as a magic number, is required to pass a check and explore new paths of a program. Concolic testing guides the generation of test cases by solving constraints collected from source code or binary execution traces and is quite effective at generating such complicated inputs. Therefore, fuzzing and concolic testing are complementary software testing techniques.

3 Overview

During the design of the crete framework for binary-level concolic testing, we have identified the following design goals:

  • Binary-level In-vivo Analysis. It should require only the binary of the sut and perform analysis in its real execution environment.

  • Extensibility. It should allow easy integration of concrete execution frontends and see backends.

  • High Coverage. It should achieve coverage that is not significantly lower than the coverage attainable by source-level analysis.

  • Minimal Changes to Existing Testing Processes. It should simply provide additional test cases that can be plugged into existing testing processes without requiring major changes to them.

To achieve the goals above, we adopt an online/offline approach to concolic testing in the design of the crete framework:

  • Online Tracing. As the sut is concretely executed in a virtual or physical machine, an online tracing plugin captures the binary-level execution trace into a trace file.

  • Offline Test Generation. An offline see takes the trace as input, injects symbolic values and generates test cases. The new test cases are in turn applied to the sut in the concrete execution.

This online tracing and offline test generation process is iterative: it repeats until all generated test cases are issued or time bounds are reached. We extend this process to satisfy our design goals as follows.

  • Execution traces of a sut are captured at the binary level in its unmodified execution environment. The tracing plugin can be an extension to a vm (Sect. 4.1), a hardware tracing facility, or a dynamic binary instrumentation tool such as PIN [29] or DynamoRIO [30].

  • The concrete and symbolic execution environments are decoupled by standardized traces (Sect. 4.2). As long as they can generate and consume standardized traces, they can work together as a cohesive concolic process.

  • Optimizations can be explored in both tracing and test case generation, for example, selective binary-level tracing to improve scalability (Sect. 4.3) and concolic test generation to reduce test case redundancy (Sect. 4.4). This makes high-coverage test generation at the binary level possible.

  • The tracing plugin is transparent to existing testing processes, as it only collects information. Therefore, no change is made to the testing processes.
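
Putting the pieces above together, the following is a minimal sketch of the iterative online-tracing/offline-test-generation loop coordinated by the Manager. All types and the two callables are hypothetical placeholders, not crete's actual interfaces; the sketch only illustrates how test cases and traces circulate between the concrete and symbolic sides.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

struct Trace    { std::string id; };                      // a captured standardized trace
struct TestCase { std::vector<unsigned char> bytes; };    // a concrete input assignment

// The Manager needs two capabilities, passed in as callables so the sketch stays
// self-contained:
//   trace(tc)  - run tc concretely in the vm and return the captured trace
//   replay(tr) - concolically replay tr in the see and return new test cases
void concolic_loop(TestCase seed,
                   std::function<Trace(const TestCase&)> trace,
                   std::function<std::vector<TestCase>(const Trace&)> replay,
                   int max_iterations) {
    std::deque<TestCase> pending{seed};
    for (int i = 0; i < max_iterations && !pending.empty(); ++i) {
        TestCase tc = pending.front(); pending.pop_front();  // selection policy goes here
        Trace tr = trace(tc);                                // online tracing
        for (TestCase& fresh : replay(tr))                   // offline test generation
            pending.push_back(fresh);                        // explore in later iterations
    }
}
```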

4 Design

In this section, we present the design of crete with a vm as the concrete execution environment. The reason for selecting a vm is that it allows complete access to the whole system for tracing runtime execution states, and mature open-source vms are readily available.

Fig. 1. crete architecture

4.1 crete Architecture

As shown in Fig. 1, crete has four key components: Runner, a tiny helper program executing in the guest OS of the vm, which parses the configuration file and launches the target binary program (tbp) with the configuration and test cases; Tracer, a comprehensive tracing plug-in in the vm, which captures binary-level traces from the concrete execution of the tbp in the vm; Replayer, an extension of the see, which enables the see to perform concolic execution on the captured traces and to generate test cases; Manager, a coordinator that integrates the vm and see, which manages runtime traces captured and test cases generated, coordinates the concrete and symbolic execution in the vm and the see, and iteratively explores the tbp.

crete takes a tbp and a configuration file as inputs, and outputs generated test cases along with a report of detected bugs. The manual effort and learning curve required to use crete are minimal: setting up the testing environment for the tbp in a crete-instrumented vm is virtually the same as in a vanilla vm. The configuration file is the interface for users to configure parameters for testing a tbp, especially the number and size of symbolic command-line inputs and symbolic files for test case generation.
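
The structure below is a purely illustrative sketch of the information such a configuration conveys; the actual file format and field names used by crete may differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: a symbolic command-line input and a symbolic file, each
// described by its size, as discussed above.
struct SymbolicArg  { std::size_t size_bytes; };
struct SymbolicFile { std::string path; std::size_t size_bytes; };

struct CreteConfig {
    std::string target_binary;                 // path of the tbp inside the guest OS
    std::vector<std::string> concrete_args;    // arguments kept concrete
    std::vector<SymbolicArg>  symbolic_args;   // number and size of symbolic inputs
    std::vector<SymbolicFile> symbolic_files;  // symbolic files for test generation
};
```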

4.2 Standardized Runtime Trace

To enable the modular and plug-and-play design of crete, a standardized binary-level runtime trace format is needed. A trace in this format must capture sufficient information from the concrete execution so that the trace can be faithfully replayed within the see. To integrate a concrete execution environment into the crete framework, only a plug-in for that environment needs to be developed, so that its concrete execution traces can be stored in the standard file format. Similarly, to integrate a see into crete, the engine only needs to be adapted to consume trace files in that format.

We define the standardized runtime trace format based on the llvm assembly language [31]. The reasons for selecting the llvm instruction set are: (1) it has become a de-facto standard for compiler design and program analysis [25, 32]; (2) there are many program analysis tools based on the llvm assembly language [5, 33,34,35]. A standardized binary-level runtime trace is packed as a self-contained llvm module that is directly consumable by an llvm interpreter. It is composed of: (1) a set of assembly-level basic blocks in the form of llvm functions; (2) a set of hardware states in the form of llvm global variables; (3) a set of crete-defined helper functions in llvm assembly; and (4) a main function in llvm assembly. The set of assembly-level basic blocks is captured from a concrete execution of a tbp. It is normally translated from another format (such as qemu-ir) into llvm assembly, and each basic block is packed as an llvm function. The set of hardware states are runtime states along the execution of the tbp. It consists of CPU states, memory states, and possibly states of other hardware components, which are packed as llvm global variables. The helper functions are provided by crete to correlate the captured hardware states with the captured basic blocks and to expose an interface to the see. The main function represents the concrete execution path of the tbp. It contains a sequence of calls to the captured basic blocks (llvm functions) and calls to the crete-defined helper functions with appropriate hardware states (llvm global variables).

Fig. 2. Example of standardized runtime trace

An example of a standardized runtime trace of crete is shown in Fig. 2. The first column of the figure is a complete execution path of a program with given concrete inputs, in the form of assembly-level pseudo-code. Assume the basic blocks BB_1 and BB_3 are of interest and are captured by the crete Tracer, while the other basic blocks are not (see Sect. 4.3 for details). As shown in the second and third columns of the figure, hardware states are captured in two categories: the initial state and the side effects from basic blocks that are not captured. As shown in the fourth column, the captured basic blocks are packed as llvm functions, and the captured hardware states are packed as llvm global variables in the standardized trace. A main function is also added, making the trace a self-contained llvm module. The main function first invokes crete helper functions to initialize the hardware states and then calls into the first basic-block llvm function. Before it calls into the second basic-block llvm function, the main function invokes crete helper functions to update the hardware states. For example, before calling asm_BB_3, it calls the function sync_state to update register r1 and memory location 0x5678, which are the side effects introduced by BB_2.
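
The real trace is llvm assembly; the following is only a conceptual C++ rendering of the module structure described above, using the names asm_BB_1, asm_BB_3, and sync_state from Fig. 2 and placeholder values where the figure's concrete values are not reproduced here.

```cpp
#include <cstdint>

// (2) Hardware states as globals: the captured initial CPU state (registers);
// captured memory bytes would be additional globals and are omitted here.
struct CpuState { uint64_t r1; uint64_t r2; };
static CpuState cpu;

// (1) Captured basic blocks, one function per block (instruction bodies elided).
static void asm_BB_1(CpuState*) { /* captured instructions of BB_1 */ }
static void asm_BB_3(CpuState*) { /* captured instructions of BB_3 */ }

// (3) crete-defined helpers correlating captured states with captured blocks.
static void crete_init_state(CpuState* s) { *s = CpuState{}; /* plus initial memory */ }
static void sync_state(CpuState* s, uint64_t r1_value /*, memory updates ... */) {
    s->r1 = r1_value;   // apply side effects of the uncaptured block BB_2
}

// (4) The main function replaying the concrete path: initialize state, run BB_1,
// synchronize BB_2's side effects, then run BB_3.
int main() {
    crete_init_state(&cpu);
    asm_BB_1(&cpu);
    sync_state(&cpu, /*r1_value=*/0 /* placeholder */);
    asm_BB_3(&cpu);
    return 0;
}
```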

4.3 Selective Binary-Level Tracing

A major part of a standardized trace is the assembly-level basic blocks, which are essentially binary-level instruction sequences representing a concrete execution of a tbp. It is challenging and also unnecessary to capture the complete execution of a tbp. First, software binaries can be very complex: if we captured the complete execution, the trace file could be prohibitively large and difficult for the see to consume and analyze. Second, an executing tbp commonly invokes many runtime libraries (such as libc) that are of no interest to testers. Therefore, an automated way of selecting the code of interest is needed.

crete utilizes Dynamic Taint Analysis (DTA) [36] to achieve selective tracing. The DTA algorithm is part of the crete Tracer. It tracks the propagation of tainted values, normally specified by users, during the execution of a program. It works at the binary level and at byte-wise granularity. Using DTA, the crete Tracer captures only the basic blocks that operate on tainted values, while capturing only the side effects of all other basic blocks. For the example trace in Fig. 2, if the tainted value comes from the user's input to the program and is stored at memory location 0x1234, DTA captures basic blocks BB_1 and BB_3, because both of them operate on tainted values, while the other two basic blocks do not touch tainted values and are not captured.
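
A minimal sketch of this selection criterion follows, with hypothetical types: a block is captured only if it reads at least one tainted byte. The taint propagation shown is a coarse over-approximation (marking every byte written by a captured block); real DTA follows data flow per instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct MemAccess  { uint64_t addr; std::size_t size; bool is_write; };
struct BasicBlock { std::vector<MemAccess> accesses; /* instructions omitted */ };

struct TaintTracker {
    std::unordered_set<uint64_t> tainted;   // tainted byte addresses (e.g., user input)

    // Capture decision: does this block read any tainted byte?
    bool reads_taint(const BasicBlock& bb) const {
        for (const auto& a : bb.accesses)
            if (!a.is_write)
                for (std::size_t i = 0; i < a.size; ++i)
                    if (tainted.count(a.addr + i)) return true;
        return false;
    }
    // Coarse propagation for a captured block: bytes it writes become tainted.
    void propagate(const BasicBlock& bb) {
        for (const auto& a : bb.accesses)
            if (a.is_write)
                for (std::size_t i = 0; i < a.size; ++i) tainted.insert(a.addr + i);
    }
};
```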

The crete Tracer captures the initial CPU state by taking a copy of the CPU state before the first basic block of interest is executed. The initial CPU state is normally a set of register values. As shown in Fig. 2, the initial CPU state is captured before instruction (1). Naïvely, the initial memory state could be captured in the same way; however, the typical size of memory makes it impractical to dump entirely. To minimize the trace size, the crete Tracer only captures the parts of memory that are accessed by the captured read instructions, such as instructions (1) and (9). The memory touched by the captured write instructions, such as instructions (3) and (11), can be ignored, because the state of this part of memory is already determined by the captured write instructions. As a result, the crete Tracer monitors every memory read instruction of interest, capturing memory as needed on the fly. In the example above, there are two memory read instructions. The crete Tracer monitors both of them, but only keeps the memory state taken at instruction (1) as part of the initial memory state, because instructions (1) and (9) access the same address.
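
The bookkeeping this implies can be sketched as follows (hypothetical class and method names): a byte's value is recorded as initial state only the first time the trace reads it, and addresses already written by captured instructions are skipped.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

class InitialMemoryCapture {
public:
    // Called for every monitored memory read in a captured basic block.
    void on_captured_read(uint64_t addr, uint8_t current_value) {
        // Skip addresses whose content is already determined by the trace itself:
        // previously captured reads, or bytes written by captured write instructions.
        if (written_.count(addr) || initial_.count(addr)) return;
        initial_[addr] = current_value;   // becomes part of the initial memory state
    }
    // Called for every monitored memory write in a captured basic block.
    void on_captured_write(uint64_t addr) { written_.insert(addr); }

    const std::unordered_map<uint64_t, uint8_t>& initial_state() const { return initial_; }

private:
    std::unordered_map<uint64_t, uint8_t> initial_;
    std::unordered_set<uint64_t> written_;
};
```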

The side effects on hardware states are captured by monitoring uncaptured write instructions to hardware states. In the example in Fig. 2, instructions (5) and (6) write CPU registers, which cause side effects on the CPU state. The crete Tracer monitors those instructions and keeps the updated register values as part of the runtime trace. As register r1 is updated twice by the two instructions, only the last update is kept in the runtime trace. Similarly, the crete Tracer captures the side effect on memory at address 0x5678 by monitoring instruction (7).
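
The "last update wins" recording of side effects can be sketched as a simple map per register and per memory address (hypothetical names); the recorded values are what a sync_state()-style helper call would apply before the next captured basic block runs.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct SideEffects {
    std::map<std::string, uint64_t> registers;  // e.g., "r1" -> latest written value
    std::map<uint64_t, uint8_t>     memory;     // address -> latest written byte

    // Called when an uncaptured instruction writes a register or a memory byte;
    // overwriting earlier entries keeps only the final value, as described above.
    void on_uncaptured_reg_write(const std::string& reg, uint64_t value) {
        registers[reg] = value;
    }
    void on_uncaptured_mem_write(uint64_t addr, uint8_t value) {
        memory[addr] = value;
    }
};
```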

4.4 Concolic Test Case Generation

While a standardized trace is a self-contained llvm module that can be directly executed by an llvm interpreter, it also exposes interfaces for the see to inject symbolic values for test case generation. Normally, a see injects symbolic values by making a variable in the source code symbolic. From the source code level to the machine code level, references to variables by name become memory accesses by address. For instance, a reference to a concrete input variable of a program becomes an access to the piece of memory that stores the state of that input variable. While capturing a trace, crete injects a self-defined helper function, crete_make_concolic, into the captured basic blocks. This helper function provides the address and size of the piece of memory at which to inject symbolic values, along with a name that offers better readability for test case generation. By intercepting this helper function, the see can introduce symbolic values at the right time and place.
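
The sketch below illustrates both sides of this interface under stated assumptions: the signature of crete_make_concolic is inferred from the description above (address, size, name) rather than taken from crete's code, and the replayer-side types are hypothetical. The handler replaces the named bytes with symbolic values seeded with their concrete contents, so replay initially follows the original path.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Interface function emitted into captured basic blocks by the Tracer; the exact
// signature is an assumption based on the description above.
extern "C" void crete_make_concolic(void* addr, size_t size, const char* name);

struct SymbolicByte { std::string label; uint8_t seed; };

// Replayer-side sketch: when symbolic replay reaches the call above, mark the
// bytes at [addr, addr + size) symbolic, seeded with their concrete values.
struct ReplayerMemory {
    std::map<uint64_t, uint8_t>      concrete;   // concrete bytes from the trace
    std::map<uint64_t, SymbolicByte> symbolic;   // bytes made symbolic so far

    void make_concolic(uint64_t addr, size_t size, const std::string& name) {
        for (size_t i = 0; i < size; ++i) {
            uint8_t seed = concrete.count(addr + i) ? concrete[addr + i] : 0;
            symbolic[addr + i] = SymbolicByte{ name + "[" + std::to_string(i) + "]", seed };
        }
    }
};
```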

A standardized trace in crete represents only a single path of a tbp, as shown in Fig. 3(a). Test case generation on this trace with naïve symbolic execution by the see would not be effective, as it ignores the single-path nature of the trace. As illustrated in Fig. 3(b), naïve symbolic replay of a crete trace produces execution states and test cases exponential in the number of branches within the trace. As shown in Fig. 3(c), with concolic replay of a crete trace, the see in crete maintains only one execution state, requiring minimal memory, and generates a more compact set of test cases whose number is linear in the number of branches in the trace. For a branch instruction in a captured basic block, if both paths are feasible given the constraints collected so far on the symbolic values, the see in crete keeps only the execution state of the path that was taken by the original concrete execution in the vm, by adding the corresponding constraint of this branch instruction, while generating a test case for the other path by solving the constraints with the negated branch condition. The generated test case can lead the tbp to a different execution path later during concrete execution in the vm.
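
The per-branch logic can be summarized in the following sketch (hypothetical constraint and solver types, with the solver stubbed out): the replayer follows the concretely taken side without forking, and each feasible not-taken side yields exactly one new test case, which is why the state count stays at one and test cases grow linearly with the number of branches.

```cpp
#include <vector>

struct Constraint { /* symbolic branch condition; details omitted */ };
struct TestCase   { /* concrete input assignment; details omitted */ };

struct Solver {
    // Stub: a real solver would check satisfiability and fill in 'tc'.
    bool solve(const std::vector<Constraint>&, TestCase& tc) { (void)tc; return true; }
};

struct ConcolicReplayer {
    std::vector<Constraint> path;        // constraints of the concretely taken path
    std::vector<TestCase>   generated;   // test cases for the not-taken sides
    Solver solver;

    void on_symbolic_branch(const Constraint& taken, const Constraint& negated) {
        std::vector<Constraint> other = path;
        other.push_back(negated);        // negate the branch condition
        TestCase tc;
        if (solver.solve(other, tc))     // other side feasible: emit one new test case
            generated.push_back(tc);
        path.push_back(taken);           // keep following the concrete execution; no fork
    }
};
```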

Fig. 3. Execution tree of the example trace from Fig. 2: (a) for concrete execution, (b) for symbolic execution, and (c) for concolic execution.

4.5 Bug and Runtime Vulnerability Detection

crete detects bugs and runtime vulnerabilities in two ways. First, all the native checks embedded in the see are performed during symbolic replay over the trace captured from concrete execution. If a check is violated, a bug report is generated and associated with the test case that was used in the vm to generate the trace. Second, since crete does not change the native testing process and simply provides additional test cases that can be applied in that process, all the bug and vulnerability checks used in the native process remain effective in detecting bugs and vulnerabilities triggered by crete-generated test cases. For instance, Valgrind [26] can be utilized to detect memory-related bugs and vulnerabilities along the paths explored by crete test cases.

5 Implementation

To demonstrate the practicality of crete, we have implemented its complete workflow with qemu [23] as the frontend and klee [5] as the backend. To demonstrate the extensibility of crete, we have also developed a tracing plug-in for the 8051 emulator, which readily replaces qemu.

Tracer for qemu: To give crete the best potential to support the various guest platforms that qemu supports, the crete Tracer captures basic blocks in the form of qemu-ir. To convert the captured basic blocks into the standardized trace format, we implemented a qemu-ir to llvm translator based on the x86-to-llvm translator of \(\textsc {s}^2\textsc {e}\) [37]. We offload this translation from runtime tracing into a separate offline process to reduce the runtime overhead of the crete Tracer. qemu maintains its own virtual states to emulate the physical hardware state of a guest platform. For example, it utilizes a virtual memory state and a virtual CPU state to emulate the states of physical memory and the CPU. These virtual states of qemu are essentially source-level structs. The crete Tracer captures hardware states by monitoring the runtime values of those structs maintained by qemu. qemu emulates hardware operations by manipulating these virtual states through corresponding helper functions defined in qemu. The crete Tracer captures the side effects on the virtual hardware states by monitoring the invocations of those helper functions. As a result, the initial hardware states being captured are the runtime values of these qemu structs, and the side effects being captured are the side effects on those structs from uncaptured instructions.
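
A much-simplified, hypothetical sketch of monitoring side effects around a helper-function invocation follows; qemu's actual virtual CPU structs and helper signatures are far richer and are not reproduced here. The idea is just to diff the virtual state around the call and keep the final values of whatever the helper modified.

```cpp
#include <cstdint>
#include <map>

// Stand-in for qemu's virtual CPU state struct (hypothetical).
struct VirtualCpu { uint64_t regs[16]; };

struct HelperMonitor {
    std::map<int, uint64_t> reg_side_effects;   // register index -> last written value

    // Wrap an uncaptured helper-function invocation: diff the virtual CPU state
    // before and after the call and record only the registers it changed.
    template <typename Helper>
    void monitor(VirtualCpu& cpu, Helper&& helper) {
        VirtualCpu before = cpu;
        helper(cpu);                            // qemu emulates the hardware operation
        for (int i = 0; i < 16; ++i)
            if (cpu.regs[i] != before.regs[i])
                reg_side_effects[i] = cpu.regs[i];
    }
};
```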

Replayer for klee: klee takes as input llvm modules compiled from C source code. As a crete trace is a self-contained llvm module, the crete Replayer mainly injects symbolic values and achieves concolic test generation. To inject symbolic values, the crete Replayer provides a special function handler for the crete interface function crete_make_concolic. klee is natively an online symbolic executor, which forks execution states at each feasible branch and explores all execution paths by maintaining multiple execution states simultaneously. To achieve concolic test generation, the crete Replayer extends klee to generate test cases only for feasible branches without forking states.

Tracer for 8051 Emulator: The 8051 emulator executes an 8051 binary directly by interpreting its instructions sequentially. For each type of instruction, the emulator provides a helper function; interpreting an instruction entails calling this function to compute and update the relevant register and memory states. The tracing plug-in for the 8051 emulator extends the interpreter: when the interpreter executes an instruction, an llvm call to the corresponding helper function is put into the runtime trace. The 8051 instruction-processing helper functions are compiled into llvm and incorporated into the runtime trace, serving as the helper functions that map the captured instructions to the captured runtime states. The initial runtime state is captured from the 8051 emulator before the first instruction is executed. The resulting trace is of the same format as that from qemu and is readily consumable by klee.
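
As a rough sketch (hypothetical names), the plug-in only needs to append one call record per interpreted instruction; these records later become llvm calls to the helper functions compiled into llvm.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One recorded call: which instruction-processing helper ran, and with what operands.
struct TraceCall { std::string helper_name; std::vector<uint64_t> operands; };

struct Trace8051 {
    std::vector<TraceCall> calls;   // becomes the body of the trace's main function

    // Invoked by the extended interpreter each time it dispatches an instruction.
    void record(const std::string& helper, std::vector<uint64_t> ops) {
        calls.push_back({helper, std::move(ops)});
    }
};
```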

6 Evaluation

In this section, we present the results of evaluating crete on GNU Coreutils [38] and the TianoCore utility programs for UEFI BIOS [39]. The evaluations demonstrate that crete generates test cases that are as effective in achieving high code coverage as those of state-of-the-art tools for automated test case generation, and that it can detect serious, deeply embedded bugs.

6.1 GNU Coreutils

Experiment Setup. GNU Coreutils is a package of utilities widely used in Unix-like systems. The 87 programs from Coreutils (version 6.10) contain 20,559 lines of code, 988 functions, and 14,450 branches according to lcov [40]. The program sizes range from 18 to 1,475 lines, from 2 to 120 functions, and from 6 to 1,272 branches. Coreutils is an often-used benchmark for evaluating automated program analysis systems, including klee, MergePoint, and others [5, 12, 41], which is why we chose it as the benchmark for comparing with klee and angr.

crete and angr generate test cases from program binaries without debug information, while klee requires program source code. To measure and compare the effectiveness of the test cases generated by the different systems, we reran those tests on binaries compiled with the coverage flag and calculated the code coverage with lcov. Note that we only calculate the coverage of the code in GNU Coreutils itself and do not compute the code coverage of library code.

We adopted the configuration parameters for those programs from klee's experiment instructions. As specified in the instructions, we ran klee on each program for one hour with a memory limit of 1 GB. We increased the memory limit to 8 GB for the experiments with angr, while using the same timeout of one hour. crete utilizes a different timeout strategy, defined as no new instructions being covered within a given time bound; we set this timeout for crete to 15 min in this experiment. The same timeout strategy was also used by DASE [41] for its evaluation on Coreutils. We conducted our experiments on a desktop with an Intel Core i7-3770 3.40 GHz CPU and 16 GB memory running 64-bit Ubuntu 14.04.5. We built klee from its release v1.3.0 with llvm 3.4, released on November 30, 2016. We built angr from its mainstream on GitHub at revision e7df250, committed on October 11, 2017. crete uses Ubuntu 12.04.5 as the guest OS for its vm frontend in our experiments.

Table 1. Comparison of overall and median coverage by klee, angr, and crete on Coreutils.
Table 2. Distribution comparison of coverage achieved by klee, angr, and crete on Coreutils.

Comparison with klee and angr. As shown in Table 1, our experiments demonstrate that crete achieves test coverage comparable to klee and generally outperforms angr. The major advantage of klee over crete is that it works on source code with all semantic information available. When the program size is small, symbolic execution is capable of exploring all feasible paths with the given resources, such as time and memory. This is why klee achieves very high code coverage, such as line coverage over 90%, on more programs than crete, as shown in Table 2. However, klee needs to maintain execution states for all paths being explored at once, and this limitation becomes more pronounced as program size grows. Moreover, klee analyzes programs within its own virtual environment using simplified models of the real execution environment. These models sometimes benefit klee by reducing the complexity of the tbp, and sometimes hurt it by introducing environment inaccuracies. This is why crete gradually catches up overall, as shown in Table 2. Specifically, crete gets higher line coverage on 33 programs, lower on 31 programs, and the same on the other 23 programs. Figure 4(a) shows the coverage differences of crete over klee on all 87 Coreutils programs. Note that our coverage results for klee differ from those reported in klee's paper. As discussed and reported in previous works [12, 41], the coverage differences are mainly due to major code changes in klee, an architecture change from 32-bit to 64-bit, and whether manual system call failures are introduced.

Fig. 4. Line coverage difference on Coreutils by crete over klee and angr: positive values mean crete is better, and negative values mean crete is worse.

angr shares with klee the limitation of maintaining multiple states and providing models of the execution environment, while it shares with crete the disadvantage of having no access to semantic information. Moreover, angr provides models of the environment at the machine level to support various platforms, which is more challenging than klee's source-level models. In addition, we found and reported several crashes of angr during this evaluation, which also affects angr's results. This is why angr performs worse than both klee and crete in this experiment. Figure 4(b) shows the coverage differences of crete over angr on all 87 Coreutils programs. While crete outperformed angr on the majority of the programs, there is one program, printf, on which angr achieved over 40% better line coverage than crete, as shown in the leftmost column in Fig. 4(b). We found that the reason is that printf uses many string routines from libc to parse its inputs, and angr provides effective models for those string routines. Similarly, klee works much better on printf than crete.

Coverage Improvement over Seed Test Case. Since crete is a concolic testing framework, it needs an initial seed test case to start testing a tbp. The goal of this experiment is to show that crete can significantly increase the coverage achieved by the seed test case that the user provides. To demonstrate the effectiveness of crete, we set the non-file arguments, the content of the input file, and stdin to zeros as the seed test case. Of course, well-crafted test cases from users would be more meaningful and effective as initial test cases. Figure 5 shows the coverage improvement for each program. On average, the initial seed test case covers 17.61% of lines, 29.55% of functions, and 11.11% of branches. crete improves line coverage by 56.71%, function coverage by 53.44%, and branch coverage by 52.14%. The overall coverage improvement on all 87 Coreutils programs is significant.

Fig. 5. Coverage improvement over seed test case by crete on GNU Coreutils

Bug Detection. In our experiment on Coreutils, crete was able to detect all three bugs in mkdir, mkfifo, and mknod that were detected by klee. This demonstrates that crete does not sacrifice bug detection capability while working directly on binaries without debug or high-level semantic information.

6.2 TianoCore Utilities

Experiment Setup. The TianoCore utility programs are part of the open-source project EDK2 [42], a cross-platform firmware development environment from Intel. It includes 16 command-line programs used to build BIOS images. The TianoCore utility programs we evaluated are from its mainstream on Github at revision 75ce7ef, committed on April 19, 2017. According to lcov, the 16 TianoCore utility programs contain 8,086 lines of code, 209 functions, and 4,404 branches. Note that we only calculate the coverage of the code of the TianoCore utility programs themselves and do not compute the coverage of libraries.

The configuration parameters we used for these utility programs are based on our rough high-level understanding of the programs from their user manuals. We assigned each program one long argument of 16 bytes and four short arguments of 2 bytes, along with a file of 10 kilobytes. We conducted our experiments on the same platform with the same host and guest OS as in the Coreutils evaluation, and also set the timeout to 15 min for each program.

High Coverage Test Generation From Scratch. For all the arguments and file contents in the parameter configuration, we set the initial values to binary zeros to serve as the seed test case for crete. Figure 6 shows that crete delivered high code coverage, above 80% line coverage, on 9 out of the 16 programs. On average, the initial seed test case covers 14.56% of lines, 28.71% of functions, and 12.38% of branches. crete improves line coverage by 43.61%, function coverage by 41.63%, and branch coverage by 44.63%. Some programs got lower coverage because of: (1) inadequate configuration parameters; (2) error-handling code that is triggered only by failed system calls; (3) symbolic indices for arrays and files that are not well handled by crete.

Fig. 6. Coverage improvement over seed test case by crete on TianoCore utilities

Bug Detection. To further demonstrate crete's capability in detecting deeply embedded bugs, we performed a set of evaluations with crete on the TianoCore utility programs focusing on concolic files. From the build process of a tutorial image, OvmfPkg, from EDK2, we extracted 509 invocations of TianoCore utility programs and the corresponding intermediate files generated, among which 37 unique invocations cover 6 different programs. Taking the parameter configurations from those 37 invocations and using their files as seed files, we ran crete with a timeout of 2 h on each setup, in which only the files are made symbolic.

Table 3. Classified crashes found by crete on TianoCore utilities: 84 unique crashes from 8 programs

Combining the experiments on concolic arguments and concolic files, crete found 84 distinct crashes (by stack hash) in eight TianoCore utility programs. We used a GDB extension [43] to classify the crashes, which is a popular way of classifying crashes among AFL users [44]. Table 3 shows that crete found various kinds of crashes, including many exploitable ones, such as stack corruption, heap errors, and write access violations. Eight crashes were found with concolic arguments, while the other 76 were found with concolic files. We reported all of these crashes to the TianoCore development team. So far, most of the crashes have been confirmed as real bugs, and ten of them have been fixed.

We now elaborate on a few sample crashes to demonstrate that the bugs found by crete are significant. VfrCompile crashed with a segmentation fault due to stack corruption when the input file name is malformed, e.g., as generated by crete. This bug is essentially a format string exploit. VfrCompile uses the function vsprintf() to compose a new string from a format string and stores it in a local array of fixed size. When the format string is malicious, vsprintf() keeps reading from the stack and the local buffer is overflowed, hence causing stack corruption. Note that crete generated a well-formed prefix for the input, which is required to pass a preprocessing check in VfrCompile, so that the malicious format string can reach the vulnerable code.
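
The snippet below is a minimal illustration of this class of bug, not the actual VfrCompile code: an attacker-controlled string used directly as the format argument of vsprintf() with a fixed-size stack buffer lets format directives pull extra data off the stack and overflow the buffer.

```cpp
#include <cstdarg>
#include <cstdio>

// A message-formatting helper of the kind involved in such crashes: the caller's
// format string goes straight into vsprintf() with a fixed-size stack buffer.
static void report(const char* fmt, ...) {
    char buf[64];                     // fixed-size local array
    va_list ap;
    va_start(ap, fmt);
    std::vsprintf(buf, fmt, ap);      // unbounded write: a malicious fmt overflows buf
    va_end(ap);
    std::puts(buf);
}

int main() {
    const char* file_name = "input.vfr";
    report("processing %s", file_name);   // safe: file name passed as data
    // report(file_name);                 // unsafe if file_name contains format directives
    return 0;
}
```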

crete also exposed several heap errors in GenFw by generating malformed input files. GenFw is used to generate a firmware image from an input file. The input file needs to follow a very precise file format, because GenFw checks the signature bytes to decide the input file type, uses complex nested structs to parse different sections of the file, and conducts many checks to ensure the input file is well-formed. Starting from a seed file of 223 kilobytes extracted from EDK2's build process, crete automatically mutated 29 bytes in the file header. The mutated bytes introduced a particular combination of file signature and of sizes and offsets of different sections of the file. This combination passed all checks on the file format and directed GenFw to a vulnerable function that mistakenly replaces the buffer already allocated for storing the input file with a much smaller buffer. Subsequent accesses to this undersized buffer caused overflows and heap corruption.

7 Conclusions and Future Work

In this paper, we have presented crete, a versatile binary-level concolic testing framework designed around an open and highly extensible architecture that allows easy integration of concrete execution frontends and symbolic execution backends. At the core of this architecture is a standardized format for binary-level execution traces, which is llvm-based, self-contained, and composable. Standardized execution traces are captured by concrete execution frontends and provide succinct and sufficient information for symbolic execution backends to reproduce the concrete executions. We have implemented crete with klee as the symbolic execution engine and multiple concrete execution frontends such as qemu and the 8051 emulator. The evaluation on Coreutils programs shows that crete achieved code coverage comparable to klee analyzing the Coreutils source code directly and generally outperformed angr. The evaluation on TianoCore utility programs found numerous exploitable bugs.

We are assembling a suite of 8051 binaries for evaluating crete and will report the results in the near future. Also as future work, we will develop new crete tracing plugins, e.g., for concrete execution on physical machines based on PIN. With these new plugins, we will focus on synthesizing abstract system-level traces from trace segments captured from binaries executing on various platforms. Another technical challenge that we plan to address is how to handle symbolic indices for arrays and files, so code coverage can be further improved.