
1 Introduction

Rust’s emphasis on memory safety is well recognized: its rigorous ownership-based type system eliminates numerous memory-safety issues at compile time. As a result, critical systems [27, 32] are increasingly being built from the ground up in Rust. However, while Rust has made substantial strides in enhancing memory safety, it still permits developers to write unsafe code, which undermines its safety guarantees while providing powerful but risky capabilities [31]. Recent studies [13, 24, 33] have highlighted that unsafe code remains the predominant source of memory-safety problems in Rust.

Bounded Model Checking (BMC) [15, 16, 25] is a widely used technique for verifying memory-safety properties in unsafe Rust. It encodes program traces as symbolic SAT/SMT problems and employs solvers to provide bounded proofs. However, BMC has limitations: it explores program behavior only up to a fixed execution depth, which requires setting bounds on loop iterations. Bounds that are too small may result in incomplete unwinding and miss genuine bugs, while overly large bounds can exhaust memory and terminate the checker. Additionally, BMC struggles with complex code, especially paths through intricate functions, as the generated formulas can become too complex for solvers to handle in practice. These challenges collectively impede BMC’s utility in real-world code verification.

In this paper, we present UnsafeCop, which utilizes and enhances Kani [15], a bounded model checker, to verify memory safety in real-world unsafe Rust code. Our approach identifies functions that execute unsafe code and generates proof harnesses from test cases. Abstract interpretation determines loop bounds for BMC, and we stub loops with large bounds while retaining their essential safety checks to ensure soundness. Model checking follows a scheduling strategy that prioritizes complex, frequently invoked functions and substitutes stubs for already verified ones, increasing verification efficiency. We validated UnsafeCop’s effectiveness through a case study on the TECC (Trusted-Environment-based Cryptographic Computing) framework, a proprietary software system by Ant Group that comprises 30,174 lines of Rust code, including 3,019 lines of unsafe Rust. TECC integrates trusted computing with secure multi-party computation techniques to provide a secure, reliable, and high-performance computing environment for large-scale data applications. This case study served as a robust test environment for assessing UnsafeCop’s memory-safety capabilities.

In summary, this paper makes the following three main contributions:

  • A practical BMC approach for detecting memory-safety issues in Rust programs, which includes harness design, loop bound inference, stubbing complex loops, and optimizing function verification order by utilizing function stubbing for improved performance.

  • The evaluation of UnsafeCop on a real-world project comprising 30,174 lines of Rust code, with 3,019 lines being written in unsafe Rust.

  • Insights and lessons learned from verifying real-world Rust programs, along with suggestions for improving current model-checking tools such as Kani [15], particularly for Rust.

2 Related Work

Program analyses [7, 17, 18, 22, 23] can identify memory-safety bugs in Rust code but do not guarantee soundness. Formal methods, including theorem proving [14, 20, 21, 30] and deductive verification [2, 8], are used to verify functional properties of safe Rust. Techniques like abstract interpretation [10, 22], symbolic execution [19, 26, 28], and bounded model checking (BMC) [15, 16, 25] focus on ensuring memory-safety properties. BMC stands out by encoding program traces into symbolic SAT/SMT problems for solver-based automatic verification. Addressing challenges such as choosing appropriate loop bounds and managing path explosion is essential for BMC to be practically applicable to large-scale codebases.

3 Overview

We introduce UnsafeCop, a method for verifying memory safety in real-world unsafe Rust code. Our approach, depicted in Fig. 1, starts by locating all unsafe code in the project and identifying functions exposed to other crates that execute unsafe code. Test cases for these exposed functions are transformed into proof harnesses. We infer loop bounds for accurate program modeling and apply loop stubbing with memory-safety checks for loops with large bounds. Using Kani [15], a bit-precise bounded model checker for Rust, we perform model checking on TECC. Our strategy optimizes the function verification order, prioritizing high-complexity, frequently invoked functions, and then stubs verified functions to check the rest. Kani generates counterexamples for any detected memory issues, which are iteratively fixed. Our focus is on memory safety in user-provided unsafe code, excluding unsafe code from the Rust standard library and third-party crates.

Fig. 1. Architecture of UnsafeCop.

3.1 Verification Scope and Proof Harness

In the verification model detailed in Def-1 below, the scope of verification, designated as TF, includes the set of functions necessitating proof harnesses. This set encompasses public functions that can reach user-provided unsafe code, identified as USF, which includes unsafe blocks, unsafe functions, and interfaces using the Foreign Function Interface (FFI). A function f is classified as public to other crates if \( isPublic (f) = true \). Furthermore, \( Reach (f, usf )\) assesses whether there is a program path along which f reaches the unsafe code \( usf \), determined via reachability analysis on the program’s interprocedural Control Flow Graph (iCFG), as is standard.

$$TF = \{\, f \mid isPublic (f) \wedge \exists\, usf \in USF:\ Reach (f, usf ) \,\}$$
(Def-1)

To ensure memory safety in unsafe Rust, it is essential to create and verify harnesses for functions within the set TF. The process of developing a harness for a target public function is carried out in three stages.

First, the harness must encompass all possible calling scenarios of the target public function. This is effectively achieved by leveraging existing test cases, including both integration and unit tests available within the project. These tests, carefully crafted by developers, not only demonstrate how the target public function is used but also ensure necessary initializations and data setups are performed before the function is invoked. If the target public function lacks sufficient tests, we collaborate closely with developers to design a harness that accurately reflects real-world invoking scenarios.

Additionally, if the target public function uses generic types, either in its parameters or within the function itself, which can significantly alter control flow, we collaborate with developers to identify all appropriate concrete types. It is crucial for the harness to explore all possible instantiations of these generics to capture varied control flow paths, ensuring thorough testing and verification.

Finally, it is essential for the harnesses to ensure thorough code coverage. This coverage might change due to code adjustments when memory safety issues are identified during verification. After resolving all detected issues, if code coverage is still found to be insufficient, collaborating with developers to adjust the range of values for symbolic variables may be necessary to achieve more comprehensive coverage. A harness is deemed correctly generated when the associated public function is verified to be free of bugs and achieves sufficient coverage.
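To make the harness-development process concrete, the sketch below shows how a unit test might be turned into a Kani proof harness. The function clamp_index, its test, and the assumed input range are all hypothetical; TECC’s real harnesses follow the same shape but target its public and FFI functions.

```rust
/// Hypothetical target function that executes unsafe code internally.
pub fn clamp_index(data: &[u8], idx: usize) -> u8 {
    let i = if idx < data.len() { idx } else { data.len() - 1 };
    unsafe { *data.get_unchecked(i) }
}

/// Original unit test: exercises one concrete calling scenario and
/// documents the required data setup.
#[test]
fn test_clamp_index() {
    assert_eq!(clamp_index(&[1, 2, 3], 5), 3);
}

/// Proof harness derived from the test: concrete inputs become
/// symbolic values constrained to the agreed real-world input space.
#[cfg(kani)]
#[kani::proof]
fn verify_clamp_index() {
    let data: [u8; 8] = kani::any();
    let idx: usize = kani::any();
    kani::assume(idx <= 100); // range confirmed with developers
    clamp_index(&data, idx);
}
```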

3.2 Loop Bound Inference

BMC offers bounded proofs, ensuring that within given loop bounds, the program satisfies certain properties. However, if the loop bound is too small, it may lead to semantic deviations from the actual program, risking missed memory safety bugs. Setting an appropriate loop bound is key for maintaining verification soundness and detecting memory safety issues.

Fig. 2. Loop bound inference (color figure online). The first code snippet demonstrates a loop with the constant bound T::WIDTH; in the second, the values of nr_rnd and rm_bit are derived from other constants; in the third, the loop counter is determined through widening within the interval domain.

We utilize abstract interpretation [6] to deduce loop bounds. Observations indicate that many loops, like those in the first two code snippets of Fig. 2, have identifiable patterns (i.e., clear signatures). In the first example, the loop’s upper bound is a literal constant directly. In the second example, the loop counters rely on variable values derived from constants or intervals. Here, we calculate each loop counter’s interval using the interval domain [5] in abstract interpretation, with the interval’s upper bound serving as the loop bound.
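The snippets of Fig. 2 are not reproduced here; the fragment below illustrates the same three loop shapes. The names WIDTH, nr_rnd, and rm_bit echo the caption, while the loop bodies are invented for illustration.

```rust
const WIDTH: usize = 64;

fn loop_patterns(bits: &[u64; WIDTH], n: usize) -> u64 {
    let mut acc = 0;

    // (1) Constant bound: the literal WIDTH is the loop bound directly.
    for i in 0..WIDTH {
        acc ^= bits[i];
    }

    // (2) Bound derived from constants: interval analysis evaluates
    // nr_rnd and rm_bit from constants and takes the upper bound of
    // their sum's interval as the loop bound.
    let nr_rnd = WIDTH / 8;
    let rm_bit = WIDTH % 8;
    for r in 0..nr_rnd + rm_bit {
        acc = acc.rotate_left(r as u32);
    }

    // (3) No clear signature: the exit condition depends on data (n
    // and acc), so the bound must be inferred by widening instead.
    let mut i = 0;
    while i < n && acc & 1 == 0 {
        acc >>= 1;
        i += 1;
    }
    acc
}
```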

Table 1. Intervals of the loop index i for the third code snippet in Fig. 2 at each iteration, with the fixed-point algorithm set to iterate four times.

For loops lacking such clear signatures, as in the third code snippet, we infer loop bounds using widening in the interval domain, driving the analysis to a fixed point. We set a predefined number of iterations for the abstract interpretation’s fixed-point algorithm and monitor the intervals of accessed variables as the loop progresses. Upon completing the final iteration, these variables are widened to a fixed-point state, and the upper bound of the widened interval for a loop index variable becomes the loop bound. Table 1 shows the loop index i’s intervals at each iteration, with the algorithm iterating four times, resulting in a final loop bound of 7.
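A minimal sketch of the interval domain with widening is given below. The widening operator shown is the textbook one [5, 6]; the paper’s analysis runs on MIR, and exactly how the finite bound 7 arises from Table 1 is approximated here by intersecting the widened interval with a hypothetical loop guard.

```rust
/// Integer intervals with join and the textbook widening operator.
/// This is a simplified model; the paper's analysis runs on MIR.
#[derive(Clone, Copy, Debug)]
struct Interval { lo: i64, hi: i64 } // i64::MIN / i64::MAX model +-infinity

impl Interval {
    /// Join: the smallest interval containing both operands.
    fn join(self, o: Interval) -> Interval {
        Interval { lo: self.lo.min(o.lo), hi: self.hi.max(o.hi) }
    }
    /// Widening: any bound still growing jumps to infinity, forcing
    /// the fixed-point iteration to terminate.
    fn widen(self, next: Interval) -> Interval {
        Interval {
            lo: if next.lo < self.lo { i64::MIN } else { self.lo },
            hi: if next.hi > self.hi { i64::MAX } else { self.hi },
        }
    }
    /// Meet with a loop guard such as i <= k, recovering a finite bound.
    fn meet_upper(self, k: i64) -> Interval {
        Interval { lo: self.lo, hi: self.hi.min(k) }
    }
}

fn main() {
    // Loop index i: starts at 0 and grows each abstract iteration.
    let mut i = Interval { lo: 0, hi: 0 };
    for _ in 0..3 {
        let next = Interval { lo: i.lo + 1, hi: i.hi + 2 };
        i = i.join(next); // iterations 1..3 of the fixed-point loop
    }
    let next = Interval { lo: i.lo + 1, hi: i.hi + 2 };
    i = i.widen(next); // 4th iteration applies widening: [0, +inf]
    // Intersecting with a hypothetical guard i < 8 yields [0, 7],
    // i.e., a loop bound of 7 as in Table 1.
    println!("{:?}", i.meet_upper(7));
}
```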

3.3 Loop Stubbing

Some loops yield excessively large bounds from widening, making complete unwinding infeasible for BMC and risking out-of-memory failures when generating verification conditions. To address this, we apply loop stubbing to rewrite loops whose bounds are infeasible to unwind.

Fig. 3. Loop stubbing (top: original code; bottom: loop-stubbed code).

We present loop stubbing guidelines using the example depicted in Fig. 3:

Iterator-like Variables. For iterator-like variables, such as loop index variables and references to elements in array-like structures, we substitute their values with symbolic intervals. As demonstrated in Fig. 3, the iterator i accesses vectors like a.0 and a.1, with their sizes represented by the interval [0, a.size()). Additionally, new references like rx and rxu are introduced to symbolize the iterators of accessed vectors, such as &a.0[i], &a.1[i].

Memory-Safety Properties. For array-like structures, we use assert for access validation, as demonstrated in Fig. 3. For example, the assertion \(\texttt{assert!(a.size()==b.size() \&\& b.size()==c.size())}\) ensures that accesses to the vectors remain within their boundaries. Based on the new references rx, rxu, ry, ryu, rr, and rru introduced above, the statements *r = *x + *y and *ru = *xu + *yu are replaced with r = x + y and ru = xu + yu, respectively. Dereference and arithmetic-overflow checks are maintained.

Side-Effect Over-Approximation. Side effects arise from write operations inside a loop affecting variables declared outside it. To address this, we over-approximate these variables by assigning them appropriate intervals. For example, the elements in c.0 and c.1, which are overwritten within the loop in Fig. 3, can assume any value of type T. With rr and rru being iterator-like symbolic variables, the last two assignments result in the specified over-approximation.
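Since Fig. 3 is not reproduced here, the sketch below reconstructs the pattern from the description above: Pair is a hypothetical stand-in for TECC’s paired-vector type, and the stub follows the three guidelines. The names (i, rx, rxu, and so on) mirror the text, but the exact TECC code differs.

```rust
/// Hypothetical stand-in for TECC's paired-vector type.
struct Pair(Vec<u64>, Vec<u64>);
impl Pair { fn size(&self) -> usize { self.0.len() } }

/// Original loop: element-wise addition over paired vectors.
fn add(a: &Pair, b: &Pair, c: &mut Pair) {
    for i in 0..a.size() {
        let (x, xu) = (&a.0[i], &a.1[i]);
        let (y, yu) = (&b.0[i], &b.1[i]);
        c.0[i] = *x + *y;   // *r = *x + *y
        c.1[i] = *xu + *yu; // *ru = *xu + *yu
    }
}

/// Loop-stubbed version: one symbolic iteration stands in for all.
#[cfg(kani)]
fn add_stubbed(a: &Pair, b: &Pair, c: &mut Pair) {
    // Memory-safety property: all indexed accesses stay in bounds.
    assert!(a.size() == b.size() && b.size() == c.size());
    if a.size() == 0 { return; }

    // Iterator-like variables become symbolic over [0, a.size()).
    let i: usize = kani::any();
    kani::assume(i < a.size());
    let (rx, rxu) = (a.0[i], a.1[i]); // stand-ins for &a.0[i], &a.1[i]
    let (ry, ryu) = (b.0[i], b.1[i]);

    // r = x + y keeps the arithmetic-overflow check of *r = *x + *y.
    let r = rx + ry;
    let ru = rxu + ryu;

    // Side-effect over-approximation: elements written by the loop
    // may hold any value of their type afterwards.
    c.0[i] = kani::any();
    c.1[i] = kani::any();
    let _ = (r, ru);
}
```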

When a loop contains function calls, we assess the callee functions for complexity and call frequency. Functions that are complex and frequently called are verified first according to our scheduling strategy and replaced with stubs in subsequent analyses. Simpler functions, on the other hand, undergo standard analysis. We will provide more details on this approach in Sect. 3.4.

In loop stubbing, we preserve memory-safety checks and over-approximate loops’ external side effects to prevent BMC from stalling, enabling thorough analysis. Over-approximation is automated, whereas safety check preservation is manual, ensuring no memory safety issues are missed and maintaining soundness.

3.4 Scheduling Strategy

The time spent on constraint solving is significant. The order in which functions are verified affects both the quantity and complexity of the generated verification conditions. This, in turn, influences the duration of constraint solving and the overall performance of the verification process.

We use Def-2 to denote a specific path analyzed when verifying a harness:

$$VerifyPath _i = t \rightarrow f_1 \rightarrow \dots \rightarrow f_n \rightarrow usf _i, \quad \text{where}\ t \in TF\ \text{and}\ usf _i \in USF$$
(Def-2)

where t represents a target public function undergoing verification, \( usf _i\) refers to unsafe code reachable from t, and \(f_1, \dots, f_n\) are n additional functions executed along the analyzed path. For recursion, a bounded depth of 1 is used. To ensure the absence of memory safety issues, all paths starting from a harness must be explored. Verifying every path individually is straightforward but inefficient: when a specific function f, particularly one with complex logic, recurs across multiple paths, it is analyzed multiple times unnecessarily, significantly increasing the time spent on constraint solving for the program’s overall verification.

We employ a scheduling strategy to optimize the order of function verification. For each path outlined in Def-2, if a function f is frequently invoked and sufficiently complex, we include it in a set, \(F_{sche}\), as defined in Def-3 below:

$$F_{sche} = \{\, f \mid Invk (f) \times Cmplx (f) > T\_ invk\_cmplx \,\}$$
(Def-3)

The “verification complexity” of f is determined by multiplying its invocation frequency, denoted \( Invk (f)\), by its computational complexity, \( Cmplx (f)\), and comparing the product against a predefined threshold, \(T\_ invk\_cmplx \).

\( Invk (f)\) represents the in-degree of f in the program’s iCFG. To determine \( Cmplx (f)\), we take into account both the Halstead effort [11] and the cyclomatic complexity [9]. Halstead effort captures understanding difficulty, accounting for code length, operators, and operands, while cyclomatic complexity addresses control-flow complexity. We utilize rust-code-analysis [1] to calculate both metrics, and their product serves as the indicator of a function’s complexity.
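A sketch of the resulting selection is shown below, using the harmonic mean of the \( Invk (f) \times Cmplx (f)\) products as \(T\_ invk\_cmplx \), as done for TECC in Sect. 4.2. Metric extraction via rust-code-analysis is elided; the struct and its fields are illustrative.

```rust
/// Per-function metrics (illustrative; obtained in practice from the
/// iCFG and rust-code-analysis).
struct FnMetrics {
    name: &'static str,
    invk: f64,  // Invk(f): in-degree in the iCFG
    cmplx: f64, // Cmplx(f): Halstead effort x cyclomatic complexity
}

/// Def-3: select F_sche, thresholding by the harmonic mean of the
/// Invk(f) * Cmplx(f) products.
fn select_f_sche(funcs: &[FnMetrics]) -> Vec<&'static str> {
    let n = funcs.len() as f64;
    let t_invk_cmplx =
        n / funcs.iter().map(|f| 1.0 / (f.invk * f.cmplx)).sum::<f64>();
    funcs.iter()
        .filter(|f| f.invk * f.cmplx > t_invk_cmplx)
        .map(|f| f.name)
        .collect()
}
```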

To execute this verification order, we first verify the functions in \(F_{sche}\). Then, to verify the remaining functions, we substitute the original functions at their callsites with their respective stubs. Each stub is an over-approximation of its function, computed in a manner akin to a loop stub, as detailed in Sect. 3.3.

3.5 Memory-Safety Verification

We use Kani-0.22.0 [15], compiled with the CaDiCaL solver [4], for model checking, with the “--memory-safety-checks” option enabled. We have not set a timeout for the solver. Loop unwinding information is provided using the “--cbmc-args --unwindset \(L_1:B_1\),\(L_2:B_2\),...” option, where \(B_i\) indicates the inferred loop bound for loop \(L_i\). In alignment with our scheduling strategy, the functions in \(F_{sche}\) were given verification priority. For the verification of the remaining functions, kani::stub was utilized to stub the already verified functions in \(F_{sche}\).
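Put together, a harness for a function that calls an already verified member of \(F_{sche}\) might look as follows. The functions are hypothetical, and in Kani 0.22 stubbing was an unstable feature, so the exact attributes and flags may differ.

```rust
fn xor_assign(acc: &mut u64, v: u64) {
    *acc ^= v; // stands in for a complex, frequently invoked function
}

#[cfg(kani)]
fn xor_assign_stub(acc: &mut u64, _v: u64) {
    *acc = kani::any(); // over-approximates the verified function
}

fn caller(v: u64) -> u64 {
    let mut acc = 0;
    for _ in 0..4 {
        xor_assign(&mut acc, v);
    }
    acc
}

/// Harness: unwind with the inferred loop bound and replace the
/// already verified callee with its over-approximating stub.
#[cfg(kani)]
#[kani::proof]
#[kani::unwind(5)] // inferred bound 4, plus one for the exit check
#[kani::stub(xor_assign, xor_assign_stub)]
fn verify_caller() {
    caller(kani::any());
}
```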

4 Evaluation

TECC, short for Trusted-Environment-based Cryptographic Computing, is a proprietary framework developed by Ant Group. It integrates trusted computing with multi-party computation techniques, aiming to provide a secure, reliable, and high-performance computing environment for large data applications.

TECC consists of 3,060 lines of C and 30,174 lines of Rust, including 3,019 lines of unsafe Rust as shown in Fig. 4(a). Rust handles critical tensor computations, while C oversees computation algorithms that interface with Rust via FFI. The unsafe Rust includes 68 unsafe blocks across 27 functions (351 lines), 6 unsafe functions (106 lines), and 96 FFI functions (2,562 lines).

To demonstrate UnsafeCop’s capability in identifying memory-safety issues, we applied it to verify TECC in a major case study. We dedicated approximately 115 person-hours to verifying 7,118 lines of Rust code, comprising 3,019 lines of unsafe code and 4,099 lines of safe code. The percentages of these lines relative to the total Rust code (excluding the C code) are shown in Fig. 4(b).

Fig. 4. Percentage breakdown of implemented and verified code w.r.t. memory safety.

We employed Kani-0.22.0 [15] for Bounded Model Checking (BMC) on our TECC project. Kani serves as a backend for the Rust compiler and utilizes the C Bounded Model Checker (CBMC) [16] as its verification engine. It specifically targets Rust’s Mid-level Intermediate Representation (MIR), on which we also performed our interval analysis. Before initiating verification of the Rust code, we resolved undefined behaviors and memory safety issues in the C code using the TrustInSoft Analyzer [29]. Additionally, all C functions callable from Rust were stubbed to over-approximate their side effects.

The evaluation was conducted using an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 16 cores and 128GB RAM, running Ubuntu 18.04.6 LTS.

4.1 Harness Design

Below, we outline the harness-design process used in verifying TECC.

Verification Scope. In TECC, 243 public functions have the capability to access unsafe code. Of these, developers confirmed that 110 are accessible to other crates. Among these 110, 96 have corresponding FFI wrappers, as shown in Fig. 5. Both the FFI functions and the public functions they wrap receive the same inputs; however, the FFI functions are more prone to memory safety issues as they directly receive data from C code. Consequently, we focused on developing specific harnesses for these FFI functions, as depicted in Fig. 6.

Fig. 5. Public function lt_zero_less_turn<A64, B64> and its FFI wrapper ffi_lt_zero_less_turn_A64_B64, with both sharing the same inputs.

Fig. 6. Harness development for the FFI function ffi_lt_zero_less_turn_A64_B64 given in Fig. 5, rather than its non-FFI version lt_zero_less_turn<A64, B64>, as FFI functions are more prone to memory-safety problems.

Harness Writing. TECC features 57 integration and unit tests, covering 89 of the 110 public functions that require verification harnesses. We converted these existing tests into harnesses and developed 21 new ones to achieve full coverage of all necessary functions. The creation of these 78 harnesses consumed 20% of the total verification effort, translating to approximately 23 person-hours.

During the harness development process, we collaborated with TECC developers to remove 14 public functions with access to unsafe code from the codebase. These functions were deemed unlikely to be called externally, thus reducing the potential attack surface associated with unsafe code.

Coverage Statistics. For each public function harness in TECC, we considered all potential calling scenarios, covering all possible values for symbolic variables to ensure thorough code coverage. On average, two rounds of discussions with developers were conducted for each harness to confirm the calling scenarios and appropriate ranges for symbolic variables. Ultimately, we achieved approximately 95% statement coverage for all verified functions, resolving 39 identified memory-safety issues through 19 rounds of discussions with developers. Throughout the verification process, no adjustments to the value ranges of symbolic variables were needed to expand code coverage.

Vacuity Checks. The correctness of each harness was ensured through consultations with developers and complemented by automatic vacuity checks to confirm property reachability using Kani. Properties identified as vacuous were marked as UNREACHABLE in Kani’s output. We specifically focused on these properties, which represented less than 0.3% of all verified properties. Our review confirmed that there were no possible traces that could reach these properties given the determined data setups in our harnesses.

4.2 Improvements on Verification Efficiency

We demonstrate the improvements achieved by applying loop stubbing combined with our scheduling strategy that also incorporates function stubbing.

Loop Stubbing. We illustrate the performance improvement achieved through loop stubbing using the TECC function add, as previously shown in Fig. 3. According to developers, the practical loop bound a.size() can reach 100 million. Without loop stubbing, we set this loop bound and attempted verification, resulting in Kani spending hours in the pre-processing step before being terminated due to running out of memory. In contrast, after applying loop stubbing, Kani successfully produced the desired verification result in 6.75 h.

Scheduling Strategy. Fig. 7 illustrates the performance advantages of our scheduling strategy, introduced in Sect. 3.4, denoted as Inter-Proc. This strategy is compared to two simpler approaches, Non-Stubbing and Intra-Proc, for four functions within TECC. Non-Stubbing involves the individual analysis of all functions without the use of stubs to prevent re-analysis of callee functions. Intra-Proc conducts individual function analysis but utilizes stubs to avoid redundant analysis of the same callee function called from within the same function.

Compared to Non-Stubbing, Intra-Proc reduces the average verification time by 51.76% for the four functions: chebyshev, sqrt, log2, and cos. In the case of chebyshev, Inter-Proc and Intra-Proc yield the same performance. On average, Inter-Proc improves performance over Non-Stubbing by 70.75%. When considering only sqrt, log2, and cos, the average performance gain is 78.28%.

Fig. 7. Verification times of Non-Stubbing, Intra-Proc, and Inter-Proc for four functions selected from TECC (with Non-Stubbing and Intra-Proc defined in Sect. 4.2).

When expanding our evaluation to the entire TECC codebase, it took around 115 person-hours to verify, with Inter-Proc, the 110 public functions that can access unsafe Rust code. During this process, we addressed and fixed the 39 reported bugs (detailed in Sect. 4.3). In contrast, we estimate that Non-Stubbing would require around 437 person-hours to accomplish the same verification task. To assess the performance improvement of Inter-Proc over Non-Stubbing, we estimate the verification times of both scheduling strategies as follows:

$$T_{ Non\text{-} Stubbing } = \sum _{f \in F_{ exam }}{ Invk (f) \times Cmplx (f)}$$
(Def-4)
$$T_{ Inter\text{-} Proc } = \sum _{f\in F_{ sche }}{1 \times Cmplx (f)} + \sum _{f\in F_{ exam }-F_{ sche }}{ Invk (f) \times Cmplx (f)}$$
(Def-5)

where \(F_{ exam }\) denotes the set of functions examined by BMC. For the functions in \(F_{ sche }\), their invocation times are assumed to be 1, as each is verified once.

Based on our observations, some frequently called functions have low verification complexity, while others with complex logic are invoked infrequently. In such cases, it is reasonable to skip stubbing these functions, as it would not significantly impact the overall verification time.

We computed the product of \( Invk (f)\) and \( Cmplx (f)\) for all functions in \(F_{ exam }\) and utilized their harmonic mean [12] as the threshold \(T\_ invk\_cmplx \) in Def-3 to eliminate functions with extremely low verification complexity. For TECC, we initially had \(|F_{ exam }|=196\). By setting \(T\_ invk\_cmplx = 3360.16\), we obtained \(|F_{ sche }|= 168\) after filtering out 28 functions. The harmonic mean for \(F_{ sche }\) is 20014.82. Ultimately, Inter-Proc is estimated to reduce overall verification time by approximately \(73.71\%\) compared to Non-Stubbing.

4.3 Effectiveness

In the verification of TECC, UnsafeCop detected a total of 39 memory-safety issues, as detailed in Table 2. All of these issues were confirmed and addressed by the developers. Subsequently, UnsafeCop verified their absence, ensuring that the identified memory safety problems had been effectively resolved.

Table 2. Memory safety problems detected in TECC by UnsafeCop.

When performing stubbing to summarize loops and functions, we approximate their side effects using the interval domain. It is worth noting that this approach did not introduce any false positives, as TECC comprises loops and functions that operate on unrestricted values within their respective domains.

UnsafeCop comprises four main components: Harness Design (HD), Loop Bound Inference (LBI), Loop Stubbing (LS), and Inter-Proc Scheduling (IS), where function stubbing is also performed. These elements collectively contribute to practical verification efforts. Table 3 illustrates their roles in identifying memory-safety issues in TECC. HD is instrumental in detecting all bugs, with HD-O representing bugs exclusively identified by HD. Both LS and IS prove highly effective, detecting the majority of bugs and demonstrating their capabilities for in-depth analysis.

Table 3. Contributions of UnsafeCop’s four main components to its overall effectiveness in uncovering memory-safety issues in TECC.

We now examine two case studies to see how UnsafeCop identifies bugs in practice.

Case Study 1. HD+LS. Fig. 8 depicts an arithmetic overflow bug discovered in the function truncate, callable from the public FFI ffi_truncate. In the ffi_truncate harness, the function’s arguments are assigned intervals, with the second representing a slice whose length can vary from 0 to 100 million. The bug occurs when the slice has a size of zero, triggering an arithmetic overflow.

This bug was hidden behind truncate’s initial loop, whose bound was too large to unwind in practice. By stubbing that loop, Kani managed to perform a more thorough analysis that extended beyond the loop’s limits, eventually uncovering the arithmetic overflow.

Fig. 8. Buggy function truncate and its harness.
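The code of Fig. 8 is not reproduced here; the hypothetical fragment below exhibits the same bug pattern, an unsigned subtraction that wraps when the slice is empty. The names and the simplified harness are ours, not TECC’s.

```rust
/// Hypothetical stand-in for the overflow in truncate.
fn last_chunk_len(data: &[u64]) -> usize {
    data.len() - 1 // arithmetic overflow when data.len() == 0
}

#[cfg(kani)]
#[kani::proof]
fn verify_last_chunk_len() {
    let len: usize = kani::any();
    kani::assume(len <= 8); // the real harness allows up to 100 million
    let data = vec![0u64; len];
    last_chunk_len(&data); // Kani reports the overflow at len == 0
}
```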

Case Study 2. HD+IS+LS. Fig. 9 shows an out-of-bounds access bug in the unsafe block of cvt_repr. This function can be called by the public function bit_extr. We created a harness for bit_extr as depicted in the figure.

The functions and, xor, and xor_assign are complex and frequently invoked from within bit_extr. Initially, loop stubbing (LS) was applied to the loop, but Kani still got stuck before it could analyze the buggy function cvt_repr: the multiple calls to xor_assign preceding the invocation of cvt_repr overwhelmed the solver. By applying Inter-Proc Scheduling (IS) to xor, xor_assign, and and, Kani successfully reached the analysis of cvt_repr. The upper bounds for nr and rb are 64 and 1, respectively, which places ix in the range [0, 64), while the length of rx is 1. The out-of-bounds access occurs when ix exceeds the length of rx.

Fig. 9. Buggy function cvt_repr and its harness.
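Again, Fig. 9 is not reproduced; the fragment below is a hypothetical reduction of the bug, in which ix ranges over [0, 64) while rx holds a single element.

```rust
/// Hypothetical reduction of the out-of-bounds access in cvt_repr.
fn cvt_repr_like(rx: &[u64], nr: usize, rb: usize) -> u64 {
    let ix = nr * rb; // nr < 64 and rb <= 1 place ix in [0, 64)
    unsafe { *rx.get_unchecked(ix) } // out of bounds once ix >= rx.len()
}

#[cfg(kani)]
#[kani::proof]
fn verify_cvt_repr_like() {
    let nr: usize = kani::any();
    let rb: usize = kani::any();
    kani::assume(nr < 64 && rb <= 1);
    let rx = [0u64; 1]; // rx has length 1, as in the real harness
    cvt_repr_like(&rx, nr, rb); // Kani flags the unchecked access
}
```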

4.4 Insights and Lessons Learned with Suggestions

We share lessons learned and insights gained from verifying TECC, while also suggesting ways to further improve model-checking tools like Kani for Rust.

Verification Scope. In Rust projects, unsafe code constitutes a small portion of the codebase [3]. To ensure a balance between effort and effectiveness, it is important to define the appropriate scope for verification. In practice, verifying public functions that can reach unsafe code is sufficient since unsafe code is often executed through exposed functions from other crates.

Harness Development. Generating high-quality harnesses is paramount for the verification process. Utilizing integration tests rather than unit tests is advisable, as they offer a more comprehensive picture of how functions are used in real-world scenarios. Collaboration with developers is crucial to establish the appropriate input space for the functions under verification. Comprehensive code coverage serves as a reliable metric for assessing the effectiveness of verification efforts. Additionally, harness development provides an opportunity to review the codebase, allowing unused public functions to be removed to minimize potential attack surfaces to unsafe code.

Loop Stubbing. Loop stubbing is highly effective for large codebases with complex loops. It addresses challenges that BMC encounters when analyzing intricate loops, facilitating in-depth analysis. The key approach is to maintain memory safety checks within loops while approximating external side effects, ensuring thorough memory safety analysis without compromising soundness.

Function Verification Order and Function Stubbing. In real-world codebases like TECC, ordering the verification of complex and frequently-invoked functions, such as chebyshev, can significantly boost BMC efficiency. Our inter-procedural scheduling, which includes function stubbing, cuts TECC’s verification time roughly threefold compared to the non-stubbing alternative.

False Positives. Although over-approximation in loop and function stubbing typically leads to false positives, our verification of TECC did not produce any false positives during the stubbing process. This is because TECC’s loops and functions operate on unrestricted values within their respective domains.

Stubbing Generation. Stubbing in TECC verification applies to both functions and loops. Function stubbing automatically over-approximates side effects using unconstrained values, which is effective given TECC functions operate within specific unconstrained domains. Loop stubbing, which is currently semi-automatic, over-approximates side effects but requires manual effort to ensure completeness of memory safety checks. It could potentially be fully automated. In the context of Rust’s MIR being lowered to LLVM IR during compilation, implementing automatic safety checks could involve developing LLVM passes that identify loop side effects, approximate them with unconstrained domain values, and insert memory safety checks before each loop memory access.

Rust-Specific Memory Safety. Ensuring memory safety in a Rust codebase with unsafe code requires verifying both the unsafe and the impacted safe Rust code. For TECC, we verified an additional 4,099 lines of safe Rust, which is 36% more than the unsafe code. Minimizing the use of unsafe code is crucial to simplify the verification process and reduce memory safety risks.

Feedback from Developers. During the TECC verification process, developers valued our formal verification practice, especially for identifying 39 memory-safety issues that were overlooked by their existing static and dynamic analysis tools. They now plan to incorporate this verification process into their daily development routines to enhance memory safety.

Limitations. We employed Kani-0.22.0 [15] for model checking the Rust codebase. Kani provides interfaces for function stubbing, for assigning value ranges to symbolic variables, and for asserting user properties. However, it could benefit from incorporating function contracts and loop invariants. The kani::stub() mechanism has limitations, especially with trait methods. Additionally, Kani does not yet support all documented undefined behaviors. Moreover, when Kani stalls during loop analysis, output that identifies the loop and its progress would help users determine where stubbing is necessary.

5 Conclusion

We introduce UnsafeCop, an approach for ensuring memory safety in real-world Rust projects with unsafe Rust code. UnsafeCop identifies functions exposed to other crates that execute unsafe code, transforms tests into harnesses, determines loop bounds using abstract interpretation, and applies loop stubbing with memory-safety checks to large-bounded loops. For bounded model checking, we employ Kani, optimizing the verification order of functions and incorporating function stubbing to enhance performance. We tested our approach on TECC, a real-world project combining trusted computing with secure multi-party computation, consisting of 30,174 lines of Rust code, of which 3,019 are unsafe. UnsafeCop identified 39 memory-safety issues, all of which were fixed and verified as resolved, while reducing verification time by an estimated 73.71% compared to the non-stubbing alternative. These results demonstrate the effectiveness of UnsafeCop in improving memory safety in Rust projects.