
1 Introduction

Rust [33] is a modern programming language which features an exciting combination of memory safety and low-level control. In particular, Rust takes inspiration from ownership types to restrict the mutation of shared state. The Rust compiler is able to statically verify the corresponding ownership constraints and consequently guarantee memory and thread safety. This distinctive advantage of provable safety makes Rust a very popular language, and the prospect of migrating legacy codebases in C to Rust is very appealing.

In response to this demand, automated tools translating C code to Rust have emerged from both industry and academia [17, 26, 31]. Among them, the industrial-strength translator C2Rust [26] rewrites C code into Rust syntax while preserving the original semantics. The translation does not synthesise an ownership model and thus can do no more than replicate the unsafe use of pointers in C. Consequently, the Rust code must be labelled with the unsafe keyword, which allows certain actions that are not checked by the compiler. More recent work focuses on reducing this unsafe labelling. In particular, the tool Laertes [17] aims to rewrite the (unsafe) code produced by C2Rust by searching the solution space guided by the type error messages from the Rust compiler. This is impressive, as for the first time proper Rust code beyond a line-by-line direct conversion from the original C source may be synthesised. On the other hand, the limit of the trial-and-error approach is also clear: the system does not support reasoning about the generation process, nor does it create any new understanding of the target code (other than the fact that it compiles successfully).

In this paper, we take a more principled approach by developing a novel ownership analysis of pointers that is efficient (scaling to large programs, e.g. half a million LOC in less than 10 s), sophisticated (handling nested pointers and inductively-defined data structures), and precise (being field and flow sensitive). Our ownership analysis makes a strengthening assumption about the Rust ownership model, which obviates the need for an aliasing analysis. While this assumption excludes a few safe Rust uses (see discussion in Sect. 5), it ensures that the ownership analysis is both scalable and precise, which is subsequently reflected in the overall scalability and precision of the C to Rust translation.

The primary goal of this analysis is of course to facilitate the C to Rust translation. Indeed, as we will see in the rest of the paper, an automated translation system is built to encode the ownership models in the generated Rust code, which is then proven safe by the Rust compiler. However, in contrast to the trial-and-error use of the Rust compiler that is common in existing approaches [17, 31], this analysis approach actually extracts new knowledge about ownership from code, which may lead to other future uses including preventing memory leaks (currently allowed in safe Rust), identifying inherently unsafe code fragments, and so on. Our current contributions are:

  • design a scalable and precise ownership analysis that is able to handle complex inductively-defined data structures and nested pointers. (Section 5)

  • develop a refactoring technique for Rust leveraging ownership analyses to enhance code safety. While in this paper we focus on applying our technique to the translation from C to Rust, it can be used to improve the safety of any unsafe Rust code. (Section 6)

  • implement a prototype tool (Crown, standing for C to Rust OWNership guided translation) that translates C code into Rust with enhanced safety. (Section 7)

  • evaluate Crown with a benchmark suite including commonly used data structure libraries and real-world projects (ranging from 150 to half a million LOC) and compare the result with the state-of-the-art. (Section 8)

2 Background

We start by giving a brief introduction of Rust, in particular its ownership system and the use of pointers, as they are central to memory safety.

2.1 Rust Ownership Model

Ownership in Rust denotes a set of rules that govern how the Rust compiler manages memory [33]. The idea is to associate each value with a unique owner. This feature is useful for memory management. For example, when the owner goes out of scope, the memory allocated for the value can be automatically recycled.

figure d

In the above snippet, the assignment of one variable to another also transfers ownership, after which it is illegal to access the source variable until it is re-assigned a value.
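As a minimal sketch of this move semantics (the variable names v and u and both function names are illustrative and are not taken from the original listing), the following Rust code shows that a value has one owner at a time and that a moved-from variable can be used again only after re-assignment:

```rust
/// Moves a heap-allocated string to a new owner and returns it.
fn move_ownership() -> String {
    let v = String::from("data"); // `v` owns the heap allocation
    let u = v;                    // ownership moves from `v` to `u`
    // println!("{}", v);         // would be rejected: `v` was moved
    u                             // returned to the caller, so not dropped here
}

/// A moved-from variable becomes usable again once it is re-assigned.
fn reassign_after_move() -> (String, String) {
    let mut v = String::from("first");
    let u = v;                    // `v` is invalid from here on ...
    v = String::from("second");   // ... until it is given a new value to own
    (u, v)
}
```

Uncommenting the println! line makes the compiler reject the program, which is the static check the ownership rules rely on.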

This permanent transfer of ownership gives strong guarantees but can be cumbersome to manage in programming. In order to allow sharing of values between different parts of the program, Rust uses the concept of borrowing, which refers to creating a reference (marked by an ampersand). A reference allows referring to some value without taking ownership of it. Borrowing gives the temporary right to read and, potentially, uniquely mutate the referenced value.

This concept of time creates another dimension of ownership management known as lifetime. For mutable references (as marked by &mut in the above examples), only one reference is allowed at a time. But for immutable references (the ones without the mut marking), several can coexist as long as there is no mutable reference at the same time. As one can expect, this interaction of mutable and immutable references, and their lifetimes, is highly non-trivial. In this paper, we focus on analysing mutable references.
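The borrowing rules can be sketched as follows. This is an illustrative example under our own naming (push_via_mut_ref, sum_twice and borrow_demo are hypothetical), showing one mutable borrow at a time and several coexisting immutable borrows:

```rust
/// Appends to a vector through a unique mutable borrow.
fn push_via_mut_ref(values: &mut Vec<i32>, x: i32) {
    values.push(x);
}

/// Reads through two shared (immutable) borrows that coexist.
fn sum_twice(values: &[i32]) -> i32 {
    let a = values; // first shared borrow
    let b = values; // second shared borrow, allowed alongside `a`
    a.iter().sum::<i32>() + b.iter().sum::<i32>()
}

fn borrow_demo() -> (Vec<i32>, i32) {
    let mut owner = vec![1, 2];
    push_via_mut_ref(&mut owner, 3); // the mutable borrow ends when the call returns
    let total = sum_twice(&owner);   // shared borrows are fine once no `&mut` is live
    (owner, total)                   // `owner` kept ownership throughout
}
```

Neither helper takes ownership of the vector, so borrow_demo can still return it at the end.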

2.2 Pointer Types in Rust

Rust has a richer pointer system than C. The primitive C-style pointers (written as *const T or *mut T) are known as raw pointers, which are ignored by the Rust compiler for ownership and lifetime checks. Raw pointers are a major source of unsafe Rust (more below). Idiomatic Rust instead advocates box pointers (written as Box<T>) as owning pointers that uniquely own heap allocations, as well as references (written as &mut T or &T, as discussed in the previous subsection) as non-owning pointers that are used to access values owned by others. Rust also offers smart pointers for which the borrow rules are checked at runtime (e.g. RefCell<T>). We aim for our translation to maintain CPU time without additional runtime overhead, and therefore we do not refactor raw pointers into RefCell<T>s.

C-style array pointers are represented in Rust as references to arrays and slice references, with array bounds known at compile time and runtime, respectively. The creation of meta-data such as array bounds is beyond the scope of ownership analysis. In this work, we keep array pointers as raw pointers in the translated code.

2.3 Unsafe Rust

As a pragmatic design, Rust allows programs to contain features that cannot be verified by the compiler as memory safe. This includes dereferencing raw pointers, calling low-level functions, and so on. Such uses must be marked with the unsafe keyword and form fragments of unsafe Rust. It is worth noting that unsafe does not turn off all compiler checks; safe pointers are still checked.

Unsafe Rust is often used to implement data structures with complex sharing, overcome incompleteness issues of the Rust compiler, and support low-level systems programming [2, 18]. But it can also be used for other reasons. For example, c2rust [26] directly translates C pointers into raw pointers. Without unsafe Rust, the generated code would not compile.
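To make the distinction between the pointer kinds concrete, the sketch below (all function names are hypothetical) performs the same increment through a raw pointer, a box pointer and a nullable mutable reference. Only the raw-pointer version needs unsafe:

```rust
/// Increments the pointee of a raw pointer.
/// Safety: `p` must be null or point to a valid, uniquely accessed `i32`.
unsafe fn bump_raw(p: *mut i32) {
    if !p.is_null() {
        // Dereferencing a raw pointer is not checked by the compiler.
        unsafe { *p += 1 };
    }
}

/// Increments through an owning `Box`; no `unsafe` is needed.
fn bump_box(mut b: Box<i32>) -> Box<i32> {
    *b += 1;
    b
}

/// Increments through a nullable mutable reference.
fn bump_ref(r: Option<&mut i32>) {
    if let Some(v) = r {
        *v += 1;
    }
}

fn pointer_kinds_demo() -> (i32, i32, i32) {
    let mut x = 0;
    unsafe { bump_raw(&mut x) }; // the `&mut i32` coerces to `*mut i32`
    let b = bump_box(Box::new(0));
    let mut y = 0;
    bump_ref(Some(&mut y));
    bump_ref(None); // the null case is expressed as `None`
    (x, *b, y)
}
```

The Option wrapper plays the role of the null check that the raw-pointer version performs by hand.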

3 Overview

In this section, we present an overview of Crown via two examples. The first example provides a detailed description of the method for a singly-linked list, whereas the second shows a snippet from a real-world benchmark.

Fig. 1. Pushing into a singly-linked list

3.1 Pushing into a Singly-Linked List

The C function in Fig. 1a allocates a new node where it stores the data received as argument. The new node subsequently becomes the head of the list. This code is translated by c2rust to the Rust code in Fig. 1b. Notably, the translation is syntax-based and simply changes all the C pointers to raw pointers. Given that dereferencing raw pointers is considered an unsafe operation in Rust (e.g. the dereference at line 16 in Fig. 1b), the method must be annotated with the unsafe keyword (alternatively, the dereference could be placed inside an unsafe block). Additionally, c2rust introduces two directives for the two struct definitions, #[repr(C)] and #[derive(Copy, Clone)]. The former keeps the data layout the same as in C for possible interoperation, and the latter instructs that the corresponding type can only be duplicated through copying.

While c2rust uses raw pointers in the translation, the ownership scheme in Fig. 1b obeys the Rust ownership model, meaning that the raw pointers could be translated to safe ones. A pointer to a newly allocated node is assigned to a local variable at line 15. This allows us to infer that the ownership of the newly allocated node belongs to this local variable. Then, at line 18, the ownership is transferred from the local variable to the head pointer of the list. Additionally, if the head pointer owns any memory object prior to line 17, then its ownership is transferred to the next pointer of the new node at line 17. This ownership scheme corresponds to safe pointer use: (i) each memory object is associated with a unique owner and (ii) it is dropped when its owner goes out of scope. As an illustration for (i), when the ownership of the newly allocated memory is transferred from the local variable to the head pointer at line 18, the head pointer becomes the unique owner, whereas the local variable is made invalid and is no longer used. For (ii), given that the list argument of the function is an output parameter (i.e. a parameter that can be accessed from outside the function), we assume that it must be owning on exit from the method. Thus, no memory object is dropped in the method, but rather returned to the caller.

Crown infers the ownership information of the code translated by c2rust, and uses it to translate the code to safer Rust in Fig. 1c. As explained next, Crown first retypes raw pointers into safe pointers based on the ownership information, and then rewrites their uses.

Retyping Pointers in Crown. If a pointer owns a memory object at any point within its scope, Crown retypes it into a Box pointer. For instance, in Fig. 1c, the local variable holding the new node is retyped to be an Option-wrapped Box pointer (safe pointer types are wrapped into Option to account for null pointer values). This variable is non-owning upon function entry, becomes owning at line 13, and its ownership is transferred out again at line 16.

For struct fields, Crown considers all the code in the scope of the struct declaration. If a struct field owns a memory object at any point within the scope of its struct declaration, then it is retyped to Box. In Fig. 1b, the two pointer struct fields are accessed via their respective access paths and given ownership at lines 17 and 18, respectively. Consequently, they are retyped to Box at lines 4 and 9 in Fig. 1c, respectively.

A special case is that of output parameters, e.g. the list parameter in our example. For such parameters, although they may be owning, Crown retypes them to &mut in order to enable borrowing. In the example, the input argument is retyped to an Option-wrapped mutable reference.

Rewriting Pointer Uses in Crown. After retyping pointers, Crown rewrites their uses. The rewrite process takes into consideration both their new type and the context in which they are being used. Due to the Rust semantics, the rewrite rules are slightly intricate (see Sect. 6). For instance, the pointer dereference at line 14 is rewritten into a mutable borrow followed by an unwrap, as the pointee needs to be mutated and the optional part of the Box needs to be unwrapped. Similarly, at line 15, the right-hand side is rewritten so that it yields a Box pointer, as the LHS of the assignment expects a Box pointer.

After the rewrite performed by Crown, the unsafe block annotation is not needed anymore. However, Crown does not attempt to remove such annotations. Notably, safe pointers are always checked by the Rust compiler, even inside unsafe blocks.
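For illustration, the following is a minimal sketch of what a safe push over such a list can look like after retyping and rewriting. It is not the code of Fig. 1c: the names Node, List, data, next, head, push and to_vec, as well as the i32 payload, are assumptions made for this example.

```rust
// Hypothetical types; the definitions of Fig. 1 are not reproduced here.
struct Node {
    data: i32,
    next: Option<Box<Node>>, // owning field: a Box wrapped in Option for null
}

struct List {
    head: Option<Box<Node>>, // owning field
}

/// Pushes `data` at the front of the list.
/// `list` is an output parameter, so it is a nullable mutable reference
/// rather than a Box. Panics if `list` is `None`.
fn push(list: Option<&mut List>, data: i32) {
    let list = list.unwrap();
    // The local variable becomes owning here.
    let mut new_node = Some(Box::new(Node { data, next: None }));
    // Mutably borrow the Box, unwrap the Option, and move the old head into `next`.
    new_node.as_deref_mut().unwrap().next = list.head.take();
    // Ownership moves from the local variable to the `head` field.
    list.head = new_node;
}

/// Collects the payloads from head to tail (helper for inspection only).
fn to_vec(list: &List) -> Vec<i32> {
    let mut out = Vec::new();
    let mut cur = list.head.as_deref();
    while let Some(node) = cur {
        out.push(node.data);
        cur = node.next.as_deref();
    }
    out
}
```

No node is dropped inside push: every allocation ends up owned by the list, which the caller continues to own.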

3.2 Freeing an Argument List in bzip2

We next show the transformation of a real-world code snippet with a loop structure: a piece of code in bzip2 that frees argument lists. bzip2 defines a singly-linked-list-like structure, Cell, that holds a list of argument names. In Fig. 2, we extract from the source code a snippet that frees the argument lists. Here, the local variable argList is an already constructed argument list, and Char is a type alias to C-style characters. As a note, Cell in Figs. 2b and 2c does not refer to Rust's std::cell::Cell.

Fig. 2. Freeing an argument list

Crown accurately infers an ownership scheme for this snippet. First, ownership of argList is transferred to aa, which is to be freed in the subsequent loop. Inside the loop, ownership of link accessed from aa is first transferred to aa2, then ownership of name accessed from aa is released in a call to free. After the conditional, ownership of aa is also released. Finally, aa regains ownership from aa2.

Handling of Loops. For loops, Crown only analyses their body once as that will already expose all the ownership information. For inductively defined data structures such as Cell, while further unrolling of loop bodies explores the data structures deeper, it does not expose any new struct fields: pointer variables and pointer struct fields do not change ownership between loop iterations. Additionally, Crown emits constraints that equate the ownership of all local pointers at the loop entry and exit. For example, the ownership statuses of aa and aa2 at loop entry are made equal with those at loop exit, and inferred to be owning and non-owning, respectively.

Handling of Null Pointers. It is a common C idiom for pointers to be checked against null after malloc or before free. This could be problematic since the then-branch and the else-branch would have conflicting ownership statuses for the checked pointer. We adopt a solution similar to [24]: we insert an explicit null assignment to the pointer in the null branch. As we treat null pointers as both owning and non-owning, the ownership of the pointer will be dictated by the non-null branch, enabling Crown to infer the correct ownership scheme.

Translation. With the above ownership scheme, Crown performs the rewrites as in Fig. 2c. Note that we do not attempt to rewrite name since it is an array pointer (see Sect. 7 for limitations).
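The ownership flow through such a freeing loop can be sketched in safe Rust as follows. This is an illustration under our own assumptions rather than the code of Fig. 2c: the struct layout, the names aa and aa2, the function free_arg_list, and the use of String for the name (which the translation above keeps as a raw pointer) are all hypothetical.

```rust
// Hypothetical node type; unrelated to std::cell::Cell.
struct Cell {
    #[allow(dead_code)]
    name: String,            // stand-in for the C string
    link: Option<Box<Cell>>, // owning pointer to the rest of the list
}

/// Frees the list node by node and returns how many nodes were released.
fn free_arg_list(arg_list: Option<Box<Cell>>) -> usize {
    let mut freed = 0;
    let mut aa = arg_list;          // ownership moves from `arg_list` to `aa`
    while let Some(mut node) = aa { // `aa` gives up the current node
        let aa2 = node.link.take(); // ownership of the link moves to `aa2`
        drop(node);                 // the node (and its name) is released
        freed += 1;
        aa = aa2;                   // `aa` regains ownership from `aa2`
    }
    freed
}
```

At the start of every iteration aa is owning and aa2 holds nothing, which matches equating the ownership of local pointers at loop entry and exit.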

4 Architecture

In this section, we give a brief overview of Crown's architecture. Crown takes as input a Rust program with unsafe blocks, and outputs a safer Rust program, where a portion of the raw pointers have been retyped as safe ones (in accordance with the Rust ownership model), and their uses modified accordingly. In this paper we focus on applying our technique to programs automatically translated by c2rust, which maintain a high degree of similarity to the original C ones, where the C syntax is replaced by Rust syntax.

Crown applies several static analyses on the MIR of Rust to infer properties of pointers:

  • Ownership analysis: computes ownership information about the pointers in the code, i.e. for each pointer it infers whether it is owning/non-owning at particular program locations.

  • Mutability analysis: infers which pointers are used to modify the object they point to (inspired by [22, 25]).

  • Fatness analysis: distinguishes array pointers from non-array pointers (inspired by [32]).

The results of these analyses are summarised as type qualifiers [21]. A type qualifier is an atomic property (i.e., ownership, mutability, and fatness) that ‘qualifies’ the standard pointer type. These qualifiers are then utilised for pointer retyping. For example, an owning, non-array pointer is retyped to Box. After pointers have been retyped, Crown rewrites their usages accordingly.

5 Ownership Analysis

The goal of our ownership analysis is to compute an ownership scheme for a given program that obeys the Rust ownership model, if such a scheme exists. The ownership scheme contains information about whether pointers in the program are owning or non-owning at particular program locations. At a high-level, our analysis works by generating a set of ownership constraints (Sect. 5.2), which are then solved by a SAT solver (Sect. 5.3). A satisfying assignment for the ownership constraints is an ownership scheme that obeys the Rust semantics.

Our ownership analysis is flow and field sensitive, where the latter enables inferring ownership information for pointer struct fields. To achieve field sensitivity, we track ownership information for access paths [10, 14, 29]. An access path represents a memory location by the way it is accessed from an initial, base variable, and consists of the base variable and a sequence of field selection operators. For the program in Fig. 1b, example access paths include a path consisting only of a base variable, as well as longer paths that dereference a base variable and select a field. Our analysis associates an ownership variable with each access path, e.g. \(\texttt {p}\) has associated ownership variable \(\mathbb {O}_{p}\), and \(\texttt {(*p).next}\) has associated ownership variable \(\mathbb {O}_{(*p).next}\). Each ownership variable can take the value 1 if the corresponding access path is owning, or 0 if it is non-owning. By ownership of an access path we mean the ownership of the field (or, more generally, pointer) accessed last through the access path, e.g. the ownership of \(\texttt {(*p).next}\) refers to the ownership of field \(\texttt {next}\).

5.1 Ownership and Aliasing

One of the main challenges of designing an ownership analysis is the interaction between ownership and aliasing. To understand the problem, let us consider the pointer assignment at line 3 in the code listing below. We assume that the lines before the assignment allow inferring which of the pointers involved are owning and which are non-owning. Additionally, we assume that the lines after the assignment require the assigned pointer to be owning (e.g. because it is explicitly freed). From this, an ownership analysis could reasonably conclude that ownership transfer happens at line 3 (such that the assigned pointer becomes owning), and the inferred ownership scheme obeys the Rust semantics.

figure di

Let us now also consider aliasing. A possible assumption is that, just before line 3, \(\texttt {p}\) and \(\texttt {q}\) alias, meaning that \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\) also alias. Then, after line 3, \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\) will still alias (pointing to the same memory object). However, according to the ownership scheme above, both \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\) are owning, which is not allowed in Rust, where a memory object must have a unique owner. This discrepancy was not detected by the ownership analysis mimicked above. The issue is that the ownership analysis ignored aliasing. Indeed, ownership should not be transferred to \(\texttt {(*p).next}\) if there exists an owning alias that, after the ownership transfer, continues to point to the same memory object as \(\texttt {(*p).next}\).

Precise aliasing information is very difficult to compute, especially in the presence of inductively defined data structures. In the current paper, we alleviate the need to check aliasing by making a strengthening assumption about the Rust ownership model: we restrict the way in which pointers can acquire ownership along an access path, thus limiting the interaction between ownership and aliasing. In particular, we introduce a novel concept of ownership monotonicity. This property states that, along an access path, the ownership values of pointers can only decrease (see Definition 1, where \(\textit{is}\_\textit{prefix}(a, b)\) returns true if access path a is a prefix of b, and false otherwise; e.g. \(\textit{is}\_\textit{prefix}(\texttt {p}, \texttt {(*p).next})\) = true). Going back to the previous code listing, ownership monotonicity implies that, for access path \(\texttt {(*p).next}\) we have \(\mathbb {O}_{\texttt {p}} \ge \mathbb {O}_{\texttt {(*p).next}}\), and for access path \(\texttt {(*q).next}\) we have \(\mathbb {O}_{\texttt {q}} \ge \mathbb {O}_{\texttt {(*q).next}}\). This means that, if \(\texttt {(*p).next}\) is allowed to take ownership, then \(\texttt {p}\) must already be owning. Consequently, all aliases of \(\texttt {p}\) must be non-owning, which means that all aliases of \(\texttt {(*p).next}\), including \(\texttt {(*q).next}\), are non-owning.

Definition 1 (Ownership monotonicity)

Given two access paths a and b, if \(\textit{is}\_\textit{prefix}(a, b)\), then \(\mathbb {O}_{a} \ge \mathbb {O}_b\).

Ownership monotonicity is stricter than the Rust semantics, causing our analysis to reject two scenarios that would otherwise be accepted by the Rust compiler (see discussion in Sect. 5.4). In this work, we made the design decision to use ownership monotonicity over aliasing analysis as it allows us to retain more control over the accuracy of the translation. Conversely, using an aliasing analysis would mean that the accuracy of the translation is directly dictated by the accuracy of the aliasing analysis (i.e. false alarms from the aliasing analysis [23, 40] would result in Crown not translating pointers that are actually safe). With ownership monotonicity, we know exactly what the rejected valid ownership schemes are, and we can explicitly enable them (again, see discussion in Sect. 5.4).

5.2 Generation of Ownership Constraints

During constraint generation, we assume a given k denoting the length of the longest access path used in the code. This enables us to capture the ownership of all the access paths exposed in the code. Later in this section, we will discuss the handling of loops, which may expose longer access paths.

Next, we denote by \(\mathcal {P}\) the set of all access paths in a program; \(base\_var(a)\) returns the base variable of access path a, and |a| is the length of access path a, counting the base variable and each field selection operator applied to it. In the context of the previous code listing, \(base\_var(\texttt {(*p).next}) = \texttt {p}\), \(base\_var(\texttt {p}) = \texttt {p}\), \(|\texttt {p}| = 1\) and \(|\texttt {(*p).next}| = 2\). Then, we define \(ap(v, lb, ub)\) to return the set of access paths with base variable v and length between lower bound lb and upper bound ub: \(ap(v, lb, ub) = \left\{ a \in \mathcal {P} \mid base\_var(a) = v \wedge lb \le |a| \le ub \right\} \). For illustration, we have \(ap(\texttt {p}, 1, 2) = \left\{ \texttt {p}, \texttt {(*p).next}\right\} \).
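The notions of access path, prefix, length, \(ap\) and ownership monotonicity (Definition 1) can be sketched in code as follows. This is an illustrative encoding and not Crown's internal representation: the AccessPath struct, its field names and the function is_monotone are assumptions made for this example.

```rust
/// An access path: a base variable followed by field selections,
/// e.g. `(*p).next` is base "p" with fields ["next"].
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct AccessPath {
    base: String,
    fields: Vec<String>,
}

impl AccessPath {
    fn new(base: &str, fields: &[&str]) -> Self {
        AccessPath {
            base: base.to_string(),
            fields: fields.iter().map(|f| f.to_string()).collect(),
        }
    }

    /// |a|: the base variable counts as 1, each field selection adds 1.
    fn length(&self) -> usize {
        1 + self.fields.len()
    }
}

/// is_prefix(a, b): same base variable and a's fields start b's fields.
fn is_prefix(a: &AccessPath, b: &AccessPath) -> bool {
    a.base == b.base && b.fields.starts_with(&a.fields)
}

/// ap(v, lb, ub): paths with base variable v and length in [lb, ub].
fn ap<'a>(paths: &'a [AccessPath], v: &str, lb: usize, ub: usize) -> Vec<&'a AccessPath> {
    paths
        .iter()
        .filter(|a| a.base == v && lb <= a.length() && a.length() <= ub)
        .collect()
}

/// Checks Definition 1 on an assignment of ownership values (0 or 1):
/// whenever a is a prefix of b, O_a >= O_b must hold.
fn is_monotone(own: &[(AccessPath, u8)]) -> bool {
    own.iter().all(|(a, oa)| {
        own.iter().all(|(b, ob)| !is_prefix(a, b) || oa >= ob)
    })
}
```

Under this encoding, an owning \(\texttt {p}\) with a non-owning \(\texttt {(*p).next}\) is accepted, whereas a non-owning \(\texttt {p}\) with an owning \(\texttt {(*p).next}\) is rejected.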

Fig. 3. Ownership constraint generation for assignment

Ownership Transfer. The program instructions where ownership transfer can happen are (pointer) assignment and function call. Here we discuss assignment and, due to space constraints, we defer the rules for interprocedural ownership analysis to the extended version [41]. Our rule for ownership transfer at an assignment site follows Rust's semantics: when a pointer is moved, the object it points to is moved as well. For instance, in the following Rust pseudocode snippet:

figure ee

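when ownership is transferred from one pointer to another, the pointers reachable from the source also lose ownership. Except for reassignment, the use of a pointer after it has lost its ownership is disallowed, hence any use of the moved-from pointer, or of a pointer reachable from it, is forbidden at line 3.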

Consequently, we enforce the following ownership transfer rule: if ownership transfer happens for a pointer variable (e.g. \(\texttt {p}\) and \(\texttt {q}\) in the example), then it must happen for all pointers reachable from that pointer (e.g. \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\)). The ownership of pointer variables from which the pointer under discussion is reachable remains the same (e.g. if ownership transfer happens for some assignment between \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\) in the code, then \(\texttt {p}\) and \(\texttt {q}\) retain their respective previous ownership values).

Possible Ownership Transfer at Pointer Assignment: The ownership transfer rule at pointer assignment site is captured by rule ASSIGN in Fig. 3. The judgement \(C\vdash \texttt {p = q;}\Rightarrow C'\) denotes the fact that the assignment is analysed under the set of constraints C, and generates \(C'\). We use prime notation to denote variables after the assignment. Given pointer assignment \(\texttt {p = q}\), a and b represent all the access paths respectively starting from \(\texttt {p}\) and \(\texttt {q}\), whereas c and d denote the access paths from the base variables of \(\texttt {p}\) and \(\texttt {q}\) that reach \(\texttt {p}\) and \(\texttt {q}\), respectively. Then, equality \(\mathbb {O}_{a'} + \mathbb {O}_{b'}=\mathbb {O}_{b}\) captures the possibility of ownership transfer for all access paths originating at \(\texttt {p}\) and \(\texttt {q}\): (i) If transfer happens then the ownership of b transfers to \(a'\) (\(\mathbb {O}_{a'}=\mathbb {O}_{b}\) and \(\mathbb {O}_{b'}=0\)). (ii) Otherwise, the ownership values are left unchanged (\(\mathbb {O}_{a'}=\mathbb {O}_{a}\) and \(\mathbb {O}_{b'}=\mathbb {O}_{b}\)). The last two equalities, \( \mathbb {O}_{c'}=\mathbb {O}_{c} \wedge \mathbb {O}_{d'}=\mathbb {O}_{d}\), denote the fact that, for both (i) and (ii), pointers on access paths c and d retain their previous ownership. Note that “\(+\)” is interpreted as the usual arithmetic operation over \(\mathbb {N}\), where we impose an implicit constraint \(0 \le \mathbb {O} \le 1\) for every ownership variable \(\mathbb {O}\).

C Memory Leaks: In the ASSIGN rule, we add constraint \(\mathbb {O}_{a}=0\) to \(C'\) in order to force a to be non-owning before the assignment. Conversely, having a owning before being reassigned via the assignment under analysis signals a memory leak in the original C program. Given that in Rust memory is automatically returned, allowing the translation to happen would change the semantics of the original program by fixing the memory leak. Instead, our design choice is to disallow the ownership analysis from generating such a solution. As we will explain in Sect. 8, we intend for our translation to preserve memory usage (including possible memory leaks).
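To see which ownership assignments these constraints admit for a single pair of access paths a and b, the following sketch enumerates all 0/1 valuations of \(\mathbb {O}_{a}\), \(\mathbb {O}_{b}\), \(\mathbb {O}_{a'}\) and \(\mathbb {O}_{b'}\) satisfying \(\mathbb {O}_{a}=0\) and \(\mathbb {O}_{a'} + \mathbb {O}_{b'}=\mathbb {O}_{b}\). The function name and tuple layout are our own, and the constraints on c and d are omitted for brevity.

```rust
/// Returns every (O_a, O_b, O_a', O_b') over {0, 1} that satisfies
///   O_a = 0            (a must be non-owning before the assignment)
///   O_a' + O_b' = O_b  (ownership either transfers to a or stays with b)
fn assign_solutions() -> Vec<(u8, u8, u8, u8)> {
    let mut solutions = Vec::new();
    for oa in 0..=1u8 {
        for ob in 0..=1u8 {
            for oa_new in 0..=1u8 {
                for ob_new in 0..=1u8 {
                    if oa == 0 && oa_new + ob_new == ob {
                        solutions.push((oa, ob, oa_new, ob_new));
                    }
                }
            }
        }
    }
    solutions
}
```

Three valuations remain: nothing is owned before or after, b keeps its ownership (no transfer), or ownership moves from b to a'. A valuation in which both a' and b' are owning is excluded, since their sum would exceed \(\mathbb {O}_{b}\).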

Simultaneous Ownership Transfer Along an Access Path: One may observe that the constraints generated by ASSIGN do not fully capture the stated ownership transfer rule. In particular, they do not ensure that, whenever ownership transfer occurs from \(\texttt {q}\) to \(\texttt {p}\) at an assignment \(\texttt {p = q}\), it also transfers for all pointers on all access paths a and b. Instead, this is implicitly guaranteed by the ownership monotonicity rule, as stated in Theorem 1.

Theorem 1 (Ownership transfer)

If ownership is transferred from q to p at an assignment \(\texttt {p = q}\), then, by the ASSIGN rule and ownership monotonicity, ownership also transfers between corresponding pointers on all access paths a and b: \(\mathbb {O}_{a'} =\mathbb {O}_{b}\) and \(\mathbb {O}_{b'}=0\). (proof in the extended version [41])

Ownership and Aliasing: We saw in Sect. 5.1 that aliasing may cause situations in which, after ownership transfer, the same memory object has more than one owner. Theorem 2 states that this is not possible under ownership monotonicity.

Theorem 2 (Soundness of pointer assignment under ownership monotonicity)

Under ownership monotonicity, if all allocated memory objects have a unique owner before a pointer assignment, then they will also have a unique owner after the assignment. (proof in the extended version [41])

Intuitively, Theorem 2 enables a pointer to acquire ownership without having to consider aliases: after ownership transfer, this pointer will be the unique owner. The idea resembles that of strong updates [30].

Additional Access Paths: As a remark, it is possible for \(\texttt {p}\) and \(\texttt {q}\) to be accessible from other base variables in the program. In such cases, given that those access paths are not explicitly mentioned at the location of the ownership transfer, we do not generate new ownership variables for them. Consequently, their current ownership variables are left unchanged by default.

Ownership Transfer Example. To illustrate the ASSIGN rule, we use the singly-linked list example below, where we assume that \(\texttt {p}\) and \(\texttt {q}\) are both pointers to list nodes. Therefore, we have to consider the following four access paths: \(\texttt {p}\), \(\texttt {q}\), \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\). In SSA style, at each line in the example, we generate new ownership variables (by incrementing their subscript) for the access paths mentioned at that line. For the first assignment, ownership transfer can happen between \(\texttt {p}\) and \(\texttt {q}\), and \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\), respectively. For the second assignment, ownership can be transferred between \(\texttt {(*p).next}\) and \(\texttt {(*q).next}\), while \(\texttt {p}\) and \(\texttt {q}\) must retain their previous ownership.

figure eu

Besides generating ownership constraints for assignments, we must model the ownership information for commonly used C standard library functions (such as malloc and free). Due to space constraints, more details about these, as well as the rules for ownership monotonicity and interprocedural ownership analysis, are provided in the extended version [41].

Handling Conditionals and Loops. As mentioned in Sect. 3.2, we only analyse the body of loops once as it is sufficient to expose all the required ownership variables. For inductively defined data structures, while further unrolling of loop bodies increases the length of access paths, it does not expose any new struct fields (struct fields do not change ownership between loop iterations).

To handle join points of control paths, we apply a variant of the SSA construction algorithm [6], where different paths are merged via \(\phi \) nodes. The value of each ownership variable must be the same on all joined paths, or otherwise the analysis fails.

5.3 Solving Ownership Constraints

The ownership constraint system consists of a set of 3-variable linear constraints of the form \(O_v = O_w + O_u\), and 1-variable equality constraints \(O_v=0\) and \(O_v=1\).

Definition 2 (Ownership constraint system)

An ownership constraint system \((P, \varDelta , \varSigma , \varSigma _{\lnot })\) consists of a set of ownership variables P that can have either value 0 or 1, a set of 3-variable equality constraints \(\varDelta \subseteq P \times P \times P\), and two sets of 1-variable equality constraints, \(\varSigma , \varSigma _{\lnot } \subseteq P\). The equalities in \(\varSigma \) are of the form \(x=1\), whereas the equalities in \(\varSigma _{\lnot }\) are of the form \(x=0\).

Theorem 3 (Complexity of the ownership constraint solving)

Deciding the satisfiability of the ownership constraint system in Definition 2 is NP-complete. (proof in the extended version [41]).

We solve the ownership constraints by calling a SAT solver. The ownership constraints may have no solution. This happens when there is no ownership scheme that obeys the Rust ownership model and the ownership monotonicity property (which is stricter than the Rust model for some cases), or the original C program has a memory leak. In the case where the ownership constraints have more than one solution, we consider the first assignment returned by the SAT solver.
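The constraint system of Definition 2 can be sketched as a small data structure together with a solver. Note that the solver below is a brute-force enumeration that stands in for the SAT solver and is only practical for a handful of variables. The struct, its field names and the function solve are assumptions made for this example.

```rust
/// Ownership constraint system over variables numbered 0..num_vars.
struct OwnershipConstraints {
    num_vars: usize,
    /// (v, w, u) encodes the 3-variable constraint O_v = O_w + O_u.
    sums: Vec<(usize, usize, usize)>,
    /// Variables constrained to 1 (owning).
    ones: Vec<usize>,
    /// Variables constrained to 0 (non-owning).
    zeros: Vec<usize>,
}

/// Returns the first satisfying 0/1 assignment found, or None if unsatisfiable.
/// Exhaustive search used here in place of a SAT solver.
fn solve(cs: &OwnershipConstraints) -> Option<Vec<u8>> {
    assert!(cs.num_vars < 20, "brute force is only meant for tiny systems");
    for bits in 0u32..(1u32 << cs.num_vars) {
        // Value of variable i under the current candidate assignment.
        let val = |i: usize| ((bits >> i) & 1) as u8;
        let ok = cs.sums.iter().all(|&(v, w, u)| val(v) == val(w) + val(u))
            && cs.ones.iter().all(|&x| val(x) == 1)
            && cs.zeros.iter().all(|&x| val(x) == 0);
        if ok {
            return Some((0..cs.num_vars).map(|i| val(i)).collect());
        }
    }
    None
}
```

For example, numbering \(\mathbb {O}_{a}\), \(\mathbb {O}_{b}\), \(\mathbb {O}_{a'}\), \(\mathbb {O}_{b'}\) as 0 to 3, the constraints \(\mathbb {O}_{b} = \mathbb {O}_{a'} + \mathbb {O}_{b'}\), \(\mathbb {O}_{a}=0\), \(\mathbb {O}_{b}=1\) and \(\mathbb {O}_{a'}=1\) have the single solution in which ownership transfers from b to a'. Requiring both \(\mathbb {O}_{a'}=1\) and \(\mathbb {O}_{b'}=1\) instead makes the system unsatisfiable.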

Due to the complex Rust semantics, we do not formally prove that a satisfying assignment obeys the Rust ownership model. Instead, this check is performed after the translation by running the Rust compiler.

5.4 Discussion on Ownership Monotonicity

As mentioned earlier in Sect. 5, ownership monotonicity is stricter than the Rust semantics, causing our analysis to potentially reject some ownership schemes that would otherwise be accepted by the Rust compiler. We identified two such scenarios:

(i) Reference output parameter: This denotes a reference passed as a function parameter, which acts as an output as it can be accessed from outside the function (e.g. the list parameter in Fig. 1a). For such parameters, the base variable is non-owning (as it is a reference) and mutable, whereas the pointers reachable from it may be owning (see example in Fig. 1c, where a field reachable from the parameter gets assigned a pointer to a newly allocated node). We detect such situations and explicitly enable them. In particular, we explicitly convert owning pointers to &mut at the translation stage.

(ii) Local borrows: The code below involving a mutable local borrow is not considered valid by Crown as it disobeys ownership monotonicity: after the assignment, the borrowing reference is non-owning, whereas a pointer reachable through it is owning.

figure fc

While we could explicitly handle the translation to local borrows, in order to do so soundly, we would have to reason about lifetime information (e.g. Crown would have to check that there is no overlap between the lifetimes of different mutable references to the same object). In this work, we chose not to do this and instead leave it as future work (as also mentioned under limitations in Sect. 7). It was observed in [13] that scenario (i) is much more prevalent than scenario (ii). Additionally, we observed in our benchmarks that output parameters account for 93% of mutable references (hence the inclusion of a special case enabling the translation of scenario (i) in Crown).

6 C to Rust Translation

Crown uses the results of the ownership, mutability and fatness analyses to perform the actual translation, which consists of retyping pointers (Sect. 6.1) and rewriting pointer uses (Sect. 6.2).

6.1 Retyping Pointers

As mentioned in Sect. 2.2, we do not attempt to translate array pointers to safe pointers. In the rest of the section, we focus on mutable, non-array pointers.

The translation requires a global view of pointers’ ownership, whereas information inferred by the ownership analysis refers to individual program locations. For the purpose of translation, given that we refactor owning pointers into box pointers, a pointer is considered (globally) owning if it owns a memory object at any program location within its scope. Otherwise, it is (globally) non-owning. When retyping pointer fields of structs, we must consider the scope of the struct declaration, which generally transcends the whole program. Within this scope, each field is usually accessed from several base variables, which must all be taken into consideration. For instance, given the declaration in Fig. 1b and two variables and of type . Then, in order to determine the ownership status of field , we have to consider all the access paths to originating from both base variables and .

The following table shows the retyping rules for mutable, non-array pointers, where we wrap safe pointer types into Option to account for null pointer values:

Non-array pointers
  Owning: Option<Box<T>>
  Non-owning: *mut T or Option<&mut T>

The non-owning pointers that are kept as raw pointers correspond to mutable local borrows. As explained in Sects. 5.4 and 7, Crown doesn’t currently handle the translation to mutable local borrows due to the fact that we do not have a lifetime analysis. Notably, this restriction does not apply to output parameters (which covers the majority of mutable references), where we translate to mutable references. The lack of a lifetime analysis means that we also can’t handle immutable local borrows, hence our translation’s focus on mutable pointers.
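As a hedged illustration of these retyping rules, the sketch below contrasts a struct with raw pointer fields with a retyped version. The types `RawNode` and `SafeNode` and the function `set_value` are hypothetical and do not come from the paper's figures; the sketch assumes that the `next` field was inferred to be owning.

```rust
// Hypothetical node type in the style of c2rust output: every pointer is raw.
#[repr(C)]
struct RawNode {
    value: i32,
    next: *mut RawNode,
}

// The same type after retyping, assuming `next` was found to be (globally) owning.
struct SafeNode {
    value: i32,
    next: Option<Box<SafeNode>>, // None plays the role of the null pointer
}

// A non-owning output parameter is retyped to an optional mutable reference.
fn set_value(node: Option<&mut SafeNode>, value: i32) -> bool {
    match node {
        Some(n) => {
            n.value = value;
            true
        }
        None => false, // corresponds to a null argument in the C code
    }
}
```

A non-owning pointer that corresponds to a mutable local borrow would instead keep its raw type, as in `RawNode`.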

6.2 Rewriting Pointer Uses

The rewrite of a pointer expression depends on its new type and the context in which it is used. For example, when rewriting in , the context will depend on the new type of . Based on this new type, we distinguish four contexts: \(\textsf {BoxCtxt}\), which requires Box pointers; \(\textsf {MutCtxt}\), which requires mutable references; \(\textsf {ConstCtxt}\), which requires immutable references; and \(\textsf {RawCtxt}\), which requires raw pointers. For example, if above is a Box pointer, then we rewrite in a \(\textsf {BoxCtxt}\).

Then, the rewrite takes place according to the following table, where columns correspond to the new type of the pointer to be rewritten, and rows represent possible contexts (see Footnote 1).

figure fz

Our translation uses functions from the Rust standard library, as follows:

  1. When is passed to a \(\textsf {BoxCtxt}\), we expect a move, and consequently we use to replace the value inside the option with None;

  2. We use and in order to not consume the original option, and we create new options with references to the original ones;

  3. and convert raw pointers to references;

  4. converts raw pointers into Box pointers.
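The four kinds of conversion listed above can be sketched with standard library calls. Which exact functions the translation uses is not recoverable from the text here, so the choice of `Option::take`, `Option::as_deref`, `Option::as_deref_mut`, the raw pointer method `as_mut`, and `Box::from_raw` is an assumption, as are all function names in the sketch.

```rust
// 1. Moving out of an Option<Box<T>>: `take` leaves None behind.
fn move_out(p: &mut Option<Box<i32>>) -> Option<Box<i32>> {
    p.take()
}

// 2. Borrowing without consuming the original option: a new option holding
//    a reference into the original one is created.
fn read(p: &Option<Box<i32>>) -> Option<&i32> {
    p.as_deref()
}

fn write(p: &mut Option<Box<i32>>) -> Option<&mut i32> {
    p.as_deref_mut()
}

// 3. Raw pointer to optional reference; a null pointer maps to None.
// Safety: `p` must be null or point to a valid i32 with no other live borrow.
unsafe fn raw_to_mut<'a>(p: *mut i32) -> Option<&'a mut i32> {
    unsafe { p.as_mut() }
}

// 4. Raw pointer to Box; null is handled explicitly.
// Safety: `p` must be null or originate from `Box::into_raw`.
unsafe fn raw_to_box(p: *mut i32) -> Option<Box<i32>> {
    if p.is_null() {
        None
    } else {
        Some(unsafe { Box::from_raw(p) })
    }
}
```

Conversions 3 and 4 remain `unsafe` because the compiler cannot check the provenance of a raw pointer; the caller has to uphold the stated safety conditions.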

We also define a helper function that transforms safe pointers into raw pointers:

figure gj

Here, we explain for a argument (the explanation for is the same because of the polymorphic nature of ):

  1. To convert , we first mutably borrow the entire option as denoted by the mutable borrow argument of the helper function. This is needed because is not copyable, and it would be otherwise consumed;

  2. converts to ;

  3. converts the optional part of the reference into an option of raw pointers;

  4. Finally, returns the value of the option, or a null pointer if the value is None.
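The four steps above can be sketched as follows. This is one plausible reading, not the helper from the original figure: the name `as_raw_mut` and the use of `as_deref_mut`, `map`, and `unwrap_or` are assumptions. The sketch is generic over any `DerefMut` pointer, so the same body covers both an optional box and an optional mutable reference.

```rust
use std::ops::DerefMut;
use std::ptr;

// Hypothetical helper (the name is an assumption): turns an optional safe
// pointer into a raw pointer without consuming the option.
fn as_raw_mut<P: DerefMut>(p: &mut Option<P>) -> *mut P::Target
where
    P::Target: Sized,
{
    // Step 1: the option is only mutably borrowed (see the parameter type).
    p.as_deref_mut() // Step 2: Option<&mut T>
        .map(|r| r as *mut P::Target) // Step 3: Option<*mut T>
        .unwrap_or(ptr::null_mut()) // Step 4: *mut T, null when the option is None
}
```

Because the option is borrowed rather than moved, the safe pointer remains usable after the raw pointer has been handed to code that still expects raw pointers.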

Dereferences: When a pointer is dereferenced as part of a larger expression (e.g. ), we need an additional .

Box pointers check: Rust disallows the use of pointers after they have lost ownership. As this rule cannot be captured by the ownership analysis, such situations are detected at the translation stage, and the culpable pointers are reverted to raw pointers.

For brevity, we omitted the slightly different treatment of struct fields that are not of pointer type.

7 Challenges of Handling Real-World Code

We designed Crown to be able to analyse and translate real-world code, which poses significant challenges. In this section, we discuss some of the engineering challenges of Crown and its current limitations.

7.1 Preprocessing

During the transpilation of C libraries, c2rust treats each file as a separate compilation unit, which gets translated into a separate Rust module. Consequently, struct definitions are duplicated, and available function definitions are put in blocks [17]. We apply a preprocessing step similar to the resolve-imports tool of Laertes [17] that links those definitions across files.

7.2 Limitations of the Ownership Analysis

There are a few C constructs and idioms that are not fully supported by our implementation, for which Crown generates partial ownership constraints. Crown’s translation will attempt to rewrite a variable as long as there exists a constraint involving it. As a result, the translation is in theory neither sound nor complete: it may generate code that does not compile (though we have not observed this in practice for the benchmarks where Crown produces a result; see Sect. 8), and it may leave some pointers as raw pointers, resulting in a less than optimal translation. We list below the cases in which such a scenario may happen.

Certain Unsafe C Constructs. For type casts, we only generate ownership transfer constraints for head pointers; for unions we assume that they contain no pointer fields and consequently, we generate no constraints; similarly, we generate no constraints for variadic arguments. We noticed that unions and variadic arguments may cause our tool to crash (e.g. three of the benchmarks in [17], as mentioned in Sect. 8). Those crashes happen when analysing access paths that contain dereferences of union fields (where we assumed no pointer fields), and when analysing calls to functions with variadic arguments where a pointer is passed as argument.

Function Pointers. Crown does not generate any constraints for them.

Non-standard Memory Management in C Libraries. Certain C libraries wrap malloc and free, often with static function pointers (pointers to allocator/deallocator are stored in static variables), or function pointers in structs. Crown does not generate any constraints in such scenarios. In C, it is also possible to use malloc to allocate a large piece of memory, and then split it into several sub-regions assigned to different pointers. In our ownership analysis, only one pointer can gain ownership of the memory allocated by a call to malloc. Another C idiom that we don’t fully support occurs when certain pointers can point to either heap allocated objects, or statically allocated stack arrays. Crown generates ownership constraints only for the heap and, consequently, those variables will be left under-constrained.

7.3 Other Limitations of Crown

Array Pointers. For array pointers, although Crown infers the correct ownership information, it does not generate the metadata required to synthesise Rust code.

Mutable Local Borrows. As explained in the last paragraph of Sect. 6.1, Crown does not translate mutable non-owning pointers to local mutable references as this requires dedicated analysis of lifetimes. Note that Crown does however generate mutable references for output parameters.

Access Paths that Break Ownership Monotonicity. As discussed in Sect. 5.4, ownership monotonicity may be stricter in certain cases than Rust’s semantics.

8 Experimental Evaluation

We implement Crown on top of the Rust compiler, version nightly-2023-01-26. We use c2rust with version 0.16.0. For the SAT solver, we rely on a Rust-binding of z3 [20] with version 0.11.2. We run all our experiments on a MacBook Pro with an Apple M1 chip, with 8 cores (4 performance and 4 efficiency). The computer has 16 GB RAM and runs macOS Monterey 12.5.1.

Benchmark Selection. To evaluate the utility of Crown, we collected a benchmark suite of 20 programs (Table 1). These include benchmarks from the artifact [16] accompanying Laertes [17] (marked by * in Table 1; see Footnote 2), and additionally 8 real-world projects (binn, brotli, buffer, heman, json.h, libtree, lodepng, rgba) together with 4 commonly used data structure libraries (avl, bst, ht, quadtree).

Functional and Non-functional Guarantees. With respect to functional properties, we want the original program and the refactored program to be observationally equivalent, i.e. for each input they produce the same output. We empirically validated this using all the available test suites (i.e. for libtree, rgba, quadtree, urlparser, genann, buffer in Table 1). All the test suites continue to pass after the translation. For nonfunctional properties, we intend to preserve memory usage and CPU time, i.e. we don’t want our translation to introduce runtime overhead. We also validated this using the test suites.

Table 1. Benchmarks information

8.1 Research Questions

We aim at answering the following research questions.

  • RQ1. How many raw pointers/pointer uses can Crown translate to safe pointers/pointer uses?

  • RQ2. How does Crown’s result compare with the state-of-the-art [17]?

  • RQ3. What is the runtime performance of Crown?

RQ 1: Unsafe pointer reduction. In order to judge Crown’s efficacy, we measure the reduction rate of raw pointer declarations and uses. This is a direct indicator of the improvement in safety, as safe pointers are always checked by the Rust compiler (even inside unsafe regions). As previously mentioned, we focus on mutable non-array pointers. The results are presented in Table 2, where #ptrs counts the number of raw pointer declarations in a given benchmark, #uses counts the number of times raw pointers are used, and the Laertes and Crown headers denote the reduction rates of the number of raw pointers and raw pointer uses achieved by the two tools, respectively. For instance, for benchmark avl, the rate of 100% means that all raw pointer declarations and their uses are translated into safe ones. Note that the “-” symbols on the row corresponding to are due to the fact that the benchmark contains 0 raw pointer uses.
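The text does not spell out the formula behind the reduction rate. A minimal sketch, assuming the rate is the fraction of raw pointers (or raw pointer uses) that no longer remain after translation, expressed in percent; the function name `reduction_rate` is hypothetical.

```rust
// Reduction rate in percent. Returns None when there is nothing to reduce,
// which corresponds to a "-" entry in the results table.
// Assumption: rate = 100 * (before - after) / before.
fn reduction_rate(before: usize, after: usize) -> Option<f64> {
    if before == 0 {
        return None;
    }
    // saturating_sub guards against an (unexpected) increase in raw pointers.
    Some(100.0 * before.saturating_sub(after) as f64 / before as f64)
}
```

Under this reading, a benchmark whose 8 raw pointer declarations are reduced to 6 has a reduction rate of 25%, and a rate of 100% means that no raw pointers remain.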

The median reduction rates achieved by Crown for raw pointers and raw pointer uses are 37.3% and 62.1%, respectively. Crown achieves a 100% reduction rate for many non-trivial data structures (), as well as for . For brotli, a lossless data compression algorithm developed by Google, which is our largest benchmark, Crown achieves reduction rates of 21.4% and 20.9%, respectively. The relatively low reduction rates for and a few other benchmarks (tulipindicators, lodepng, bzip2, genann, libzahl) are due to their use of non-standard memory management strategies (discussed in detail in Sect. 7).

Notably, all the translated benchmarks compile under the aforementioned Rust compiler version. As a check of semantics preservation, for the benchmarks that provide test suites (), our translated benchmarks pass all the provided tests.

Table 2. Reduction of (mutable, non-array) raw pointer declarations and uses

RQ 2: Comparing with state-of-the-art. The comparison of Crown with Laertes [17] is also shown in Table 2, with bold font highlighting better results. The data on Laertes is either directly extracted from the artifact [16] or has been confirmed by the authors through private correspondence. We can see that Crown outperforms the state-of-the-art (often by a significant degree) in most cases, with lodepng being the only exception, where we suspect that the reason also lies with non-standard memory management strategies mentioned before. Laertes is less affected by this as it does not rely on ownership analysis.

RQ 3: Runtime performance. Although our analysis relies on solving a constraint satisfaction problem that is proven to be NP-complete, in practice the runtime performance of Crown is consistently high. The execution time of the analysis and the rewrite for the whole benchmark suite is within 60 s (where the execution time for our largest benchmark, brotli, is under 10 s).

9 Related Works

Ownership Discussion. Ownership has been used in OO programming to enable controlled aliasing by restricting object graphs underlying the runtime heap [11, 12], with efforts made in the automatic inference of ownership information [1, 4, 39], and applications of ownership to memory management [5, 42]. Similarly, the concept of ownership has also been applied to analyse C/C++ programs. Heine et al. [24] inferred pointer ownership information for detecting memory leaks. Ravitch et al. [37] apply static analysis to infer ownership for automatic library binding generation. Given the different application domains, each of these works makes different assumptions. Heine et al. [24] assume that indirectly-accessed pointers (i.e. any pointer accessed through a path, like (*p).next) cannot acquire ownership, whereas Ravitch et al. [37] assume that all struct fields are owning unless explicitly annotated. We took from [24] its handling of flow sensitivity, but enhanced it with the analysis of nested pointers and inductively defined data structures, which we found to be essential for translating real-world code. The analysis in [24] assigns a default “non-owning” status to all indirectly accessed pointers. This rules out many interesting data structures such as linked lists, trees, hash tables, etc., and commonly used idioms such as passing by reference. Conversely, in our work, we rely on a strengthening assumption about the Rust ownership model, which allows handling the aforementioned scenarios and data structures. Lastly, the idea of ownership is also broadly applied in concurrent separation logic [7, 8, 9, 19, 38]. However, these works are not intended as automatic ownership inference systems.

Rust Verification. The separation logic based reasoning framework Iris [28] was used to formalise the Rust type system [27], and verify Rust programs [34]. While these works cover unsafe Rust fragments, they are not fully automatic. When restricting reasoning to only safe Rust, RustHorn [35] gives a first-order logic formulation of the behavior of Rust code, which is amenable to fully automatic verification, while Prusti [3] leverages Rust compiler information to generate separation logic verification conditions that are discharged by Viper [36]. In the current work, we provide an automatic ownership analysis for unsafe Rust programs.

Type Qualifiers. Type qualifiers are a lightweight, practical mechanism for specifying and checking properties not captured by traditional type systems. A general flow-insensitive type qualifier framework has been proposed [21], with subsequent applications analysing Java reference mutability [22, 25] and C array bounds [32]. We adapted these works to Rust for our mutability and fatness analyses, respectively.

C to Rust Translation. We have already discussed c2rust [26], which is an industrial strength tool that converts C to Rust syntax. c2rust does not attempt to fix unsafe features such as raw pointers, and the programs it generates are always annotated as unsafe. Nevertheless, it forms the basis of other translation efforts. CRustS [31] applies AST-based code transformations to remove superfluous unsafe labelling generated by c2rust, but it does not fix the unsafe features either. Laertes [17] is the first tool that is actually able to automatically reduce the presence of unsafe code. It uses the Rust compiler as a blackbox oracle and searches for code changes that remove raw pointers, which is different from Crown’s approach (see Sect. 8 for an experimental comparison). The subsequent work [15] develops an evaluation methodology for studying the limitations of existing techniques that translate unsafe raw pointers to safe Rust references. The work adopts a new concept of ‘pseudo safety’, under which semantics preservation of the original programs is no longer guaranteed. As explained in Sect. 8, in our work, we aim to maintain semantic equivalence.

10 Conclusion

We devised an ownership analysis for Rust programs translated by c2rust that is scalable (handling half a million LOC in less than 10 s) and precise (handling inductive data structures) thanks to a strengthening of the Rust ownership model, which we call ownership monotonicity. Based on this new analysis, we prototyped a refactoring tool for translating C programs into Rust programs. Our experimental evaluation shows that the proposed approach handles real-world benchmarks and outperforms the state-of-the-art.