Achieving Safety Incrementally with Checked C



Introduction
Vulnerabilities that compromise memory safety are at the heart of many attacks. Spatial safety, one aspect of memory safety, is ensured when any pointer dereference is always within the memory allocated to that pointer. Buffer overruns violate spatial safety, and still constitute a common cause of vulnerability. During 2012-2018, buffer overruns were the source of 9.7% to 18.4% of CVEs reported in the NIST vulnerability database [28], constituting the leading single cause of CVEs.
The source of memory unsafety starts with the language definitions of C and C++, which render out-of-bounds pointer dereferences "undefined." Traditional compilers assume they never happen. Many efforts over the last 20 years have aimed for greater assurance by proving that accesses are in bounds, and/or preventing out-of-bounds accesses from happening via inserted dynamic checks [26, 25, 30, 3, 15, 1, 2, 4, 7, 6, 8-10, 12, 5, 16, 22, 18]. This paper focuses on Checked C, a new language design for a memory-safe C [11], currently focused on spatial safety. Checked C draws substantial inspiration from prior safe-C efforts but differs in two key ways, both of which focus on backward compatibility with, and incremental improvement of, regular C code.
Mixing checked and legacy pointers. First, as outlined in Section 2, Checked C permits intermixing checked (safe) pointers and legacy pointers. The former come in three varieties: pointers to single objects, Ptr<τ>; pointers to arrays, Array_ptr<τ>; and pointers to NUL-terminated arrays, Nt_array_ptr<τ>. The latter two have an associated clause that describes their known length in terms of constants and other program variables. The specified length is used either to prove that pointer dereferences are safe or, barring that, as the basis of dynamic checks inserted by the compiler. Importantly, checked pointers are represented as in normal C: no changes to pointer structure (e.g., "fattening" a pointer to include its bounds) are imposed. As such, interoperation with legacy C is eased. Moreover, the fact that checked and legacy pointers can be intermixed in the same module eases the porting process, including porting via automated tools. For example, CCured [27] works by automatically classifying existing pointers and compiling them for safety. This classification is necessarily conservative. For example, if a function f(p) is mostly called with safe pointers, but once with an unsafe one (e.g., a "wild" pointer in CCured parlance, perhaps constructed from an int), then the classification of p as unsafe will propagate backwards, poisoning the classification of the safe pointers, too. The programmer will be forced to change the code and/or pay a higher cost for added (but unnecessary) run-time checks.
On the other hand, in the Checked C setting, if a function uses a pointer safely then its parameter can be typed that way. It is then up to a caller whose pointer arguments cannot also be made safe to insert a local cast. Section 5 presents a preliminary, whole-program analysis that utilizes the extra flexibility afforded by mixing pointers to partially convert a C program to a Checked C program. On a benchmark suite of five programs totaling more than 200K LoC, we find that thousands of pointer locations are made more precise than they would have been under a more conservative algorithm like that of CCured.
Avoiding blame with checked regions. An important question is what "safety" means in a program with a mix of checked and unchecked pointers. In such a program, safety violations are still possible. How, then, does one assess that a program is safer due to checking some, but not all, of its pointers? Providing a formal answer to this question constitutes the core contribution of this paper.
Unlike past safe-C efforts, Checked C specifically distinguishes parts of the program that are, and that may not be, fully "safe." So-called checked regions differ from unchecked ones in that they can only use checked pointers: dereference or creation of unchecked pointers, unsafe casts, and other potentially dangerous constructs are disallowed. Using a core calculus for Checked C programs called CoreChkC, defined in Section 3, we prove in Section 4 that these restrictions are sufficient to ensure that checked code cannot be blamed. That is, checked code is internally safe, and any run-time failure can be attributed to unchecked code, even if that failure occurs in a checked region. This proof has been fully formalized in the Coq proof assistant. Our theorem fills a gap in the literature on migratory typing for languages that, like Checked C, use an erasure semantics, meaning that no extra dynamic checks are inserted at checked/unchecked code boundaries [14]. Moreover, our approach is lighter weight than the more sophisticated techniques used by the RustBelt project [17], and constitutes a simpler first step toward a safe, mixed-language design. We say more in Section 6.

Overview of Checked C
We begin by presenting a brief overview of Checked C, using the example in Figure 1. For more about the language see Elliott et al. [11].
Checked pointers. As mentioned in the introduction, Checked C supports three varieties of checked (safe) pointers: pointers to single objects, Ptr<τ>; pointers to arrays, Array_ptr<τ>; and pointers to NUL-terminated arrays, Nt_array_ptr<τ>. The dat field of struct buf, defined in Figure 1(b), is an Array_ptr<char>; its length is specified by the sz field in the same struct, as indicated by the count annotation. Nt_array_ptr<τ> types are similar. The q argument to the alloc_buf function in Figure 1(c) is a Ptr<struct buf>. This function overwrites the contents of q with those in the second argument src, an array whose length is specified by the third argument, len. Variables with checked pointer types or containing checked pointers must be initialized when they are declared.
Checked arrays. Checked C also supports a checked array type, which is designated by prefixing the dimension of an array declaration with the keyword Checked. For example, int arr Checked[5] declares a 5-element integer array where accesses are always bounds checked. A checked array of τ implicitly converts to an Array_ptr<τ> when accessing it. In our example, the array region has an unchecked array type because the Checked keyword is omitted.
Checked and unchecked regions. Returning to alloc_buf: if q→dat is too small (len > q→sz) to hold the contents of src, the function allocates a block from the static region array, whose free area starts at index idx. Designating a checked Array_ptr<char> from a pointer into the middle of the (unchecked) region array is not allowed in checked code, so it must be done within the designated Unchecked block. Within such blocks the programmer has the full freedom of C, along with the ability to create and use checked pointers. Checked code, as designated by the Checked annotation (e.g., as on the alloc_buf function or on a block nested within unchecked code), may not use unchecked pointers or arrays. It also may not define or call functions without prototypes or variable-argument functions.
Interface types. Once alloc_buf has allocated q→dat it calls copy to transfer the data into it, from src. Checked C permits normal C functions, such as those in an existing library, to be given an interface type. This is the type that Checked C code should use in a checked region. In an unchecked region, either the original type or the interface type may be used. This allows the function to be called with unchecked types or checked types. For copy, this type is shown in Figure 1(a).
Interface types can also be attached to definitions within a Checked C file, not just prototypes declared for external libraries. Doing so permits the same function to be called from an unchecked region (with either checked or unchecked types) or a checked region (where it will always have the checked type). For example, if we wanted alloc_buf to be callable from unchecked code with unchecked pointers, we could give the pointer parameters in its prototype interface types.
Implementation details. Checked C is implemented as an extension to the Clang/LLVM compiler. The compiler inserts run-time checks for the evaluation of lvalue expressions whose results are derived from checked pointers and that will be used to access memory. Accessing a Ptr<τ> requires a null check, while accessing an Array_ptr<τ> requires both null and bounds checks. The code for these checks is handed to LLVM, which we allow to remove checks if it can prove they will always pass. Preliminary experiments on some small, pointer-intensive benchmarks show running time overhead to be around 8.6%, on average [11].

Formalism: CoreChkC
This section presents a formal language CoreChkC that models the essence of Checked C. The language is designed to be simple but nevertheless highlight Checked C's key features: checked and unchecked pointers, and checked and unchecked code blocks. We prove our key theoretical result, that checked code cannot be blamed for a spatial safety violation, in the next section.

Syntax
The syntax of CoreChkC is presented in Figure 2. Types τ classify word-sized objects while types ω also include multi-word objects. The type $\mathtt{ptr}^{m}\ \omega$ types a pointer, where m identifies its mode: mode c identifies a Checked C safe pointer, while mode u represents an unchecked pointer. In other words, $\mathtt{ptr}^{c}\ \tau$ is a checked pointer type Ptr<τ> while $\mathtt{ptr}^{u}\ \tau$ is an unchecked pointer type τ*.
Multi-word types ω include struct records, and arrays of type τ having size n, i.e., $\mathtt{ptr}^{c}\ (\mathtt{array}\ n\ \tau)$ represents a checked array pointer type Array_ptr<τ> with bounds n. We assume structs are defined separately in a map D from struct names to their constituent field definitions. Programs are represented as expressions e; we have no separate class of program statements, for simplicity. Expressions include (unsigned) integers $n^{\tau}$ and local variables x. Constant integers n are annotated with type τ to indicate their intended type. As in an actual implementation, pointers in our formalism are represented as integers. Annotations help formalize type checking and the safety property it provides; they have no effect on the semantics except when τ is a checked pointer, in which case they facilitate null and bounds checks. Variables x, introduced by let-bindings $\mathtt{let}\ x = e_1\ \mathtt{in}\ e_2$, can only hold word-sized objects, so all structs can only be accessed by pointers.
Checked pointers are constructed using malloc@ω, where ω is the type (and size) of the allocated memory. Thus, malloc@int produces a pointer of type $\mathtt{ptr}^{c}\ \mathtt{int}$ while malloc@(array 10 int) produces one of type $\mathtt{ptr}^{c}\ (\mathtt{array}\ 10\ \mathtt{int})$.
Unchecked pointers can only be produced by the cast operator, (τ)e, e.g., by doing ($\mathtt{ptr}^{u}\ \mathtt{int}$)malloc@int. Casts can also be used to coerce between integer and pointer types and between different multi-word types.
Pointers are read via the * operator, and assigned to via the = operator. To read or write struct fields, a program can take the address of that field and read or write that address, e.g., x→f is equivalent to *(&x→f). To read or write an array, the programmer can use pointer arithmetic to access the desired element.
By default, CoreChkC expressions are assumed to be checked. Expression e in unchecked e is unchecked, giving it additional freedom: checked pointers may be created via casts, and unchecked pointers may be read or written.
Design Notes. CoreChkC leaves out many interesting C language features. We do not include an operation for freeing memory, since this paper is concerned with spatial safety, not temporal safety. CoreChkC models statically sized arrays but supports dynamic indexes; supporting dynamic sizes is interesting but not meaningful enough to justify the complexity it would add to the formalism. Making ints unsigned simplifies handling pointer arithmetic. We do not model control operators or function calls, whose addition would be straightforward. CoreChkC does not have a checked e expression for nesting within unchecked expressions, but supporting it would be easy.

Semantics
Figure 4 defines the small-step operational semantics for CoreChkC expressions in the form of the judgment $H; e \longrightarrow^{m} H'; r$. Here, H is a heap, which is a partial map from integers (representing pointer addresses) to type-annotated integers $n^{\tau}$. Annotation m is the mode of evaluation, which is either c for checked mode or u for unchecked mode. Finally, r is a result, which is either an expression e, Null (indicating a null pointer dereference), or Bounds (indicating an out-of-bounds array access). An unsafe program execution occurs when the expression reaches a stuck state: the program is not an integer $n^{\tau}$, and yet no rule applies. Notably, this could happen if trying to dereference a pointer n that is actually invalid, i.e., H(n) is undefined.
The semantics is defined in the standard manner using evaluation contexts E. We write $E[e_0]$ to mean the expression that results from substituting $e_0$ into the "hole" of context E. Rule C-Exp defines normal evaluation. It decomposes an expression e into a context E and expression $e_0$ and then evaluates the latter via $H; e_0 \rightsquigarrow H'; e_0'$, discussed below. The evaluation mode m is constrained by the mode(E) function, also given in Figure 4. The rule and this function ensure that when evaluation occurs within e in some expression unchecked e, then it does so in unchecked mode u; otherwise it may be in checked mode c. Rule C-Halt halts evaluation due to a failed null or bounds check.
The rules prefixed with E- are those of the computation semantics $H; e_0 \rightsquigarrow H'; e_0'$. The semantics is implicitly parameterized by struct map D. The rest of this section provides additional details for each rule, followed by a discussion of CoreChkC's type system.
Rule E-Binop produces an integer $n_3$ that is the sum of arguments $n_1$ and $n_2$. As mentioned earlier, the annotations τ on literals $n^{\tau}$ indicate the type the program has ascribed to n. When a type annotation is not a checked pointer, the semantics ignores it. In the particular case of E-Binop, for example, addition $n_1^{\tau_1} + n_2^{\tau_2}$ ignores $\tau_1$ and $\tau_2$ when $\tau_1$ is not a checked pointer, and simply annotates the result with $\tau_1$. However, when τ is a checked pointer, the rules use it to model bounds checks; in particular, dereferencing $n^{\tau}$ where τ is $\mathtt{ptr}^{c}\ (\mathtt{array}\ l\ \tau_0)$ produces Bounds when l = 0 (more below). As such, when $n_1$ is a non-zero, checked pointer to an array and $n_2$ is an int, result $n_3$ is annotated as a pointer to an array with its bounds suitably updated. Checked pointer arithmetic on 0 is disallowed; see below.
Rules E-Deref and E-Assign confirm the bounds of checked array pointers: the length l must be positive for the dereference to be legal. The rules permit the program to proceed for non-checked or non-array pointers (but the type system will forbid them).
Rule E-Amper takes the address of a struct field, according to the type annotation on the pointer, as long as the pointer is non-zero or is not checked.
Rule E-Malloc allocates a checked pointer by finding a string of free heap locations and initializing each to 0, annotated with the appropriate type. Here, types(D, ω) returns k types, where these are the types of the corresponding memory words; e.g., if ω is a struct then these are the types of its fields (looked up in D), while if ω is an array of length k containing values of type τ, then we will get back k τ's. We require k ≠ 0 or the program is stuck (a situation precluded by the type system).
Rule E-Let uses a substitution semantics for local variables; notation $e[x \mapsto n^{\tau}]$ means that all occurrences of x in e should be replaced with $n^{\tau}$.
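As an illustration of the allocation rule just described, the following is a minimal Python sketch, not the paper's Coq development. The class names, the tuple encoding of heap cells as (value, type), the field name val, and the choice of the next address above the current maximum as the "free" block are all assumptions made for the example. It shows types(D, ω) flattening a struct or array into per-word types, and malloc zero-initializing one annotated cell per word.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Struct:
    name: str

@dataclass(frozen=True)
class Array:
    length: int
    elem: object

@dataclass(frozen=True)
class Ptr:
    mode: str       # "c" for checked, "u" for unchecked
    target: object  # "int", a Ptr, a Struct, or an Array

def types(D, omega):
    """Return the word types of the memory cells that omega occupies."""
    if isinstance(omega, Struct):
        return [field_ty for (field_ty, _field_name) in D[omega.name]]
    if isinstance(omega, Array):
        return [omega.elem] * omega.length
    return [omega]  # a word-sized type occupies a single cell

def malloc(heap, D, omega):
    """Sketch of E-Malloc: allocate k zeroed, type-annotated cells."""
    cell_types = types(D, omega)
    if len(cell_types) == 0:
        # The text says the program is stuck here; we model that as an error.
        raise RuntimeError("stuck: zero-sized allocation")
    base = max(heap, default=0) + 1  # assumed allocator: never returns address 0
    new_heap = dict(heap)
    for i, ty in enumerate(cell_types):
        new_heap[base + i] = (0, ty)
    return new_heap, (base, Ptr("c", omega))

# Hypothetical struct map: node has an int field followed by a checked next pointer.
NODE_PTR = Ptr("c", Struct("node"))
D = {"node": [("int", "val"), (NODE_PTR, "next")]}
heap1, p = malloc({}, D, Struct("node"))
heap2, q = malloc(heap1, D, Array(3, "int"))
```

The functional style (returning a new heap rather than mutating the old one) mirrors how the judgment relates H to H', but it is a presentation choice of this sketch.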
Rule E-Unchecked returns the result of an unchecked block.
Rules with prefix X- describe failures due to bounds checks and null checks on checked pointers. These are analogues of rules including E-Assign, E-Deref, and E-Binop.
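The interplay between E-Binop, E-Deref, and their failing X- analogues can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the rules of Figure 4: struct types are omitted, the bounds update uses truncated subtraction (an assumption about how "suitably updated" behaves for unsigned integers), the null check is tried before the bounds check, arithmetic on a zero pointer yields Null only for checked array pointers, and the marker Stuck is a hypothetical stand-in for "no rule applies".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Array:
    length: int
    elem: object

@dataclass(frozen=True)
class Ptr:
    mode: str       # "c" for checked, "u" for unchecked
    target: object  # "int", a Ptr, or an Array

@dataclass(frozen=True)
class Lit:
    n: int          # the integer value (pointers are integers)
    ty: object      # its type annotation

NULL, BOUNDS, STUCK = "Null", "Bounds", "Stuck"

def is_checked(ty):
    return isinstance(ty, Ptr) and ty.mode == "c"

def deref(heap, lit):
    """Sketch of E-Deref and its failing analogues."""
    if is_checked(lit.ty):
        if lit.n == 0:
            return NULL
        if isinstance(lit.ty.target, Array) and lit.ty.target.length == 0:
            return BOUNDS
    if lit.n not in heap:
        return STUCK  # H(n) undefined: no rule applies
    return heap[lit.n]

def binop(l1, l2):
    """Sketch of E-Binop: add, updating bounds of a checked array pointer."""
    ty = l1.ty
    if is_checked(ty) and isinstance(ty.target, Array):
        if l1.n == 0:
            return NULL  # checked pointer arithmetic on 0 is disallowed
        remaining = max(ty.target.length - l2.n, 0)  # assumed truncated subtraction
        ty = Ptr("c", Array(remaining, ty.target.elem))
    return Lit(l1.n + l2.n, ty)

heap = {100: Lit(7, "int"), 101: Lit(8, "int")}
p = Lit(100, Ptr("c", Array(2, "int")))
q = binop(p, Lit(1, "int"))   # one element left in bounds
r = binop(q, Lit(1, "int"))   # zero elements left: dereference must fail
```

Because the remaining length travels in the annotation rather than in the pointer's run-time representation, the sketch reflects the point that annotations affect the semantics only for checked pointers.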

Typing
The typing judgment $\Gamma; \sigma \vdash_m e : \tau$ says that expression e has type τ under environment Γ and scope σ when in mode m. A scope σ is an additional environment consisting of a set of literals; it is used to type cyclic structures (in Rule T-PtrC, below) that may arise during program evaluation. The heap H and struct map D are implicit parameters of the judgment; they do not appear because they are invariant in derivations. An unchecked expression is typed in mode u; otherwise we may use either mode.
Γ maps variables x to types τ, and is used in rules T-Var and T-Let as usual. Rule T-Base ascribes type τ to literal $n^{\tau}$. This is safe when τ is int (always). If τ is an unchecked pointer type, a dereference is only allowed by the type system to be in unchecked code (see below), and as such any sort of failure (including a stuck program) is not a safety violation. When n is 0 then τ can be anything, including a checked pointer type, because dereferencing n would (safely) produce Null. Finally, if τ is $\mathtt{ptr}^{c}\ (\mathtt{array}\ 0\ \tau)$ then dereferencing n would (safely) produce Bounds.
Rule T-PtrC is perhaps the most interesting rule of CoreChkC. It ensures checked pointers of type $\mathtt{ptr}^{c}\ \omega$ are consistent with the heap, by confirming the pointed-to heap memory has types consistent with ω, recursively. When doing this, we extend σ with $n^{\tau}$ to properly handle cyclic heap structures; σ is used by Rule T-VConst.
To make things more concrete, consider a program that constructs a cyclic cons cell, using a standard singly linked list representation: it allocates a struct node (an int field followed by a next field of type $\mathtt{ptr}^{c}\ \mathtt{struct\ node}$), binds the resulting pointer to p, and then stores p into its own next field. After executing this program, the heap would look something like the following, where n is the integer value of p. That is, the n-th location of the heap contains 0 (the default value picked by malloc), while the (n + 1)-th location, which corresponds to &n→next, contains the literal n.
Loc n: $0^{\mathtt{int}}$
Loc n + 1: $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}}$
How can we type the pointer $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}}$ in this heap without getting an infinite typing judgment? That is, how do we derive $\Gamma; \sigma \vdash_c n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}} : \mathtt{ptr}^{c}\ \mathtt{struct\ node}$? That's where the scope comes in, to break the recursion. In particular, using Rule T-PtrC and struct node's definition, we would need to prove two things: $\Gamma; \sigma, n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}} \vdash_c H(n + 0) : \mathtt{int}$ and $\Gamma; \sigma, n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}} \vdash_c H(n + 1) : \mathtt{ptr}^{c}\ \mathtt{struct\ node}$. Since H(n + 0) = 0, as malloc zeroes out its memory, we can trivially prove the first goal using Rule T-Base. However, the second goal is almost exactly what we set out to prove in the first place! If not for the presence of the scope σ, the proof that n is typeable would be infinite! However, by adding $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}}$ to the scope, we are essentially assuming it is well-typed to type its contents, and the desired result follows by Rule T-VConst.
A key feature of T-PtrC is that it effectively confirms that all pointers reachable from the given one are consistent; it says nothing about other parts of the heap. So, if a set of checked pointers is only reachable via unchecked pointers then we are not concerned whether they are consistent, since they cannot be directly dereferenced by checked code.
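The role of the scope can be made concrete with a small Python checker for literals. It is a sketch of the ideas behind T-Base, T-VConst, and T-PtrC as described above, not a transcription of the typing rules: the encoding of heap cells as (value, type) pairs, the requirement that a cell's annotation equal the expected field type, the field name val, and the address 40 are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Struct:
    name: str

@dataclass(frozen=True)
class Array:
    length: int
    elem: object

@dataclass(frozen=True)
class Ptr:
    mode: str       # "c" for checked, "u" for unchecked
    target: object  # "int", a Ptr, a Struct, or an Array

def types(D, omega):
    """Word types of the cells occupied by omega."""
    if isinstance(omega, Struct):
        return [field_ty for (field_ty, _field_name) in D[omega.name]]
    if isinstance(omega, Array):
        return [omega.elem] * omega.length
    return [omega]

def well_typed(heap, D, n, ty, scope=frozenset()):
    """Is the literal n annotated with ty consistent with the heap?"""
    # T-Base: ints, null, unchecked pointers, and zero-length arrays are always fine.
    if ty == "int" or n == 0:
        return True
    if ty.mode == "u":
        return True
    if isinstance(ty.target, Array) and ty.target.length == 0:
        return True
    # T-VConst: the literal is already assumed well-typed in the scope.
    if (n, ty) in scope:
        return True
    # T-PtrC: check every pointed-to cell, with (n, ty) added to the scope.
    inner = scope | {(n, ty)}
    for i, cell_ty in enumerate(types(D, ty.target)):
        cell = heap.get(n + i)
        if cell is None or cell[1] != cell_ty:
            return False
        if not well_typed(heap, D, cell[0], cell_ty, inner):
            return False
    return True

# The cyclic cons cell from the text, at a hypothetical address n = 40.
NODE_PTR = Ptr("c", Struct("node"))
D = {"node": [("int", "val"), (NODE_PTR, "next")]}
n = 40
cyclic_heap = {n: (0, "int"), n + 1: (n, NODE_PTR)}
ok = well_typed(cyclic_heap, D, n, NODE_PTR)
```

Without the scope check, the recursive call on the next field would ask the same question about n again and never return. With it, each recursive T-PtrC step adds a literal not yet in the scope, so on a finite heap the check terminates.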
Back to the remaining rules, T-Amper and T-BinopInt are unsurprising. Rule T-Malloc produces checked pointers so long as the pointed-to type ω is not zero-sized, i.e., is not array 0 τ. Rule T-Unchecked introduces unchecked mode, relaxing access rules. Rule T-Cast enforces that checked pointers cannot be cast targets in checked mode.
Rules T-Deref and T-Assign type pointer accesses. These rules require that unchecked pointers only be dereferenced in unchecked mode. Rule T-Index permits reading a computed pointer to an array, and rule T-IndAssign permits writing to one. These rules are not strong enough to permit updating a pointer to an array after performing arithmetic on it. In general, Checked C's design permits overcoming such limitations through selective use of casts in unchecked code. (That said, our implementation is more flexible in this particular case.)

Checked Code Cannot be Blamed
Our main formal result is that well-typed programs will never fail with a spatial safety violation that is due to a checked region of code, i.e., checked code cannot be blamed. This section presents the main result and outlines its proof. We have mechanized the full proof using the Coq proof assistant. The development is roughly 3500 lines long, including comments. We can make the development available upon request (and will release it publicly).

Progress and Preservation
The blame theorem is proved using the two standard syntactic type-safety notions of Progress and Preservation, adapted for CoreChkC. Progress indicates that a well-typed program either is a value, can take a step (in either mode), or else is stuck in unchecked code. A program is in unchecked mode if its expression e only type checks in mode u, or its (unique) context E has mode u.
Theorem 1 (Progress). If $\cdot \vdash_m e : \tau$ (under heap H) then one of the following holds:
- e is an integer $n^{\tau}$.
- There exist $H'$, $m'$, and r such that $H; e \longrightarrow^{m'} H'; r$, where r is either some $e'$, Null, or Bounds.
- $m = u$, or $e = E[e'']$ and mode(E) = u for some E, $e''$.
Preservation indicates that if a well-typed program in checked mode takes a checked step then the resulting program is also well-typed in checked mode.
Theorem 2 (Preservation). If $\Gamma; \cdot \vdash_c e : \tau$ (under a heap H) and $H; e \longrightarrow^{c} H'; r$ (for some $H'$, r), then $r = e'$ implies $H \triangleright H'$ and $\Gamma; \cdot \vdash_c e' : \tau$ (under heap $H'$).
We write $H \triangleright H'$ to mean that for all $n^{\tau}$, if $\cdot \vdash_c n^{\tau} : \tau$ under H then $\cdot \vdash_c n^{\tau} : \tau$ under $H'$ as well.
The proofs of both theorems are by induction on the typing derivation. The Preservation proof is the most delicate, particularly ensuring $H \triangleright H'$ despite the creation or modification of cyclic data structures. Crucial to the proof were two lemmas dealing with the scope, weakening and strengthening.
The first lemma, scope weakening, allows us to arbitrarily extend a scope with any literal $n_0^{\tau_0}$.
Intuitively, this lemma holds because if a proof of $\Gamma; \sigma \vdash_m n^{\tau} : \tau$ relies on the rule T-VConst, then $n_1^{\tau_1} \in \sigma$ for some $n_1^{\tau_1}$. But then $n_1^{\tau_1} \in (\sigma, n_0^{\tau_0})$ as well. Importantly, the scope σ is a set of $n^{\tau}$ and not a map from n to τ. As such, if $n^{\tau}$ is already present in σ, adding $n^{\tau_0}$ will not clobber it. Allowing the same literal to have multiple types is practically important. For example, a pointer n to a struct could be annotated with the type of the struct, or the type of the first field of the struct, or int; all may safely appear in the environment.
Consider the proof that $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}}$ is well typed for the heap given in Section 3.3. After applying Rule T-PtrC, we used the fact that $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}} \in (\sigma, n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}})$ to prove that the next field of the struct is well typed. If we were to replace σ with another scope $\sigma, n_0^{\tau_0}$ for some typed literal $n_0^{\tau_0}$ (and as a result any scope that is a superset of σ), the inclusion $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}} \in (\sigma, n_0^{\tau_0}, n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}})$ still holds and our pointer is still well-typed.
Conversely, the second lemma, scope strengthening, allows us to remove a literal from a scope, if that literal is well typed in an empty scope.

Lemma 2 (Strengthening). If
Informally, if the fact that $n_2^{\tau_2}$ is in the scope is used in the proof of well-typedness of $n_1^{\tau_1}$ to prove that $n_2^{\tau_2}$ is well-typed for some scope σ, then we can just use the proof that it is well-typed in an empty scope, along with weakening, to reach the same conclusion.
Looking back again at the proof of the previous section, we know that $\Gamma; \cdot \vdash_c n : \mathtt{ptr}^{c}\ \mathtt{struct\ node}$ and $\Gamma; \sigma, n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}} \vdash_c \&n{\to}\mathtt{next} : \mathtt{ptr}^{c}\ \mathtt{struct\ node}$. While the proof of the latter fact relies on $n^{\mathtt{ptr}^{c}\ \mathtt{struct\ node}}$ being in scope, that would not be necessary if we knew (independently) that it was well-typed. That would essentially amount to unrolling the proof by one step.
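For reference, one plausible formal reading of the two scope lemmas is given below. These statements are reconstructed from the informal descriptions in this section and are an assumption, not the exact statements from the Coq development; in particular, the use of set difference to express removal from the scope is a notational choice made here.

```latex
% Weakening (assumed form): a typing derivation survives extending the scope.
\text{If } \Gamma; \sigma \vdash_m n^{\tau} : \tau
\text{ then } \Gamma; \sigma, n_0^{\tau_0} \vdash_m n^{\tau} : \tau .

% Strengthening (assumed form): a literal that is well typed in the empty
% scope can be dropped from the scope of another derivation.
\text{If } \Gamma; \sigma \vdash_m n_1^{\tau_1} : \tau_1
\text{ and } \Gamma; \cdot \vdash_m n_2^{\tau_2} : \tau_2
\text{ then } \Gamma; \sigma \setminus \{ n_2^{\tau_2} \} \vdash_m n_1^{\tau_1} : \tau_1 .
```

Read this way, strengthening follows the informal argument above: wherever the original derivation used T-VConst on $n_2^{\tau_2}$, substitute the empty-scope derivation, weakened to the scope in use at that point.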

Blame
With progress and preservation we can prove a blame theorem: Only unchecked code can be blamed as the ultimate reason for a stuck program.
Theorem 3 (Checked code cannot be blamed).
This theorem means that a code reviewer can focus on unchecked code regions, trusting that checked ones are safe.

Automatic Porting
Porting legacy code to use Checked C's features can be tedious and time consuming. To assist the process, we developed a source-to-source translator called checked-c-convert that discovers some safely-used pointers and rewrites them to be checked. This algorithm is based on one used by CCured [27], but exploits Checked C's allowance of mixing checked and unchecked pointers to make less conservative decisions.
The checked-c-convert translator works by (1) traversing a program's abstract syntax tree (AST) to generate constraints based on pointer variable declaration and use; (2) solving those constraints; and (3) rewriting the program. These rewrites consist of promoting some declared pointer types to be checked, some parameter types to be bounds-safe interfaces, and inserting some casts. checked-c-convert aims to produce a well-formed Checked C program whose changes from the original are minimal and unsurprising. A particular challenge is to preserve the syntactic structure of the program. A rewritten program should be recognizable by the author and it should be usable as a starting point for both the development of new features and additional porting. The checked-c-convert tool is implemented as a clang libtooling application and is freely available.

Constraint logic and solving
The basic approach is to infer a qualifier q_i for each defined pointer variable i. Inspired by CCured's approach [27], qualifiers can be one of PTR, ARR, and UNK, ordered as a lattice PTR < ARR < UNK. Those variables with inferred qualifier PTR can be rewritten into Ptr<τ> types, while those with UNK are left as is. Those with the ARR qualifier are eligible to have Array_ptr<τ> type.
For the moment we only signal this fact in a comment and do not rewrite because we cannot always infer proper bounds expressions.
Qualifiers are introduced at each pointer variable declaration, i.e., parameter, variable, field, etc. Constraints are introduced as a pointer is used, and take one of the following forms:
- An expression that performs arithmetic on a pointer with qualifier q_i, either via + or [], introduces a constraint q_i = ARR.
- Assignments between pointers introduce aliasing constraints of the form q_i = q_j.
- Casts introduce implication constraints based on the relationship between the sizes of the two types. If the sizes are not comparable, then both constraint variables in an assignment-based cast are constrained to UNK via an equality constraint.
One difference from CCured is the use of negation constraints, which are used to fix a constraint variable to a particular Checked C type (e.g., due to an existing Ptr<τ> annotation). These would cause problems for CCured, as they might introduce unresolvable conflicts. But Checked C's allowance of checked and unchecked code can resolve them using explicit casts and bounds-safe interfaces, as discussed below.
One problem with unification-based analysis is that a single unsafe use might "pollute" the constraint system by introducing an equality constraint to UNK that transitively constrains unified qualifiers to UNK as well. For example, casting a struct pointer to an unsigned char buffer to write to the network would cause all transitive uses of that pointer to be unchecked. The tool takes advantage of Checked C's ability to mix checked and unchecked pointers to solve this problem. In particular, constraints for each function are solved locally, using separate qualifier variables for each external function's declared parameters.
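The pollution problem and the effect of separating qualifier variables can be illustrated with a small union-find solver in Python. This is a simplified sketch, not the checked-c-convert implementation: it handles only equality constraints and constant constraints (treated as lower bounds whose least solution is reported), omits implication and negation constraints, and all class, method, and variable names are hypothetical.

```python
PTR, ARR, UNK = 0, 1, 2  # the lattice PTR < ARR < UNK, encoded as integers
NAMES = {PTR: "PTR", ARR: "ARR", UNK: "UNK"}

class Solver:
    """Unification over qualifier variables with a per-class lattice value."""

    def __init__(self):
        self.parent = {}  # union-find parent links
        self.bound = {}   # least qualifier known for each class root

    def var(self, q):
        if q not in self.parent:
            self.parent[q] = q
            self.bound[q] = PTR  # every variable starts at the safest qualifier
        return q

    def find(self, q):
        self.var(q)
        while self.parent[q] != q:
            self.parent[q] = self.parent[self.parent[q]]  # path halving
            q = self.parent[q]
        return q

    def at_least(self, q, qual):
        """Constant constraint, e.g. pointer arithmetic forces ARR."""
        root = self.find(q)
        self.bound[root] = max(self.bound[root], qual)

    def equate(self, q1, q2):
        """Aliasing constraint q1 = q2: merge the two classes."""
        r1, r2 = self.find(q1), self.find(q2)
        if r1 != r2:
            self.parent[r2] = r1
            self.bound[r1] = max(self.bound[r1], self.bound[r2])

    def solution(self):
        return {q: NAMES[self.bound[self.find(q)]] for q in list(self.parent)}

# One shared variable p for a function parameter: an unsafe argument y
# drags the parameter, and everything unified with it, to UNK.
shared = Solver()
shared.equate("y", "p")
shared.at_least("y", UNK)
whole_program = shared.solution()

# Separate variables for the declaration (used at call sites) and the
# definition (used in the body): the body's view of the parameter stays PTR.
split = Solver()
split.equate("y", "p_decl")
split.at_least("y", UNK)
split.var("p_def")
modular = split.solution()
```

In the second run the declaration and definition disagree (UNK versus PTR), which is exactly the kind of conflict the tool later resolves with casts or bounds-safe interfaces.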

Algorithm
Our modular algorithm runs as follows:
1. The AST for every compilation unit is traversed and constraints are generated based on the uses of pointer variables. Each pointer variable x that appears at a physical location in the program is given a unique constraint variable q_i at the point of declaration. Uses of x are identified with the constraint variable created at the point of declaration. A distinction is made for parameter and return variables depending on whether the associated function is a declaration or a definition:
   - Declaration: There may be multiple declarations. The constraint variables for the parameters and return values in the declarations are all constrained to be equal to each other. At call sites, the constraint variables used for a function's parameters and return values come from those in the declaration, not the definition (unless there is no declaration).
   - Definition: There will only be one definition. These constraint variables are not constrained to be equal to the variables in the declarations. This enables modular (per function) reasoning.
2. After the AST is traversed, the constraints are solved using a fast, unification-focused algorithm [27]. The result is a set of satisfying assignments for constraint variables q_i.
3. Then, the AST is re-traversed. At each physical location associated with a constraint variable, a re-write decision is made based on the value of the constraint variable. These physical locations are variable declaration statements, either as members of a struct, function variable declarations, or parameter variable declarations. There is a special case, which is any constraint variable appearing at a parameter position, either at a function declaration/definition or at a call site. That case is discussed in more detail next.
4. All of the re-write decisions are then applied to the source code.

Resolving conflicts
Defining distinct constraint variables for function declarations, used at call sites, and function definitions, used within that function, can result in conflicting solutions. If there is a conflict, then either the declaration's solution is safer than the definition's, or the definition's is safer than the declaration's. Which case we are in can be determined by considering the relationship between the variables' valuations in the qualifier lattice. There are three cases:
- No imbalance: In this case, the re-write is made based on the value of the constraint variable in the solution to the unification.
- Declaration (caller) is safer than definition (callee): In this case, there is nothing to do for the function, since the function does unknown things with the pointer. This case will be dealt with at the call site by inserting a cast.
- Declaration (caller) is less safe than definition (callee): In this case, there are call sites that are unsafe, but the function itself is fine. We can re-write the function declaration and definition with a bounds-safe interface. The itype syntax indicates that a parameter a can be supplied by the caller as either an int * or a Ptr<τ>, but the function body will treat a as a Ptr<τ>. (See Section 2 for more on interface types.)
This approach has advantages and disadvantages. It favors making the fewest number of modifications across a project. An alternative to using interface types would be to change the parameter type to a Ptr<τ> directly, and then insert casts at each call site. This would tell the programmer where potentially bogus pointer values were, but would also increase the number of changes made. Our approach does not immediately tell the programmer where the pointer changes need to be made. However, the Checked C compiler will do that if the programmer takes a bounds-safe interface and manually converts it into a non-interface Ptr<τ> type. Every location that would require a cast will fail to type check, signaling to the programmer to have a closer look.
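The three-way case split can be written down directly as a comparison in the qualifier lattice. The following Python sketch is illustrative only; the function name and the action labels are hypothetical, and "safer" is taken to mean "lower in the lattice PTR < ARR < UNK".

```python
ORDER = {"PTR": 0, "ARR": 1, "UNK": 2}  # lower means safer

def resolve(decl_qualifier, defn_qualifier):
    """Pick a rewriting action for one parameter.

    decl_qualifier: solution for the declaration's variable (call-site view).
    defn_qualifier: solution for the definition's variable (function-body view).
    """
    decl_rank = ORDER[decl_qualifier]
    defn_rank = ORDER[defn_qualifier]
    if decl_rank == defn_rank:
        # No imbalance: rewrite according to the solved qualifier.
        return "rewrite-from-solution"
    if decl_rank < defn_rank:
        # Callers are safer than the body: leave the function alone and
        # cast the safe argument at the call site.
        return "cast-at-call-site"
    # Some caller is less safe than the body: give the parameter an
    # interface type so both kinds of caller are accepted.
    return "bounds-safe-interface"

decisions = {
    ("PTR", "PTR"): resolve("PTR", "PTR"),
    ("PTR", "UNK"): resolve("PTR", "UNK"),
    ("UNK", "PTR"): resolve("UNK", "PTR"),
}
```

Keeping the decision a pure function of the two solved qualifiers matches the description above, where the case is determined solely by the variables' valuations in the lattice.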

Experimental Evaluation
We carried out a preliminary experimental evaluation of the efficacy of checked-c-convert. To do so, we ran it on five targets (programs and libraries) and recorded how many pointer types the rewriter converted and how many casts were inserted. We chose these targets as they constitute legacy code used in commodity systems, and in security-sensitive contexts.
Running checked-c-convert took no more than 30 minutes for each target. Table 1 contains the results. The first and last columns indicate the target, its version, and the lines of code it contains (per cloc). The second column (# of *) counts the number of pointer definitions or declarations in the program, i.e., places that might get rewritten when porting. The next three columns (% Ptr, Arr., Unk.) indicate the percentages of these that were determined to be PTR, ARR, or UNK, respectively, where only those in % Ptr induce a rewriting action. The results show that a fair number of variables can be automatically rewritten as safe, single pointers (Ptr<τ>). After investigation, there are usually two reasons that a pointer cannot be replaced with a Ptr<τ>: either some arithmetic is performed on the pointer, or it is passed as a parameter to a library function for which a bounds-safe interface does not exist.
The next two columns (Casts(Calls), Ifcs(Funcs)) examine how our rewriting algorithm takes advantage of Checked C's support for incremental conversion. In particular, column 6 (Casts(Calls)) counts how many times we cast a safe pointer at the call site of a function deemed to use that pointer unsafely; in parentheses we indicate the total number of call sites in the program. Column 7 (Ifcs(Funcs)) counts how often a function definition or declaration has its type rewritten to use an interface type, where the total declaration/definition count is in parentheses. This rewriting occurs when the function itself uses at least one of its parameters safely, but at least one caller provides an argument that is deemed unsafe. Both columns together represent an improvement in precision, compared to unification-only, due to Checked C's focus on backward compatibility.
This experiment represents the first step a developer would take toward adopting Checked C in their project. The values converted into Ptr<τ> by the rewriter need never be considered again during the rest of the conversion or by subsequent software assurance or bug-finding efforts.

Related Work
There has been substantial prior work that aims to address the vulnerability presented by C's lack of memory safety. A detailed discussion of how this work compares to Checked C can be found in Elliott et al. [11]. Here we discuss approaches for automating C safety, as that is most related to work on our rewriting algorithm. We also discuss prior work generally on migratory typing, which aims to support backward-compatible migration of an untyped/less-typed program to a statically typed one.
Security mitigations. The lack of memory safety in C and C++ has serious practical consequences, especially for security, so there has been extensive research toward addressing it automatically. One approach is to attempt to detect memory corruption after it has happened or prevent an attacker from exploiting a memory vulnerability. Approaches deployed in practice include stack canaries [32], address space layout randomization (ASLR) [35], data-execution prevention (DEP), and control-flow integrity (CFI) [1]. These defenses have led to an escalating series of measures and counter-measures by attackers and defenders [33]. These approaches do not prevent data modification or data disclosure attacks, and they can be defeated by determined attackers who use those attacks. By contrast, enforcing memory safety avoids these issues.
Memory-safe C. Another important line of prior work aims to enforce memory safety for C; here we focus on projects that aim to do so (mostly) automatically in a way related to our rewriting algorithm. CCured [26] is a source-to-source rewriter that transforms C programs to be safe automatically. CCured's goal is end-to-end soundness for the entire program. It uses a whole-program analysis that divides pointers into fat pointers (which allow pointer arithmetic and unsafe casts) and thin pointers (which do not). The use of fat pointers causes problems interoperating with existing libraries and systems, making the CCured approach impractical when that is necessary. Other systems attempt to overcome the limitations of fat pointers by storing the bounds information in a separate metadata space [25, 24] or within unused bits in 64-bit pointers [19] (though this approach is unsound [13]). These approaches can add substantial overhead; e.g., Softbound's overhead for spatial safety checking is 67%. Deputy [39] uses backward-compatible pointer representations with types similar to those in Checked C. It supports inference local to a function, but resorts to manual annotations at function and module boundaries. None of these systems permit intermixing safe and unsafe pointers within a module, as Checked C does, which means that some code simply needs to be rewritten rather than included but clearly marked within Unchecked blocks.
Migratory Typing. Checked C is closely related to work supporting migratory typing [36] (aka gradual typing [31]). In that setting, portions of a program written in a dynamically typed language can be annotated with static types. For Checked C, legacy C plays the role of the dynamically typed language and checked regions play the role of statically typed portions. In migratory typing, one typically proves that a fully annotated program is statically type-safe. What about mixed programs? They can be given a semantics that checks static types at boundary crossings [21]. For example, calling a statically typed function from dynamically typed code would induce a dynamic check that the passed-in argument has the specified type. When a function is passed as an argument, this check must be deferred until the function is called. The delay prompted research on proving blame: Even if a failure were to occur within static code, it could be blamed on bogus values provided by dynamic code [37]. This semantics is, however, slow [34], so many languages opt for what Greenman and Felleisen [14] term the erasure semantics: No checks are added and no notion of blame is proved, i.e., failures in statically typed code are not formally connected to errors in dynamic code. Checked C also has erasure semantics, but Theorem 3 is able to lay blame with the unchecked code.
Rust. Rust [20] is a programming language that, like C, supports zero-cost abstractions but, like Checked C, aims to be safe. Rust programs may have designated unsafe blocks in which certain rules are relaxed, potentially allowing run-time failures. As with Checked C, the question is how to reason about the safety of a program that contains any amount of unsafe code. The RustBelt project [17] proposes to use a semantic [23], rather than syntactic [38], account of soundness, in which (1) types are given meaning according to what terms inhabit them; (2) type rules are sound when interpreted semantically; and (3) semantic well typing implies safe execution. With this approach, unsafe code can be (manually) proved to inhabit the semantic interpretation of its type, in which case its use by type-checked code will be safe.
We view our approach as complementary to that of RustBelt, perhaps constituting the first step in mixed-language safety assurance. In particular, we employ a simple, syntactic proof that checked code is safe and unchecked code can always be blamed for a failure; no proof about any particular unsafe code is required. Stronger assurance that programs are safe despite using mixed code could employ the RustBelt approach.

Conclusions and Future Work
This paper has presented CoreChkC, a core formalism for Checked C, an extension to C aiming to provide spatial safety. CoreChkC models Checked C's safe (checked) and unsafe (legacy) pointers; while these pointers can be intermixed, use of legacy pointers is severely restricted in checked regions of code. We prove that these restrictions are efficacious: checked code cannot be blamed in the sense that any spatial safety violation must be directly or indirectly due to an unsafe operation outside a checked region. Our formalization and proof are mechanized in the Coq proof assistant. The freedom to intermix safe and legacy pointers in Checked C programs affords flexibility when porting legacy code. We show this is true for automated porting as well. A whole-program rewriting algorithm we built is able to make more pointers safe than it would if pointer types were all-or-nothing; we do this by taking advantage of Checked C's allowed casts and interface types.
As future work, we are interested in formalizing other aspects of Checked C, notably its subsumption algorithm and support for flow-sensitive typing (to handle pointer arithmetic), to prove that these aspects of the implementation are correct. We are also interested in expanding support for the rewriting algorithm, by using more advanced static analysis techniques to infer numeric bounds suitable for rewriting array types. Finally, we hope to automatically infer regions of code that could be enclosed within checked regions.

Table 1. Number of pointer declarations converted through automated porting.