1 Introduction

One of the selling points of proof assistants is their trustworthiness. Yet in practice soundness problems do come up in most proof assistants. Harrison [11] distinguishes errors in the logic and errors in the implementation (and cites examples). Our work contributes to the solution of both problems for the proof assistant Isabelle [31]. Isabelle is a generic theorem prover: it implements \(\mathcal {M}\), a fragment of intuitionistic higher-order logic, as a metalogic for defining object logics. Its most developed object logic is HOL and the resulting proof assistant is called Isabelle/HOL [24, 25]. The latter is the basis for our formalizations.

Our first contribution is the first complete formalization of Isabelle’s metalogic. Thus our work applies to all Isabelle object logics, e.g. not just HOL but also ZF. Of course Paulson [30] describes \(\mathcal {M}\) precisely, but only on paper. More importantly, his description does not cover polymorphism and type classes, which were introduced later [26]. The published account of Isabelle’s proof terms [4] is also silent about type classes. Yet type classes are a significant complication (as, for example, Kunčar and Popescu [18] found out).

Our second contribution is a verified (against \(\mathcal {M}\)) and executable checker for Isabelle’s proof terms. We have integrated the proof checker with Isabelle. Thus we can guarantee that every theorem whose proof our proof checker accepts is provable in our definition of \(\mathcal {M}\). So far we are able to check the correctness of moderately sized theories across the full range of logics implemented in Isabelle.

Although Isabelle follows the LCF architecture (theorems can only be manufactured by inference rules), it is based on an infrastructure optimized for performance. In particular, this includes multithreading, which is used in the kernel and has once led to a soundness issue (footnote 1). Therefore we opt for the “certificate checking” approach (via proof terms) instead of verifying the implementation.

This is the first work that deals directly with what is implemented in Isabelle as opposed to a study of the metalogic that Isabelle is meant to implement. Instead of reading the implementation you can now read and build on the more abstract formalization in this paper. The correspondence of the two can be established for each proof by running the proof checker.

Our formalization reflects the ML implementation of Isabelle’s terms and types and some other data structures. Thus a few implementation choices are visible, e.g. De Bruijn indices. This is necessary because we want to integrate our proof checker as directly as possible with Isabelle, with as little unverified glue code as possible, for example no translation between De Bruijn indices and named variables. We refer to this as our intentional implementation bias. In principle, however, one could extend our formalization with different representations (e.g. named terms) and prove suitable isomorphisms. Our work is purely proof theoretic; semantics is out of scope.

The formalization can be found in the Archive of Formal Proofs [28].

2 Related Work

Harrison [11] was the first to verify some of HOL’s metatheory and an implementation of a HOL kernel in HOL itself. Kumar et al. [13] formalized HOL including definition principles, proved its soundness and synthesized a verified kernel of a HOL prover down to the machine language level. Abrahamsson [2] verified a proof checker for the OpenTheory [12] proof exchange format for HOL.

Wenzel [38] showed how to interpret type classes as predicates on types. We follow his approach of reflecting type classes in the logic but cannot remove them completely because of our intentional implementation bias (see above). Kunčar and Popescu [15,16,17,18] focus on the subtleties of definition principles for HOL with overloading and prove that under certain conditions, type and constant definitions preserve consistency. Åman Pohjola et al. [1] formalize [15, 18].

Adams [3] presents HOL Zero, a basic theorem prover for HOL that addresses the problem of how to ensure that parser and pretty-printer do not misrepresent formulas.

Let us now move away from Isabelle and HOL. Sozeau et al. [36] present the first implementation of a type checker for the kernel of Coq that is proved correct in Coq with respect to a formal specification. Carneiro [6] has implemented a highly performant proof checker for a multi-sorted first-order logic and is in the process of verifying it in its own logic.

We formalize a logic with bound variables, and there is a large body of related work that deals with this issue (e.g. [7, 21, 37]) and a range of logics and systems with special support for handling bound variables (e.g. [33,34,35]). We found that De Bruijn indices worked reasonably well for us.

3 Preliminaries

Isabelle types are built from type variables, e.g. , and (postfix) type constructors, e.g. ; the function type arrow is . Isabelle also has a type class system explained later. The notation \(t \; {:}{:} \; \tau \) means that term t has type \(\tau \). Isabelle/HOL provides types and of sets and lists of elements of type . They come with the following vocabulary: function (conversion from lists to sets), (list constructor), (append), (length of list ), (the th element of starting at 0), and other self-explanatory notation.

The of a relation is the set of all such that is in .

There is also the predefined data type

figure v

The type abbreviates , i.e. partial functions, which we call maps. Maps have a domain and a range:

figure y

Logical equivalence is written instead of .

4 Types and Terms

A is simply a string. Variables have type ; their inner structure is immaterial for the presentation of the logic.

The logic has three layers: terms are classified by types as usual, but in addition types are classified by sorts. A is simply a set of class names. We discuss sorts in detail later.

Types (typically denoted by , , ...) are defined like this:

figure ag

where represents the Isabelle type and represents a type variable of sort — sorts are directly attached to type variables. The notation is short for is the name of the function type constructor.

Isabelle’s terms are simply typed lambda terms in De Bruijn notation:

figure ap

A term (typically , , , ...) can be a typed constant or free variable , a bound variable (a De Bruijn index), a typed abstraction or an application .

The term-has-type proposition has the syntax where is a list of types, the context for the type of the bound variables.

We define .

Function collects the free variables in a term. Because bound variables are indices, is simply the set of all occurs in . The type is an integral part of a variable.
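As an illustration only, and not part of the formalization, the term representation, the typing judgment with a context list, and the collection of free variables described above can be sketched in Python. All names (`Ty`, `Tv`, `Ct`, `Fv`, `Bv`, `Abs`, `App`, `typ_of`, `fv`) are assumptions made for this sketch, and the context is taken to list the types of the bound variables innermost first.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple, Union


@dataclass(frozen=True)
class Ty:
    """Type constructor applied to argument types (hypothetical name)."""
    name: str
    args: Tuple["Typ", ...] = ()


@dataclass(frozen=True)
class Tv:
    """Type variable with its sort attached (hypothetical name)."""
    name: str
    sort: FrozenSet[str] = frozenset()


Typ = Union[Ty, Tv]


def fun(a: Typ, b: Typ) -> Typ:
    """Function type, encoded as a binary type constructor named 'fun'."""
    return Ty("fun", (a, b))


@dataclass(frozen=True)
class Ct:          # typed constant
    name: str
    typ: Typ


@dataclass(frozen=True)
class Fv:          # typed free variable
    name: str
    typ: Typ


@dataclass(frozen=True)
class Bv:          # bound variable as a De Bruijn index
    index: int


@dataclass(frozen=True)
class Abs:         # typed abstraction
    typ: Typ
    body: "Term"


@dataclass(frozen=True)
class App:         # application
    fn: "Term"
    arg: "Term"


Term = Union[Ct, Fv, Bv, Abs, App]


def typ_of(ctx: List[Typ], t: Term) -> Optional[Typ]:
    """Type of t under ctx (ctx[0] is the innermost binder), or None."""
    if isinstance(t, (Ct, Fv)):
        return t.typ
    if isinstance(t, Bv):
        return ctx[t.index] if t.index < len(ctx) else None
    if isinstance(t, Abs):
        body = typ_of([t.typ] + ctx, t.body)
        return fun(t.typ, body) if body is not None else None
    f, a = typ_of(ctx, t.fn), typ_of(ctx, t.arg)
    if isinstance(f, Ty) and f.name == "fun" and len(f.args) == 2 and f.args[0] == a:
        return f.args[1]
    return None


def fv(t: Term) -> Set[Tuple[str, Typ]]:
    """Free variables of t as (name, type) pairs; the type is part of the variable."""
    if isinstance(t, Fv):
        return {(t.name, t.typ)}
    if isinstance(t, Abs):
        return fv(t.body)
    if isinstance(t, App):
        return fv(t.fn) | fv(t.arg)
    return set()
```

Because binders carry no names, `fv` never has to remove anything when it passes under an abstraction.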

A type substitution is a function of type . It assigns a type to each type variable and sort pair. We write for the overloaded function which applies such a type substitution to all type variables (and their sort) occurring in a type or term. The type instance relation is defined like this:

figure bj

We also need to \(\beta \)-contract a term to something like “ with replaced by ”. We define a function such that is that \(\beta \)-contractum. The definition of is shown in the Appendix and can also be found in the literature (e.g. [23]).
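For readers unfamiliar with substitution on De Bruijn terms, the following Python sketch shows the standard textbook formulation of \(\beta \)-contraction. It is not the definition from the Appendix: the names `lift`, `subst_bv` and `beta` are assumptions, and types are simplified to plain strings.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Fv:
    name: str


@dataclass(frozen=True)
class Bv:
    index: int


@dataclass(frozen=True)
class Abs:
    typ: str
    body: "Term"


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Fv, Bv, Abs, App]


def lift(t: Term, cutoff: int = 0) -> Term:
    """Increment every loose bound variable with index >= cutoff."""
    if isinstance(t, Bv):
        return Bv(t.index + 1) if t.index >= cutoff else t
    if isinstance(t, Abs):
        return Abs(t.typ, lift(t.body, cutoff + 1))
    if isinstance(t, App):
        return App(lift(t.fn, cutoff), lift(t.arg, cutoff))
    return t


def subst_bv(body: Term, u: Term, lev: int = 0) -> Term:
    """Replace bound variable lev in body by u and close the gap left by the binder."""
    if isinstance(body, Bv):
        if body.index < lev:
            return body                 # bound inside body: untouched
        if body.index == lev:
            return u                    # the variable being replaced
        return Bv(body.index - 1)       # loose variable: one binder disappears
    if isinstance(body, Abs):
        # Going under a binder: shift the target level and lift u.
        return Abs(body.typ, subst_bv(body.body, lift(u), lev + 1))
    if isinstance(body, App):
        return App(subst_bv(body.fn, u, lev), subst_bv(body.arg, u, lev))
    return body


def beta(t: Term) -> Term:
    """Contract a top-level redex (Abs T body) applied to u; otherwise return t."""
    if isinstance(t, App) and isinstance(t.fn, Abs):
        return subst_bv(t.fn.body, t.arg)
    return t
```

The lifting step matters only when the argument itself contains loose bound variables; for closed arguments it is the identity.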

In order to abstract over a free (term) variable there is a function that (roughly speaking) replaces all occurrences of . Again, see the Appendix for the definition. This produces (if occurs in ) a term with an unbound . Function binds it with an abstraction:

figure bx
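The two-step abstraction over a free variable can be illustrated as follows. This is a sketch under assumptions: the function names `bind_fv` and `abs_fv` and the simplified term constructors are hypothetical, types are plain strings, and the input term is assumed to contain no loose bound variables.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Fv:
    name: str
    typ: str      # the type is an integral part of a variable


@dataclass(frozen=True)
class Bv:
    index: int


@dataclass(frozen=True)
class Abs:
    typ: str
    body: "Term"


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Fv, Bv, Abs, App]


def bind_fv(var: Fv, t: Term, lev: int = 0) -> Term:
    """Replace every occurrence of the free variable var by the index lev,
    counting the binders passed on the way down."""
    if isinstance(t, Fv):
        return Bv(lev) if t == var else t
    if isinstance(t, Abs):
        return Abs(t.typ, bind_fv(var, t.body, lev + 1))
    if isinstance(t, App):
        return App(bind_fv(var, t.fn, lev), bind_fv(var, t.arg, lev))
    return t


def abs_fv(var: Fv, t: Term) -> Term:
    """Bind the index produced by bind_fv with an abstraction of var's type."""
    return Abs(var.typ, bind_fv(var, t))
```

A variable with the same name but a different type is left alone, since name and type together identify a variable.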

While this section described the syntax of types and terms, they are not necessarily wellformed and should be considered pretypes/preterms. The wellformedness checks are described later.

5 Classes and Sorts

Isabelle has a built-in system of type classes [22] as in Haskell 98 except that class constraints are directly attached to variable names: our corresponds to Haskell’s (C a, D a, ...) => ... a ....

A is Isabelle’s terminology for a set of (class) names, e.g. , which represent a conjunction of class constraints. In our work, variables etc. stand for sorts.

Apart from the usual application in object logics, type classes also serve an important metalogical purpose: they allow us to restrict, for example, quantification in object logics to object-level types and rule out meta-level propositions.

Isabelle’s type class system was first presented in a programming language context [27, 29]. We give the first machine-checked formalization. The central data structure is a so-called order-sorted signature. Intuitively, it is comprised of a set of class names, a partial subclass ordering on them and a set of type constructor signatures. A type constructor signature for a type constructor states that applying to types such that has sort (defined below) produces a type of class . Formally:

figure cj

To explain this formalization we start from a pair and recover the informal order-sorted signature described above. The set of classes is simply the of the relation. The component represents the set of all type constructor signatures (where is a list of sorts) such that . Representing as a triple, we define

figure cs

is the translation of , the data structure close to the implementation, to an equivalent but more intuitive version that is close to the informal presentations in the literature.

The subclass ordering can be extended to a subsort ordering as follows:

figure cx

The smaller sort needs to subsume all the classes in the larger sort. In particular iff .
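A minimal Python sketch of this extension from classes to sorts follows. It assumes that the subclass relation is given as a set of pairs that is already reflexive on its classes and transitive; the names `class_leq` and `sort_leq` and the example classes are assumptions made for illustration.

```python
from typing import FrozenSet, Set, Tuple

Sub = Set[Tuple[str, str]]   # (c1, c2) means: c1 is a subclass of c2


def class_leq(sub: Sub, c1: str, c2: str) -> bool:
    """Subclass test by direct lookup in the relation."""
    return (c1, c2) in sub


def sort_leq(sub: Sub, s1: FrozenSet[str], s2: FrozenSet[str]) -> bool:
    """s1 is a subsort of s2: every class in s2 is subsumed by some class in s1."""
    return all(any(class_leq(sub, c1, c2) for c1 in s1) for c2 in s2)


# Example relation: linorder is a subclass of order.
SUB = {("order", "order"), ("linorder", "linorder"), ("linorder", "order")}
```

Under this reading the empty sort imposes no constraint, so every sort is a subsort of it.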

Now we can define a predicate that checks whether, in the context of some order-sorted signature , a type fulfills a given sort constraint:

figure dc

The rule for type variables uses the subsort relation and is obvious. A type has sort if for every there is a signature and for .
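The two rules can be read as a recursive check. The sketch below is an assumption-laden illustration: type constructor signatures are stored as a nested dictionary (constructor name, then class, then list of argument sorts), and all names (`Ty`, `Tv`, `has_sort`, the example classes and constructors) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple, Union


@dataclass(frozen=True)
class Ty:
    name: str
    args: Tuple["Typ", ...] = ()


@dataclass(frozen=True)
class Tv:
    name: str
    sort: FrozenSet[str] = frozenset()


Typ = Union[Ty, Tv]
Sub = Set[Tuple[str, str]]
# constructor name -> class -> required sorts of the argument types
Tcs = Dict[str, Dict[str, List[FrozenSet[str]]]]


def sort_leq(sub: Sub, s1: FrozenSet[str], s2: FrozenSet[str]) -> bool:
    return all(any((c1, c2) in sub for c1 in s1) for c2 in s2)


def has_sort(sub: Sub, tcs: Tcs, typ: Typ, sort: FrozenSet[str]) -> bool:
    """Does typ fulfil the sort constraint in the given order-sorted signature?"""
    if isinstance(typ, Tv):
        # Type variable: its attached sort must be a subsort of the required one.
        return sort_leq(sub, typ.sort, sort)
    domains = tcs.get(typ.name)
    if domains is None:
        return False
    for c in sort:
        arg_sorts = domains.get(c)
        if arg_sorts is None or len(arg_sorts) != len(typ.args):
            return False
        if not all(has_sort(sub, tcs, a, s) for a, s in zip(typ.args, arg_sorts)):
            return False
    return True


SUB = {("order", "order"), ("linorder", "linorder"), ("linorder", "order")}
TCS = {
    "nat": {"order": [], "linorder": []},
    "list": {"order": [frozenset({"order"})]},
}
```

The direct lookup of each class relies on the signature providing an entry for every superclass, which is one of the wellformedness conditions discussed later in this section.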

We normalize a sort by removing “superfluous” class constraints, i.e. retaining only those classes that are not subsumed by other classes. This gives us unique representatives for sorts which we call normalized:

figure dj
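Read operationally, normalization keeps exactly the classes of a sort that have no strictly smaller class in the same sort. A short Python sketch, assuming the subclass relation is a set of pairs forming a partial order (the name `normalize_sort` and the example classes are assumptions):

```python
from typing import FrozenSet, Set, Tuple

Sub = Set[Tuple[str, str]]   # (c1, c2) means: c1 is a subclass of c2


def normalize_sort(sub: Sub, sort: FrozenSet[str]) -> FrozenSet[str]:
    """Drop every class that is subsumed by a different class of the same sort."""
    return frozenset(
        c for c in sort
        if not any(d != c and (d, c) in sub for d in sort)
    )


SUB = {("order", "order"), ("linorder", "linorder"), ("linorder", "order")}
```

For instance, a sort containing both `order` and `linorder` normalizes to the one containing only `linorder`, and normalizing twice changes nothing.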

We work with normalized sorts because it simplifies the derivation of efficient executable code later on.

Now we can define wellformedness of an :

figure dl

A subclass relation is wellformed if it is a partial order where reflexivity is restricted to its . Wellformedness of type constructor signatures ( ) is more complex. We describe it in terms of derived from (see above). The conditions are the following:

  • The following property requires a) that for any there must be a for every superclass of and b) coregularity which guarantees the existence of principal types [10, 29].

  • A type constructor must always take the same number of argument types:

  • Sorts must be normalized and must exist in : where

These conditions are used in a number of places to show that the type system is well behaved. For example, is upward closed:

figure ea

6 Signatures

A signature consists of a map from constant names to their (most general) types, a map from type constructor names to their arities, and an order-sorted signature:

figure eb

The three projection functions are called , and . We now define a number of wellformedness checks w.r.t. a signature . We start with wellformedness of types, which is pretty obvious:

Wellformedness of a term essentially just says that all types in the term are wellformed and that the type of a constant in the term must be an instance of the type of that constant in the signature: .

These rules only check whether a term conforms to a signature, not that the contained types are consistent. Combining wellformedness and yields welltypedness of a term:

figure ek

Wellformedness of a signature is defined as follows:

figure em

In words: all types in are wellformed, is wellformed, the type constructors in are exactly those that have an arity in , for every type constructor signature in , .

7 Logic

Isabelle’s metalogic \(\mathcal {M}\) is an extension of the logic described by Paulson [30]. It is a fragment of intuitionistic higher-order logic. The basic types and connectives of \(\mathcal {M}\) are the following:

figure eu

The type subscripts of and are dropped in the text if they can be inferred.

Readers familiar with Isabelle syntax must keep in mind that for readability we use the symbols , and for the encodings of the respective symbols in Isabelle’s metalogic. We avoid the corresponding metalogical constants completely in favour of HOL’s , , and inference rule notation.

The provability judgment of \(\mathcal {M}\) is of the form , where is a theory, (the hypotheses) is a set of terms of type and is a term of type .

A theory is a pair of a signature and a set of axioms:

figure fi

The projection functions are and . We extend the notion of wellformedness from signatures to theories:

figure fl

The first two conjuncts need no explanation. Predicate (not shown) requires the signature to have certain minimal content: the basic types ( ) of \(\mathcal {M}\) and the additional types and constants for type class reasoning from Section 7.3. Our theories also need to contain a minimal set of axioms. The set is an axiomatic basis for equality reasoning and will be explained in Section 7.2.

We will now discuss the inference system in three steps: the basic inference rules, equality and type class reasoning.

7.1 Basic Inference Rules

The axiom rule states that wellformed type-instances of axioms are provable:

where is a type substitution and denotes its application (see Section 4). The types substituted into the type variables need to be wellformed and conform to the sort constraint of the type variable:

figure fr

The conjunction only needs to hold if actually changes something, i.e. if . This condition is not superfluous because otherwise and only hold if is wellformed w.r.t .

Note that there are no extra rules for general instantiation of type or term variables. Type variables can only be instantiated in the axioms. Term instantiation can be performed using the forall introduction and elimination rules.

The assumption rule allows us to prove terms already in the hypotheses:

Both and are characterized by introduction and elimination rules:

where .

7.2 Equality

Most rules about equality are not part of the inference system but are axioms (the set mentioned above). Consequences are obtained via the axiom rule.

The first three axioms express that is reflexive, symmetric and transitive:

figure gd

The next two axioms express that terms of type ( and ) are equal iff they are logically equivalent:

figure gh

The last equality axioms are congruence rules for application and abstraction:

figure gi

Paulson [30] gives a slightly different congruence rule for abstraction, which allows one to abstract over an arbitrary, free in . We are able to derive this rule in our inference system.

Finally there are the lambda calculus rules. There is no need for \(\alpha \) conversion because \(\alpha \)-equivalent terms are already identical thanks to the De Bruijn indices for bound variables. For \(\beta \) and \(\eta \) conversion the following rules are added. In contrast to the rest of this subsection, these are not expressed as axioms.

Rule (\(\beta \)) uses the substitution function as explained in Section 4 (and defined in the Appendix).

Rule (\(\eta \)) requires a few words of explanation. We do not explicitly require that does not contain . This is already a consequence of the precondition that : it implies that is closed. For that reason it is perfectly unproblematic to remove the abstraction above .

7.3 Type Class Reasoning

Wenzel [38] encoded class constraints of the form “type has class ” in the term language as follows. There is a unary type constructor named and abbreviates . The notation is short for where is the name of a new uninterpreted constant. You should view as the term-level representation of type .

Next we represent the predicate “is of class ” on the term level. For this we define some fixed injective mapping from class to constant names. For each new class a new constant of type is added. The term represents the statement “type has class ”. This is the inference rule deriving such propositions:

figure hn

This is how the inference system is integrated into the logic.

This concludes the presentation of \(\mathcal {M}\). We have shown some minimal sanity properties, incl. that all provable terms are of type and wellformed:

Theorem 1

The attentive reader will have noticed that we do not require unused hypotheses in to be wellformed and of type . Similarly, we only require in rules that need it to preserve wellformedness of the terms and types involved. To restrict to wellformed theories and hypotheses we define a top-level provability judgment that requires wellformedness:

figure hu

8 Proof Terms and Checker

Berghofer and Nipkow [4] added proof terms to Isabelle. We present an executable checker for these proof terms that is proved sound w.r.t. the above formalization of the metalogic. Berghofer and Nipkow also developed a proof checker but it was unverified and checked the generated proof terms by feeding them back through Isabelle’s unverified inference kernel.

It is crucial to realize that all we need to know about the proof term checker is the soundness theorem below. The internals are, from a soundness perspective, irrelevant, which is why we can get away with sketching them informally. This is in contrast to the logic itself, which acts like a specification, which is why we presented it in detail.

This is our data type of proof terms:

figure hv

These proof terms are not designed to record proofs in our inference system, but to mirror the proof terms generated by Isabelle. Nevertheless, the constructors of our proof terms correspond roughly to the rules of the inference system. contains an axiom and a type substitution. This substitution is encoded as an association list instead of a function. and correspond to introduction of and , and correspond to the respective eliminations. and relate to the assumption rule, where refers to a free assumption while contains a De Bruijn index referring to an assumption added during the proof by an constructor. denotes a proof that a type belongs to a given type class.

Isabelle looks at terms modulo \(\beta \eta \)-equivalence and therefore does not save \(\beta \) or \(\eta \) steps, while they are explicit steps in our inference system. Therefore we have no constructors corresponding to the (\(\beta \)) and (\(\eta \)) rules. The remaining equality axioms are naturally handled by the constructor.

In the rest of the section we discuss how to derive an executable proof checker. Executability means that the checker is defined as a set of recursive functions that Isabelle’s code generator can translate into one of a number of target languages, in particular its implementation language SML [5, 8, 9].

Because of the approximate correspondence between proof term constructors and inference rules, implementing the proof checker largely amounts to providing executable versions of each inference rule, as in LCF: each rule becomes a function that checks the side conditions, and if they are true, computes the conclusion from the premises given as arguments. The overall checker is a function

figure in
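The LCF-style reading of rules as functions can be illustrated on a deliberately tiny fragment. The following Python sketch covers only the assumption rule and the introduction and elimination of implication over propositional formulas; it is a toy under stated assumptions, not the checker of this paper, and every name in it (`Imp`, `Hyp`, `PBound`, `AbsP`, `AppP`, `replay`, `check_proof`) is hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Union


@dataclass(frozen=True)
class Imp:
    lhs: "Prop"
    rhs: "Prop"


Prop = Union[str, Imp]      # atoms are plain strings


@dataclass(frozen=True)
class Hyp:                  # use of a free assumption
    prop: Prop


@dataclass(frozen=True)
class PBound:               # De Bruijn index of an assumption opened by AbsP
    index: int


@dataclass(frozen=True)
class AbsP:                 # implication introduction
    prop: Prop
    body: "Proof"


@dataclass(frozen=True)
class AppP:                 # implication elimination
    fn: "Proof"
    arg: "Proof"


Proof = Union[Hyp, PBound, AbsP, AppP]


def replay(hyps: FrozenSet[Prop], bound: List[Prop], p: Proof) -> Optional[Prop]:
    """Check side conditions and compute the conclusion, or None on failure."""
    if isinstance(p, Hyp):
        return p.prop if p.prop in hyps else None
    if isinstance(p, PBound):
        return bound[p.index] if p.index < len(bound) else None
    if isinstance(p, AbsP):
        concl = replay(hyps, [p.prop] + bound, p.body)
        return Imp(p.prop, concl) if concl is not None else None
    f = replay(hyps, bound, p.fn)
    a = replay(hyps, bound, p.arg)
    if isinstance(f, Imp) and a is not None and f.lhs == a:
        return f.rhs
    return None


def check_proof(hyps: FrozenSet[Prop], p: Proof, claimed: Prop) -> bool:
    """Accept only if the replayed conclusion is exactly the claimed proposition."""
    result = replay(hyps, [], p)
    return result is not None and result == claimed
```

Comparing the replayed conclusion with the claimed proposition is the step that keeps the proof term itself out of the trusted base: a wrong proof term can only lead to rejection.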

In particular we need to make the inductive wellformedness checks for sorts, types and terms, signatures and theories executable. Mostly, this amounts to providing recursive versions of inductive definitions and proving them equivalent.

We now discuss some of the more difficult implementation steps. To model Isabelle’s view of terms modulo \(\beta \eta \)-equivalence, we normalize our terms (\(\alpha \)-equivalence is for free thanks to De Bruijn notation) during the reconstruction of the proof. A lengthy proof shows that this preserves provability (we do not go into the details):

figure ir

Isabelle’s code generator needs some help handling the maps used in the (order-sorted) signatures. We provide a refinement of maps to association lists. Another problematic point is the definition of the type instance relation , which contains an (unbounded) existential quantifier. To make this executable, we provide an implementation which tries to compute a suitable type substitution. In another step, we refine the type substitution to an association list as well.
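One common way to make such an existential executable is first-order matching: walk both types in parallel and build the substitution on the fly. The Python sketch below illustrates the idea and is not the implementation from the formalization. The names `match`, `tsubst` and `is_instance` are assumptions, a dictionary stands in for the association list, and sort constraints are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple, Union


@dataclass(frozen=True)
class Ty:
    name: str
    args: Tuple["Typ", ...] = ()


@dataclass(frozen=True)
class Tv:
    name: str
    sort: FrozenSet[str] = frozenset()


Typ = Union[Ty, Tv]
Subst = Dict[Tuple[str, FrozenSet[str]], Typ]   # keyed by (variable, sort)


def tsubst(t: Typ, subst: Subst) -> Typ:
    """Apply a type substitution; unmapped variables stay unchanged."""
    if isinstance(t, Tv):
        return subst.get((t.name, t.sort), t)
    return Ty(t.name, tuple(tsubst(a, subst) for a in t.args))


def match(pattern: Typ, target: Typ, subst: Subst) -> Optional[Subst]:
    """Extend subst so that tsubst(pattern, subst) == target, or return None."""
    if isinstance(pattern, Tv):
        key = (pattern.name, pattern.sort)
        if key in subst:
            return subst if subst[key] == target else None
        extended = dict(subst)
        extended[key] = target
        return extended
    if (not isinstance(target, Ty) or pattern.name != target.name
            or len(pattern.args) != len(target.args)):
        return None
    for p, t in zip(pattern.args, target.args):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst


def is_instance(instance: Typ, general: Typ) -> bool:
    """Is there a substitution turning `general` into `instance`?"""
    return match(general, instance, {}) is not None
```

Since each type variable is bound at most once and every later occurrence is compared against that binding, the search is deterministic and terminates after a single traversal.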

In the end we obtain a proof checker

figure it

that checks theory and checks if proof proves the given proposition . The latter check is important because the Isabelle theorems that we check contain both a proof and a proposition that the theorem claims to prove. Function checks this claim. As one of our main results, we can prove the correctness of our checker:

Theorem 2

The proof itself is conceptually simple and proceeds by induction over the structure of proof terms. For each proof constructor we need to show that the corresponding inference rule leads to the same conclusion as its functional version used by . Most of the proof effort goes into a large library of results about terms, types, signatures, substitutions, wellformedness etc. required for the proof, most importantly the fact that normalization preserves provability.

9 Size and Structure of the Formalization

All material presented so far has been formalized in Isabelle/HOL. The definition of the inference system (incl. types, terms etc.) resides in a separate theory that depends only on the basic library of Isabelle/HOL. It takes about 300 LOC and is fairly high level and readable; we presented most of it. This is at least an order of magnitude smaller than Isabelle’s inference kernel (which is not clearly delineated), although of course the latter is optimized for performance. Its abstract type of theorems alone takes about 2,500 LOC, not counting any infrastructure of terms, types, unification etc.

The whole formalization consists of 10,000 LOC. The main components are:

  • Almost half the formalization (4,700 LOC) is devoted to providing a library of operations on types and terms and their properties. This includes, among others, executable functions for type checking, different types of substitutions, abstractions, the wellformedness checks and \(\beta \) and \(\eta \) reductions.

  • Proving derived rules of our inference system takes up 3,000 LOC. A large part of this is deriving rules for equality and the \(\beta \) and \(\eta \) reductions. Weakening rules are also derived.

  • Making the wellformedness checks for (order-sorted) signatures and theories as well as the type instance checks executable takes 1,800 LOC.

  • The definition and correctness proof of the checker build on the above material and take only about 500 additional LOC.

10 Integration with Isabelle

As explained above, Isabelle generates SML code for the proof checker. This code has its own definitions of types, terms etc. and needs to be interfaced with the corresponding data structures in Isabelle. This step requires 150 lines of handwritten SML code (glue code) that translates Isabelle’s data structures into the corresponding data structures in the generated proof checker such that we can feed them into . We cannot verify this code and therefore aim to keep it as small and simple as possible. This is the reason for the previously mentioned intentional implementation bias we introduced in our formalization. We describe now how the various data types are translated. We call a translation trivial if it merely replaces one constructor by another, possibly forgetting some information.

The translation of types and terms is trivial as their structure is almost identical in the two settings. For Isabelle code experts it should be mentioned that the two term constructors Free and Var in Isabelle (which both represent free variables but Var can be instantiated by unification) are combined in type of the formalization which we left unspecified but which in fact looks like this: . This is purely to trivialize the glue code, in our formalization is totally opaque.

Proof term translation is trivial except for two special cases. Previously proved lemmas become axioms in the translation (see also below) and so-called “oracles” (typically the result of unfinished proofs, i.e. “sorry” on the user level) are rejected (but none of the theories we checked contain oracles). Also remember that the translation of proofs is not safety critical because all that matters is that in the end we obtain a correct proof of the claimed proposition.

We also provide functions to translate relevant content from the background theory: axioms and (order-sorted) signatures. This mostly amounts to extracting association lists from efficient internal data structures. Translating the axioms also involves translating some alternative internal representation of type class constraints into their standard form presented in Sect. 7.3.

The checker is integrated into Isabelle by calling it every time a new named theorem has been proved. The set of theorems proved so far is added to the axiomatic basis for this check. Cyclic dependencies between lemmas are ruled out by this ordering because every theorem is checked before being added to the axiomatic basis. However, an explicit cyclicity check is not part of the formalization (yet), which speaks only about checking single proofs.
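The ordering argument can be made concrete with a small Python sketch. It is an illustration only: the function `check_in_order`, the tuple layout of theorems and the toy `check` callback (a proof is just the list of propositions it uses) are all assumptions, not the glue code described here.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Theorem = Tuple[str, str, List[str]]   # (name, proposition, proof)


def check_in_order(
    axioms: Iterable[str],
    theorems: Iterable[Theorem],
    check: Callable[[Set[str], List[str], str], bool],
) -> Optional[str]:
    """Check theorems one by one; return the name of the first failure, else None."""
    known = set(axioms)
    for name, prop, proof in theorems:
        # A proof may rely only on axioms and on theorems checked earlier.
        if not check(known, proof, prop):
            return name
        known.add(prop)     # only now may later proofs use this theorem
    return None


def toy_check(known: Set[str], proof: List[str], prop: str) -> bool:
    """Toy stand-in for a proof checker: accept if all used facts are known."""
    return all(dep in known for dep in proof)
```

Two lemmas that cite each other cannot both pass, because whichever comes first refers to a proposition that has not yet entered the axiomatic basis.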

11 Running the Proof Checker

We run this modified Isabelle with our proof checker on multiple theories in various object logics contained in the Isabelle distribution. A rough overview of the scope of the covered material for some logics and the required running times can be found in the following table. The running times are the total times for running Isabelle, not just the proof checking, but the latter takes 90% of the time. All tests were performed on an Intel Core i7-9750H CPU running at 2.60 GHz with 32 GB of RAM.

| Logic | LOC | Time |
|---|---|---|
| FOL | 4,500 | 45 secs |
| ZF | 55,000 | 25 mins |
| HOL | 10,000 | 26 mins |

We can check the material in several smaller object logics in their entirety. One of the larger such logics is first-order logic (FOL). These logics do not develop any applications but FOL comes with proof automation and theories testing that automation, in particular Pelletier’s collection of problems that were considered challenges in their day [32]. Because the proofs are found automatically, the resulting proof terms will typically be quite complex and good test material for a proof checker.

The logic ZF (Zermelo-Fraenkel set theory) builds on FOL but contains real applications and is an order of magnitude larger than FOL. We are able to check all material formalized in ZF in the Isabelle distribution.

Isabelle’s most frequently used and largest object logic is HOL. We managed to check about 12% of the Main library. This includes the basic logic and the libraries of sets, functions, orderings, lattices and groups. The formalizations are non-trivial and make heavy use of Isabelle’s type classes.

Why can we check about five times as many lines of code in ZF as in HOL? Profiling revealed that the proof checker spends a lot of time in functions that access the signature, especially the wellformedness checks. The primary reason is inefficient data structures (e.g. association lists): the running time depends heavily on the size of the signature and increases with every new constant, type and class. To make matters worse, there is no sharing of any kind in terms/types and their wellformedness checks. Because ZF is free of polymorphism and type classes, these wellformedness checks are much simpler.

12 Trust Assumptions

We need to trust the following components outside of the formalization:

  • The verification (and code generation) of our proof checker in Isabelle/HOL. This is inevitable: one has to trust some theorem prover to start with. We could improve the trustworthiness of this step by porting our proofs to the verified HOL prover by Kumar et al. [13], but its code generator produces CakeML [14], not SML.

  • The unverified glue code in the integration of our proof checker into Isabelle (Sect. 10).

Because users currently cannot examine Isabelle’s internal data structures that we start from, they have to trust Isabelle’s front end that parses and transforms some textual input file into internal data structures. One could add a (possibly verified) presentation layer that outputs those internal representations into a readable format that can be inspected, while avoiding the traps Adams [3] is concerned with.

13 Future Work

Our primary focus will be on scaling up the proof checker to not just deal with all of HOL but with real applications (including itself!). There is a host of avenues for exploration. Just to name a few promising directions: more efficient data structures than association lists (e.g. via existing frameworks [19, 20]); caching of wellformedness checks for types and terms; exploiting sharing within terms and types (tricky because our intentionally simple glue code creates copies); working with the compressed proof terms [5] that Isabelle creates by default instead of uncompressing them as we do now.

We will also upgrade the formalization of our checker from individual theorems to sets of theorems, explicitly checking cyclic dependencies (which are currently prevented by the glue code, see Sect. 10).

A presentation layer as discussed in Sect. 12 would not just allow the inspection of the internal representation of the theories but could also be extended to the proofs themselves, thus permitting checkers to be interfaced with Isabelle on a textual level instead of internal data structures.

It would also be nice to have a model-theoretic semantics for \(\mathcal {M}\). We believe that the work by Kunčar and Popescu [15,16,17,18] could be adapted from HOL to \(\mathcal {M}\). This would in particular yield semantically justified cyclicity checks for constant and type definitions which we currently treat as axioms because a purely syntactic justification is unclear.