1 Introduction

Programs that run in critical environments need to comply with strong safety guarantees. The minimal guarantee one expects for critical software is the absence of runtime failures. Sound static analyses can provide such guarantees statically, for every possible execution of a program, and in a fully automatic manner.

The static typing discipline found in the ML family of languages is one such static analysis technique. It brought strong safety guarantees to programs at a very low cost: well-typed programs cannot “go wrong” [48]. This soundness theorem for well-typed ML programs, however, does not preclude programs from ending abruptly with uncaught exceptions. Several analyses for ML-like languages have been developed to detect such undesirable behaviours. They either leverage type and effect systems [38, 54], or are based on variants of control-flow analyses or set constraints [14, 15, 66, 67, 68]. The recent success of algebraic effects and their introduction in popular languages such as OCaml [37] has renewed interest in the static detection of uncaught exceptions and effects.

Analysing uncaught exceptions in ML is a difficult problem, because data flow and control flow are interdependent. This is due not only to the first-class nature of functions, but also to the first-class nature of exceptions themselves: they can be passed as parameters, and stored in data structures or in mutable references. Furthermore, exceptions can carry any value as argument, including functions, and new exceptions can be generated dynamically at runtime.

In this paper, we propose a static analysis for a higher-order language, in which exceptions are first-class values. The analysis is based on the abstract interpretation framework [9]. It is a forward value analysis that infers which values any program point can compute, and which exceptions they might raise. For this purpose, we introduce a novel abstract domain that can represent recursively defined sets of values. We define a widening operator for this abstract domain, that is responsible for finding recursive generalisations of solutions.

Our analysis leverages this abstract domain to represent both possible values and exceptions, thanks to the abstract exception monad. This monad, which can also be used as an abstract domain, is an abstraction of the exception monad that collects all values and exceptions.

We define our analysis as a big-step monadic interpreter, written in the open recursive style that was emphasised in the “Abstracting Definitional Interpreter” approach [11]. Then, we obtain an effective analyser by applying a generic, dynamic fixpoint solver [6, 12, 24, 30, 59, 63]. We prove that our analysis is sound, under the soundness assumption of the fixpoint solver.

We extend the analysis to handle a large subset of the OCaml language. In particular, it supports the dynamic creation of exceptions, mutable state, modules and functors. The analysis is so far limited to sequential programs that do not perform system calls, do not use the Gc or Obj modules, and do not employ recursive modules, general recursive definitions of values, objects, classes, arrays, or floats. We implemented an OCaml prototype for this analyser. It reports the possibly thrown exceptions and an over-approximation of the data they carry, along with an abstraction of the call trace that led to the program point where the exception was raised. We discuss some implementation choices, and evaluate the precision and performance of our analyser on 290 programs, which include examples from the literature and from the OCaml compiler's test suite.

2 Overview

Let us consider the classic example of the factorial function, as written below in continuation-passing style.

[Figure c: the factorial function in continuation-passing style]

The function calls itself recursively with increasing values of one of its integer parameters, until a target value is reached.
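The original listing is not reproduced in this text. The following is only a plausible sketch of such a function, not the authors' code: the names `fact_cps`, `fact`, `n`, `i` and `k` are assumptions, chosen so that the continuations built at run time are either the identity or a closure of the shape `fun x -> k (x * i)` with `i` at least 1.

```ocaml
(* Hypothetical reconstruction: factorial in continuation-passing style.
   [i] counts upwards from 1 to [n]; [k] is the current continuation. *)
let rec fact_cps (n : int) (i : int) (k : int -> int) : int =
  if i > n then k 1
  else fact_cps n (i + 1) (fun x -> k (x * i))

(* The initial continuation is the identity function. *)
let fact (n : int) : int = fact_cps n 1 (fun x -> x)
```

Each recursive call wraps the previous continuation in a new closure, so the set of continuations reaching `fact_cps` is unbounded in depth, which is why a recursive description of that set is useful.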

We are interested in finding which values (and exceptions) this program might return. To answer this question, we first need to find the possible continuations the function can be called with, and, importantly, we need an abstract domain in which we can express this set, or an over-approximation thereof.

With the abstract domain that we introduce in §4, we can express such a set as the following abstract value:

$$\mu \alpha . \left\{ \mathsf {funs:}\, \left\{ (\lambda ^{} x . \, x) \mapsto \{\}; \; (\lambda ^{} x . \, k~(x * i)) \mapsto \{i \mapsto \{ \mathsf {ints:}\, [1,+\infty ] \} ; k \mapsto \alpha \}; \right\} \right\} $$

This abstract value represents a recursively defined set (as indicated by the \(\mu \) constructor) that is locally named \(\alpha \). This set is composed of function closures, which can be either the identity function or the function \(\lambda ^{} x . \, k~(x * i) \), considered in an environment where the variable i is bound to an integer greater than or equal to 1, and where the variable k is recursively bound to the local variable \(\alpha \), i.e., to a value of the set being defined.

Our abstract domain can also express structural invariants on data, such as the one for red-black trees [52], that forbids red nodes from having red children:

$$ \mu \alpha . \left\{ \begin{array}{@{}l@{}} \mathsf {constructs:}\, \left\{ \begin{array}{@{}l@{}} E: ();\\ R: \left( \begin{array}{@{}l@{}} \{ \mathsf {constructs:}\, \{ E: (), B: (\alpha , \{ \mathsf {ints:}\, \top \}, \alpha ) \} \};\\ \{ \mathsf {ints:}\, \top \},\\ \{ \mathsf {constructs:}\, \{ E: (), B: (\alpha , \{ \mathsf {ints:}\, \top \}, \alpha ) \} \} \end{array} \right) ;\\ B: (\alpha , \{ \mathsf {ints:}\, \top \}, \alpha ) \end{array} \right\} \end{array} \right\} $$

Our abstract domain bears a strong similarity to the theory of equi-recursive types [56], in the sense that recursion is a core aspect of our definition. However, it differs from recursive types in that function types are absent: sets of closures are used instead. Moreover, it is parameterised by a non-relational abstract domain used to represent integer values, which is not possible with simple type systems.

We leverage our abstract domain and define a static analysis for a call-by-value \(\lambda \)-calculus with pattern matching, exception handling, and first-class exceptions (§3). In this language, the order of evaluation is made explicit by let bindings, and pattern matching is exhaustive and non-ambiguous [8]. These requirements drastically simplify the semantics of programs and their analysis. The analysis is defined as an abstract interpreter that performs a forward value analysis (§5).

Based on this small abstract interpreter, we sketch (§6) several extensions that we implemented to obtain a static analyser for a subset of OCaml programs. The implementation uses an intermediate language that is close to the one of §3, into which we translate the OCaml typed abstract syntax tree. We evaluated the precision and performance of our analyser on 290 OCaml programs, written in a variety of styles (direct, CPS, monadic, etc.). We discuss these experimental results (§7), cover related work (§8), and finish with concluding remarks (§9).

3 A \(\lambda \)-calculus With Exceptions

We introduce as an intermediate language a \(\lambda \)-calculus with pattern matching and exception handling. Its syntax resembles the monadic normal form, where the order of evaluation is made explicit with let bindings.

Definition 1

Given \(\mathcal {C}\) a set of constructor symbols, we give the following inductive definition of patterns p, q, and expressions t, u, r:

$$\begin{array}{rcl} p,~q \in \mathbb {P} &{}:\,\!:=\, &{} x\; \mid \;n \; \mid \;c(p_1,\ldots ,p_k) \; \mid \;p_1 +p_2 \; \mid \;p \setminus q\\ t,~u,~r \in \mathbb {T} &{}:\,\!:=\, &{} x\; \mid \;n \; \mid \;x_1 \mathbin { op }x_2 \; \mid \;c(x_1,\ldots ,x_k) \\ {} &{} \mid &{} \mu ^{} f . \, \lambda x . \, t \; \mid \;f~y\ \; \mid \;\textsf{let}~{x} = {t}~\textsf{in}~{u} \; \mid \;\textsf{raise}~{x} \\ {} &{} \mid &{} \textsf{match}~{t}~\textsf{with}~ p_1 \Rightarrow u_1 \,|\,\cdots \,|\, p_n \Rightarrow u_n \\ {} &{} \mid &{} \textsf{dispatch}~{t}~\textsf{with}~\textsf{val}~{x}\Rightarrow {u} \,|\, \textsf{exn}~{y}\Rightarrow {r} \end{array}$$

where n is a constant integer, c is a constructor of \(\mathcal {C}\), \(\mathbin { op }\) is a binary operation on integers, and where the pattern q cannot contain any complement \(p_1 \setminus p_2\).

We consider a pattern syntax and formalism inspired by [8]. The pattern disjunction \(p +q\) matches any value matched by p or q, and the pattern complement \(p \setminus q\) matches any value that is matched by p but not by q.

As in the OCaml typed AST, variables carry a type. We may write \({x}_{\tau }\) to denote that the variable \(x\) is of type \(\tau \). Patterns are linear, i.e., sub-patterns of constructor patterns cannot share variables. All functions are recursive by default. If \(f\) does not occur in the expression t, then we write \(\lambda ^{} x . \, t \) instead of \(\mu ^{} f . \, \lambda x . \, t \).

The values of this language are integer constants, constructors applied to values, and function closures, that contain an environment of values:

$$\begin{array}{rcl} v \in \mathbb {V} &{}:\,\!:=\, &{} n \; \mid \;c(v_1,\ldots ,v_k) \; \mid \;\langle E, \mu ^{} f . \, \lambda x . \, t \rangle \quad \text {where }\textrm{dom}\, E = \textrm{fv}(\mu ^{} f . \, \lambda x . \, t) \\ E \in \mathbb {E} &{}:\,\!:=\, &{} [] ~\mid ~ E, x \!\mapsto \! v \end{array}$$

Patterns induce a matching relation over values, that is described, with regard to a given environment \(E \), by recursion on patterns:

$$\begin{array}{r@{~~}c@{~~}ll} x&{} \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} &{} v &{} ~~\iff ~~ E (x)=v\\ c(p_1,\ldots ,p_n) &{} \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} &{} c(v_1,\ldots ,v_n) &{}~~\iff ~~ \bigwedge _{i=1}^n p_i \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} v_i\\ {p}+{q} &{} \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} &{} v &{} ~~\iff ~~ p \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} v ~\vee ~ q \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} v\\ {p}\setminus {q} &{} \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} &{} v &{} ~~\iff ~~ p \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} v ~\wedge ~ q \mathbin {\prec \!\!\not \prec }v \end{array}$$

We say that a pattern p matches a value v, denoted \(p \mathbin {\prec \!\!\prec }v\), iff there exists an environment \(E \) such that \( p \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {E}} v\). In that case, we write \({E}\langle {p \mathbin {\prec \!\!\prec }v}\rangle \) for the smallest environment such that \( p \mathbin {\prec \!\!\prec }_{\scriptscriptstyle {{E}\langle {p \mathbin {\prec \!\!\prec }v}\rangle }} v\).

Thanks to this pattern-matching formalism, we can focus on the class of programs where pattern matching is exhaustive and non-ambiguous. That is, in a term \(\textsf{match}~{t}~\textsf{with}~ p_1 \Rightarrow u_1 \,|\,\cdots \,|\, p_n \Rightarrow u_n\) where \(t\,:\,\tau \), we require that for any value \(v\,:\,\tau \), there exists a unique \(1 \le i \le n\) such that \(p_i \mathbin {\prec \!\!\prec }v\). The work presented in [8] shows how to disambiguate patterns, i.e., how to make any pattern match non-ambiguous. We restrict ourselves to non-ambiguous patterns, because it simplifies both the dynamic semantics and the analysis of programs.

Fig. 1. Big-step semantics.

We present in Figure 1 a call-by-value big-step semantics for our language. We write \(t \Downarrow _{\textsf{val}}v\) to denote that the term t reduces to the value v, and we write \(t \Downarrow _{\textsf{exn}}v\) to denote that the reduction of t raises an exception evaluated as v. In this language, any value can be raised as an exception. The evaluation rules are mostly standard. We briefly explain the rules for \(\textsf{match}\) and \(\textsf{dispatch}\).

Non-ambiguous pattern matching simplifies the semantics of the term \(\textsf{match}~{t}~\textsf{with}~ p_1 \Rightarrow u_1 \,|\,\cdots \,|\, p_n \Rightarrow u_n\), as only one pattern can match the value of t, and thus only one branch is considered during the evaluation.

The rule Dispatch deals with exception handling: the evaluation of the term \(\textsf{dispatch}~{t}~\textsf{with}~\textsf{val}~{x_\textsf{val}}\Rightarrow {u_\textsf{val}}~|~\textsf{exn}~{x_\textsf{exn}}\Rightarrow {u_\textsf{exn}}\) first evaluates t. If t reduces to a value, then the value branch \(u_\textsf{val}\) is evaluated. Otherwise, if t raises an exception, the exception branch \(u_\textsf{exn}\) is evaluated. In both cases, the value or the exception is added to the environment of the corresponding branch.
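To make the behaviour of \(\textsf{dispatch}\) concrete, here is a minimal sketch of a big-step evaluator for a fragment of the language. It is an illustration under assumptions rather than the semantics of Figure 1: constructors and \(\textsf{match}\) are omitted, the only operation is addition, functions are non-recursive, closures capture the whole environment instead of only the free variables, and all OCaml names (`term`, `outcome`, `eval`, ...) are hypothetical.

```ocaml
(* Hypothetical fragment of the language: no constructors, no match. *)
type term =
  | Var of string
  | Int of int
  | Add of string * string                 (* x1 op x2, with op fixed to + *)
  | Lam of string * term                   (* non-recursive function *)
  | App of string * string                 (* f y *)
  | Let of string * term * term
  | Raise of string
  | Dispatch of term * (string * term) * (string * term)

type value =
  | VInt of int
  | VClo of env * string * term
and env = (string * value) list

(* An evaluation either yields a value or raises a value as an exception. *)
type outcome = Val of value | Exn of value

let rec eval (env : env) (t : term) : outcome =
  match t with
  | Var x -> Val (List.assoc x env)
  | Int n -> Val (VInt n)
  | Add (x, y) ->
    (match List.assoc x env, List.assoc y env with
     | VInt a, VInt b -> Val (VInt (a + b))
     | _ -> failwith "ill-typed addition")
  | Lam (x, body) -> Val (VClo (env, x, body))
  | App (f, y) ->
    (match List.assoc f env with
     | VClo (cenv, x, body) -> eval ((x, List.assoc y env) :: cenv) body
     | VInt _ -> failwith "ill-typed application")
  | Let (x, t1, t2) ->
    (match eval env t1 with
     | Val v -> eval ((x, v) :: env) t2
     | Exn e -> Exn e)                     (* exceptions propagate *)
  | Raise x -> Exn (List.assoc x env)
  | Dispatch (t0, (x, u_val), (y, u_exn)) ->
    (* Both branches receive the result of t0, bound to their own variable. *)
    (match eval env t0 with
     | Val v -> eval ((x, v) :: env) u_val
     | Exn e -> eval ((y, e) :: env) u_exn)

(* let x = 1 in dispatch (raise x) with val v => v | exn e => e + e *)
let example =
  Let ("x", Int 1,
       Dispatch (Raise "x", ("v", Var "v"), ("e", Add ("e", "e"))))
```

In `example`, the raised value 1 is caught by the exception branch, which adds it to itself.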

4 An Abstract Domain for Regular Sets of Values

In this section, we define an abstract domain that is able to represent inductively defined sets of values of our programming language. It is parameterised over a non-relational, numeric abstract domain \(\mathbb {I} \), that provides a concretisation function \(\gamma _{\mathbb {I}}: \mathbb {I} \rightarrow \wp (\mathbb {Z}) \), a test for the abstract inclusion pre-order, and operations for union, intersection and widening, with the standard soundness conditions. For instance, the soundness of abstract union is stated: \(\gamma _{\mathbb {I}}(\textsf{I} _1) \cup \gamma _{\mathbb {I}}(\textsf{I} _2) \subseteq \gamma _{\mathbb {I}}(\textsf{I} _1 \sqcup _{\mathbb {I}} \textsf{I} _2)\).
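As an example of a numeric domain that fits these requirements, the sketch below implements integer intervals. It is an illustrative assumption, not the domain used by the authors: the type `itv` and the functions `leq`, `join`, `widen` and `mem` are hypothetical names, and `mem` plays the role of the concretisation \(\gamma _{\mathbb {I}}\) as a membership test.

```ocaml
(* Hypothetical interval domain; a bound of None stands for infinity. *)
type itv =
  | Bot
  | Itv of int option * int option   (* lower bound, upper bound *)

(* Abstract inclusion pre-order. *)
let leq (a : itv) (b : itv) : bool =
  match a, b with
  | Bot, _ -> true
  | _, Bot -> false
  | Itv (l1, u1), Itv (l2, u2) ->
    (match l1, l2 with
     | _, None -> true
     | None, _ -> false
     | Some x, Some y -> y <= x)
    && (match u1, u2 with
        | _, None -> true
        | None, _ -> false
        | Some x, Some y -> x <= y)

(* Abstract union: the smallest interval that contains both arguments. *)
let join (a : itv) (b : itv) : itv =
  match a, b with
  | Bot, x | x, Bot -> x
  | Itv (l1, u1), Itv (l2, u2) ->
    let lo = match l1, l2 with Some x, Some y -> Some (min x y) | _ -> None in
    let hi = match u1, u2 with Some x, Some y -> Some (max x y) | _ -> None in
    Itv (lo, hi)

(* Standard interval widening: an unstable bound jumps to infinity. *)
let widen (a : itv) (b : itv) : itv =
  match a, b with
  | Bot, x | x, Bot -> x
  | Itv (l1, u1), Itv (l2, u2) ->
    let lo = match l1, l2 with Some x, Some y when x <= y -> l1 | _ -> None in
    let hi = match u1, u2 with Some x, Some y when y <= x -> u1 | _ -> None in
    Itv (lo, hi)

(* Membership test, standing for the concretisation gamma_I. *)
let mem (n : int) (a : itv) : bool =
  match a with
  | Bot -> false
  | Itv (l, u) ->
    (match l with None -> true | Some x -> x <= n)
    && (match u with None -> true | Some x -> n <= x)
```

The join of [1,2] and [5,7] is [1,7], which also contains 3: the abstract union may strictly over-approximate the concrete one, as the soundness condition allows.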

The definition of our abstract domain follows:

Definition 2 (Abstract values)

$$ \begin{array}{ccll} \textsf{A} \in \mathbb {A} &{} :\,\!:= &{} \left\{ \mathsf {ints:}\, \textsf{I} ;\; \mathsf {constructs:}\, \textsf{C} ;\; \mathsf {funs:}\, \textsf{F} \right\} \; \mid \;\alpha \; \mid \;\mu \alpha . \textsf{A} &{} \text {(Abstract value)} \\ \textsf{I} \in \mathbb {I} &{} :\,\!:= &{} \text {any numeric abstract domain} &{} \text {(Abstract integers)} \\ \textsf{C} &{} :\,\!:= &{} \{\overline{c \mapsto (\textsf{A},\dots ,\textsf{A})}\} \; \mid \;\top &{} \text {(Abstract constructs)} \\ \textsf{F} &{} :\,\!:= &{} \{ \overline{\mu ^{} f . \, \lambda x . \, t \mapsto \textsf{E}} \} \; \mid \;\top &{} \text {(Abstract closures)} \\ \textsf{E} \in \mathbb {E} &{} :\,\!:= &{} \left\{ \overline{x \mapsto \textsf{A}} \right\} &{} \text {(Abstract environment)} \\ \end{array} $$

An abstract value, written \(\textsf{A} \), describes which integers it denotes (in the field \(\textsf{ints} \)), which constructed values, i.e., values whose head is a constructor, it denotes (in the field \(\textsf{constructs} \)), and which function closures it denotes (in the field \(\textsf{funs} \)). The integer values are described by a numeric abstract domain that is taken as a parameter.

The constructed values are described by a map whose keys are the possible head constructors of the values, and whose data are tuples of abstract values, that denote the possible values for all the arguments of that constructor. The constructed values might also be described by \(\top \), which means that the head constructor could be any constructor, and the arguments may be any value.

Similarly, the possible function closures are described by a map that associates possible codes of the function to abstract environments. The environments map free variables of the corresponding function code to abstract values, denoting the possible concrete values of these variables. The closures might also be described by \(\top \), to represent any closure made from any function code with any environment.

Finally, we can construct recursive sets of values through the use of variables \(\alpha \), that are introduced by the \(\mu \) constructor of fixpoints.

The bottom value is \(\{\mathsf {ints:}\, \bot ;\;\mathsf {constructs:}\, \{\};\;\mathsf {funs:}\, \{\} \}\), and the top value is \(\{\mathsf {ints:}\, \top ;\;\mathsf {constructs:}\, \top ;\;\mathsf {funs:}\, \top \}\). We may completely omit some of the fields (\(\textsf{ints} \), \(\textsf{constructs} \) or \(\textsf{funs} \)) when they are associated with a bottom value.
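One possible OCaml rendering of Definition 2 is sketched below. It is only an assumption about how such a domain could be encoded: function codes are represented by strings, the numeric domain is fixed to intervals, maps are association lists, and the names `aval`, `or_top`, `bottom`, `top` and `fact_conts` are hypothetical.

```ocaml
(* Stand-in numeric domain: intervals, with None for an infinite bound. *)
type itv = Bot | Itv of int option * int option

(* Either the top element or a finite map. *)
type 'a or_top = Top | Map of 'a

type aval =
  | Rec of { ints : itv; constructs : constructs; funs : funs }
  | Var of string                                   (* alpha *)
  | Mu of string * aval                             (* mu alpha. A *)
and constructs = (string * aval list) list or_top   (* c |-> (A, ..., A) *)
and funs = (string * aenv) list or_top              (* code |-> E *)
and aenv = (string * aval) list                     (* x |-> A *)

let bottom = Rec { ints = Bot; constructs = Map []; funs = Map [] }
let top = Rec { ints = Itv (None, None); constructs = Top; funs = Top }

(* The set of continuations of the CPS factorial from the overview:
   the identity, or a closure whose [k] points back to the whole set. *)
let fact_conts =
  Mu ("a", Rec {
    ints = Bot;
    constructs = Map [];
    funs = Map [
      ("fun x -> x", []);
      ("fun x -> k (x * i)",
       [ ("i", Rec { ints = Itv (Some 1, None);
                     constructs = Map []; funs = Map [] });
         ("k", Var "a") ]) ] })
```

Keeping the three fields side by side lets a single abstract value describe a set that mixes integers, constructed values and closures.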

This informal explanation is formalised in the concretisation function:

Definition 3 (Concretisation)

Assume \(\varGamma \) is a finite mapping from variables to abstract values. The concretisation \({\gamma _{\varGamma }} : \mathbb {A} \rightarrow \wp (\mathbb {V}) \) is defined by \({\gamma _{\varGamma }} \left\{ \mathsf {ints:}\, \textsf{I} ;\; \mathsf {constructs:}\, \textsf{C} ;\; \mathsf {funs:}\, \textsf{F} \right\} = {\gamma _{}}(\textsf{I}) \cup {\gamma _{\varGamma }}(\textsf{C}) \cup {\gamma _{\varGamma }}(\textsf{F})\), where:

[Figure h: defining equations of the concretisation function]

The definition is justified by the fact that the function \(\lambda S. {\gamma _{\varGamma ,\alpha :S}}(\textsf{A})\) is monotonic, and thus has a least fixed point, thanks to the Knaster-Tarski theorem. This is formalised by the following lemma:

Lemma 1

Consider the inclusion order \(\subseteq \) on \(\wp (S)\), and its pointwise extension on environments \(\varGamma \). For any abstract value \(\textsf{A} \), the function \(\lambda \varGamma . {\gamma _{\varGamma }}(\textsf{A})\) is monotonic.

The fact that our abstract values may represent sets of values that might not all have the same type may seem surprising, since our goal is, ultimately, to analyse strongly typed programs. The crux of the explanation lies in the fact that our abstract domain can only represent regular sets of values. If we restricted our abstract values so that they represent homogeneously typed values, it would be difficult to represent sets of values that are induced by a non-regular recursive type (like the type of finger trees [23]) or by generalised algebraic data types (GADTs). Indeed, one would need to find an over-approximation of such sets, and we would often approximate with the \(\top \) abstract value. The ability to describe regular sets of values that may not all have the same type gives us more freedom, and allows us to find more precise approximations. For instance, we can represent finger trees as a recursive set whose values are either trees or fingers, although trees and fingers have distinct types. In practice, the \(\top \) value is never produced.

We write \(\textsf{A} _1 [ \alpha \leftarrow \textsf{A} _2 ] \) to denote the capture avoiding substitution. We write \({\gamma _{}}(\textsf{A})\) for \({\gamma _{[]}}(\textsf{A})\), i.e., when the environment is empty.

The unwinding of fixpoints preserves the concretisation of abstract values.

Lemma 2 (Unwinding)

\( {\gamma _{}}(\mu \alpha . \textsf{A} ) = {\gamma _{}}(\textsf{A} [ \alpha \leftarrow \mu \alpha . \textsf{A} ]) \)

To define several operations on abstract values, we restrict them to well-formed values, using the standard contractiveness property for recursive types [16]:

Definition 4 (Contractiveness)

An abstract value \(\textsf{A} = \mu \beta _1 . \dots \mu \beta _n . \textsf{A} ' \) is \(\alpha \)-contractive if \(n \ge 0\) and \(\textsf{A} '\) does not start with \(\mu \) and is not the variable \(\alpha \).

Well-formedness requires that fixpoints are contractive, that constructors are used with the correct arity, and that the environments in closures only define bindings for the free variables of the functions.

Definition 5 (Well-formedness)

An abstract value \(\textsf{A} \) is well-formed when the following conditions are satisfied:

  • For any \(\mu \alpha . \textsf{A} ' \) that occurs in \(\textsf{A} \), the abstract value \(\textsf{A} '\) is \(\alpha \)-contractive, and

  • For any \(c \mapsto (\textsf{A} _1,\dots ,\textsf{A} _n)\) that occurs in \(\textsf{A} \), the arity of c is n, and

  • For any \(\mu ^{} f . \, \lambda x . \, t \mapsto \textsf{E} \) that occurs in \(\textsf{A} \), \(\textrm{dom}\, \textsf{E} = \textrm{fv}(\mu ^{} f . \, \lambda x . \, t)\).

Well-formedness rules out the abstract value \(\mu \alpha . \alpha \), whose concretisation is the empty set. Well-formedness is preserved by substitution, provided contractiveness for the substituted variable is satisfied. This ensures that unwinding fixpoints preserves well-formedness. In the rest of this article, we only consider closed, well-formed abstract values.

For any abstract value \(\textsf{A} \), we can retrieve the subset of integer values (respectively, constructed values, or function closures) by unwinding the top-level \(\mu \)s, if there are any, and finally reading the ints field (respectively, constructs, or funs). This is formalised in the following definition for projection on integers:

Definition 6 (Projection on integers)

The projection on integers of a well-formed abstract value \(\textsf{A} \), written \(\textsf{A}.\textsf{ints} \), is defined as follows:

$$ \begin{array}{rcl} \left\{ \mathsf {ints:}\, \textsf{I} ;\; \mathsf {constructs:}\, \textsf{C} ;\; \mathsf {funs:}\, \textsf{F} \right\} .\textsf{ints} &{} = &{} \textsf{I} \\ (\mu \alpha . \textsf{A} ).\textsf{ints} &{} = &{} (\textsf{A} [ \alpha \leftarrow \mu \alpha . \textsf{A} ]).\textsf{ints} \end{array} $$

The definition for projection is well founded, thanks to the contractiveness of \(\mu \)s: only a finite number of unwindings is necessary. The projections \(\textsf{A}.\textsf{constructs} \) and \(\textsf{A}.\textsf{funs} \) are defined in a similar way. Projection on integers is sound, as it over-approximates the set of integers an abstract value contains:

Lemma 3 (Soundness of projection on integers)

\({\gamma _{}}(\textsf{A}) \cap \mathbb {Z} \subseteq {\gamma _{}}{(\textsf{A}.\textsf{ints})}\)

Projections for constructors and closures enjoy similar soundness properties.
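The contractiveness test of Definition 4 and the projection of Definition 6 can be sketched as follows. This is a simplified illustration under assumptions: closures are omitted, the numeric domain is fixed to intervals, and the names `aval`, `subst`, `contractive` and `ints` are hypothetical. Substitution is the naive one, which is adequate here because only closed values are substituted.

```ocaml
type itv = Bot | Itv of int option * int option

(* Simplified abstract values: closures are omitted to keep the sketch short. *)
type aval =
  | Rec of itv * (string * aval list) list   (* ints, constructs *)
  | Var of string
  | Mu of string * aval

(* Substitution of the closed value [v] for the variable [a]. Since [v] is
   closed, no variable capture can occur. *)
let rec subst (a : string) (v : aval) : aval -> aval = function
  | Rec (i, cs) ->
    Rec (i, List.map (fun (c, args) -> (c, List.map (subst a v) args)) cs)
  | Var b -> if a = b then v else Var b
  | Mu (b, body) as m -> if a = b then m else Mu (b, subst a v body)

(* Skip the leading mu binders, then check that the remaining value
   is not the variable [a] itself. *)
let rec contractive (a : string) : aval -> bool = function
  | Mu (_, body) -> contractive a body
  | Var b -> a <> b
  | Rec _ -> true

(* A.ints: unwind top-level fixpoints until a record is reached.
   This terminates on closed, well-formed (contractive) values. *)
let rec ints : aval -> itv = function
  | Rec (i, _) -> i
  | Mu (a, body) as m -> ints (subst a m body)
  | Var _ -> invalid_arg "ints: open abstract value"

(* Two nested fixpoints are unwound before the integer field is found. *)
let v = Mu ("a", Mu ("b", Rec (Itv (Some 1, None), [ ("S", [ Var "a" ]) ])))
```

On `v`, the projection performs two unwindings and returns the interval starting at 1, while `contractive "a" (Var "a")` is false, which is how the ill-formed value \(\mu \alpha . \alpha \) is rejected.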

4.1 Inclusion, Union and Intersection

Following the methodology employed in the context of recursive subtyping, we define the inclusion relation between abstract values as a co-inductive relation.

Definition 7 (Abstract inclusion)

The inclusion between abstract values, written \(\textsf{A} _1 \sqsubseteq \textsf{A} _2\), is defined as a co-inductive relation by the following rules:

[Figure i: rules defining the abstract inclusion relation]

In this definition, the relation \(\sqsubseteq _\textsf{I} \) is provided by the abstract domain on integers. The inclusion relation unfolds fixpoints when necessary, and otherwise compares each field (integers, constructed values, closures) separately, by treating the finite maps for constructed values and closures as disjunctions, i.e., by using the standard Hoare ordering. In practice, the inclusion test is implemented by transforming abstract values into graphs that resemble tree automata: each graph node corresponds to a sub-term of an abstract value, and \(\mu \)-nodes create cycles. Then, it suffices to check whether one automaton simulates the other [1, 16, 31].
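The co-inductive reading of inclusion can be illustrated on a simplified domain. The sketch below is not the graph-based implementation described above; it is a direct, assumption-laden variant in the style of recursive subtyping algorithms, where a list of already visited pairs serves as the co-inductive hypothesis. Closures are omitted, integers are intervals, and the names `leq_itv`, `unfold` and `included` are hypothetical.

```ocaml
type itv = Bot | Itv of int option * int option

type aval =
  | Rec of itv * (string * aval list) list   (* ints, constructs *)
  | Var of string
  | Mu of string * aval

let leq_itv (a : itv) (b : itv) : bool =
  match a, b with
  | Bot, _ -> true
  | _, Bot -> false
  | Itv (l1, u1), Itv (l2, u2) ->
    (match l1, l2 with
     | _, None -> true | None, _ -> false | Some x, Some y -> y <= x)
    && (match u1, u2 with
        | _, None -> true | None, _ -> false | Some x, Some y -> x <= y)

(* Naive substitution, adequate for closed substituted values. *)
let rec subst a v = function
  | Rec (i, cs) ->
    Rec (i, List.map (fun (c, args) -> (c, List.map (subst a v) args)) cs)
  | Var b -> if a = b then v else Var b
  | Mu (b, body) as m -> if a = b then m else Mu (b, subst a v body)

(* Unwind top-level fixpoints; terminates on contractive values. *)
let rec unfold = function
  | Mu (a, body) as m -> unfold (subst a m body)
  | v -> v

let included (a1 : aval) (a2 : aval) : bool =
  let rec go seen a1 a2 =
    if List.mem (a1, a2) seen then true      (* co-inductive hypothesis *)
    else
      let seen = (a1, a2) :: seen in
      match unfold a1, unfold a2 with
      | Rec (i1, cs1), Rec (i2, cs2) ->
        leq_itv i1 i2
        && List.for_all (fun (c, args1) ->
             (* Every constructor on the left must be covered on the right. *)
             match List.assoc_opt c cs2 with
             | Some args2 ->
               List.length args1 = List.length args2
               && List.for_all2 (go seen) args1 args2
             | None -> false) cs1
      | _ -> invalid_arg "included: open abstract value"
  in
  go [] a1 a2

(* Lists of ones, and lists of natural numbers. *)
let ones =
  Mu ("a", Rec (Bot, [ ("Nil", []);
                       ("Cons", [ Rec (Itv (Some 1, Some 1), []); Var "a" ]) ]))
let nats =
  Mu ("b", Rec (Bot, [ ("Nil", []);
                       ("Cons", [ Rec (Itv (Some 0, None), []); Var "b" ]) ]))
```

Here `included ones nats` holds: when the check reaches the pair of list tails again, the pair is already in `seen` and is accepted. The converse fails on the integer field of `Cons`.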

Lemma 4 (Inclusion is a pre-order)

The inclusion between closed, well-formed abstract values is a pre-order, i.e., a reflexive and transitive relation.

Abstract union and intersection are defined in the companion research report [34] in a similar way, as co-inductive relations that unwind fixpoints when needed.

The abstract operations enjoy the expected soundness properties:

Lemma 5 (Soundness of abstract operations)

For any closed, well-formed abstract values \(\textsf{A} _1\) and \(\textsf{A} _2\):

  • \(\textsf{A} _1 \sqsubseteq \textsf{A} _2\) implies \({\gamma _{}}(\textsf{A} _1) \subseteq {\gamma _{}}(\textsf{A} _2)\), and

  • \({\gamma _{}}(\textsf{A} _1) \cup {\gamma _{}}(\textsf{A} _2) \subseteq {\gamma _{}}(\textsf{A} _1 \sqcup \textsf{A} _2)\), and

  • \({\gamma _{}}(\textsf{A} _1) \cap {\gamma _{}}(\textsf{A} _2) \subseteq {\gamma _{}}(\textsf{A} _1 \sqcap \textsf{A} _2)\).

The proof of Lemma 5 crucially relies on Lemma 2, which states that unwinding a recursive value preserves its concretisation.

Union and intersection are implemented by translating the values into graphs, on which union and intersection are easily computed. Then, we transform them back into trees with \(\mu \) nodes. Our implementation exploits the locally nameless representation [5], where bound variables are encoded as de Bruijn indices. We leverage this canonical representation by hash-consing values and memoising the operations [13]. This has proved essential to obtain acceptable performance.

4.2 Widening

The widening, written \(\textsf{A} _1 \nabla \textsf{A} _2\), is a binary operator on abstract values that over-approximates the union of abstract values, and is used to approximate the Kleene fixpoint iterations. The role of the widening is central in abstract interpretation, as it serves two purposes. Firstly, the widening must find generalisations of abstract values, in order to find invariants. This part impacts the precision of the analysis, and relies on heuristics. Secondly, it must ensure the termination of the analysis, by enforcing a stability property: every widening chain must reach a limit in finite time. This part impacts the performance of the analyser.

In our abstract domain, the widening operator is responsible for finding regularities in abstract values and for creating \(\mu \) nodes. A similar idea was used in the analysis of Prolog programs using type graphs [22], which are trees that contain cycles. Our widening draws inspiration from type graphs.

We now give the informal procedure to compute the widening of two abstract values \(\textsf{A} _1\) and \(\textsf{A} _2\). It operates in two phases. The first phase proceeds as follows:

  1. Compute the union \(\textsf{A} _{12}\) of \(\textsf{A} _1\) and \(\textsf{A} _2\), where the widening of the numeric abstract domain is used instead of the standard union. This ensures that the numeric parts of abstract values do not grow indefinitely.

  2. Compute \(\textsf{A} _{\textsf{new}}\), which is a minimised version of \(\textsf{A} _{12}\). Minimisation is performed by an algorithm on tree automata that produces a semantically equivalent abstract value of smaller size.

  3. Compare \(\textsf{A} _{\textsf{new}}\) and \(\textsf{A} _1\) (viewed as trees):

    • If the height of \(\textsf{A} _{\textsf{new}}\) is not greater than the height of \(\textsf{A} _1\), return \(\textsf{A} _{\textsf{new}}\);

    • If, for each construct and each code of closures, the maximal number of occurrences in each tree path of \(\textsf{A} _{\textsf{new}}\) is less than the number of such occurrences in \(\textsf{A} _1\), or than a user-provided threshold, return \(\textsf{A} _{\textsf{new}}\);

    • Otherwise, go to the shrinking phase.

Steps 2 and 3 allow the size of abstract values to grow enough, before a shrinking phase starts. In practice, this is important to find precise invariants.

The shrinking phase, which takes inspiration from the widening operation of type graphs, tries to shrink \(\textsf{A} _{\textsf{new}}\), by introducing \(\mu \) nodes at appropriate positions to “fold the abstract value on itself”. It proceeds as follows:

  1. Find clashes between \(\textsf{A} _1\) and \(\textsf{A} _{\textsf{new}}\), i.e., nodes that are reachable through the same path (possibly unwinding \(\mu \) nodes) in the two trees, and such that:

    • Either the two nodes have different sets of head constructors or codes of functions: this means that the two nodes might differ semantically.

    • Or the two nodes have different depths in the two trees: this means that some path was followed through a \(\mu \)-unwinding.

  2. If no clash is found, then return \(\textsf{A} _{\textsf{new}}\).

  3. If a clash is found, then we try to create a cycle in \(\textsf{A} _{\textsf{new}}\) by merging the clashing node with one of its ancestors:

    • We search for the closest ancestor of the clashing node that is semantically larger in the sense of the pre-order. If there is such an ancestor, then we merge it with the clashing node, thus creating a cycle.

    • If no such ancestor exists, we search for the closest ancestor that has at least the same head constructors and function codes as the clashing node, then we merge it with the clashing node too.

    • If no such ancestor exists, then we return \(\textsf{A} _{\textsf{new}}\) unchanged, which allows the abstract values to grow.

    We repeat this operation until no clashing node remains, or until a maximal number of iterations is reached. In the latter case, we truncate \(\textsf{A} _{\textsf{new}}\), i.e., we replace some nodes with \(\top \), so that it has the same height as \(\textsf{A} _1\).

In practice, we could not find any case where the final truncation is needed. We have observed that our widening operator finds precise generalisations in practice.

5 An Abstract Interpreter to Detect Uncaught Exceptions

To design our abstract interpreter, we took inspiration from the “Abstracting Definitional Interpreter” approach [11]. This methodology prescribes deriving an abstract interpreter from a concrete big-step interpreter that computes in a monad, which is a parameter of the interpreter. Furthermore, the methodology fosters the use of the open recursive style: the interpreter should be a function that takes as an extra parameter the function that was intended to be called recursively.

The first aspect, being parameterised by a monad, is motivated by the fact that one could use a monad that computes over abstractions of values. In §5.1, we present a monad that is an abstraction of the exception monad. It is also an abstract domain, and is therefore well suited to define an abstract interpreter.

The second aspect, using the open recursive style, permits the use of dynamic fixpoint solvers [6, 12, 24, 30, 59, 63]. Such solvers compute post-fixpoints, i.e., over-approximations of solutions of systems of equations over abstract values, for which the set of equations might be discovered dynamically, while solving the equations. New equations can be discovered, for instance, when the control flow of a program depends on its data flow. This is the case of higher-order programs, as the function that can be called at a given call site can possibly result from a computation. We present in §5.2 our abstract interpreter as a function that computes in the abstract exception monad, and is defined in open recursive style.
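The open recursive style can be illustrated on a toy function instead of an interpreter. In the sketch below, which is only an analogy under assumptions, `fib_open` stands for the interpreter, `fix` closes the recursion in the usual way, and `memo_fix` closes it through a cache; all three names are hypothetical. A dynamic fixpoint solver plays a role comparable to `memo_fix`, with iteration and widening over abstract values in place of a plain cache.

```ocaml
(* A function in open recursive style: recursive calls go through the
   extra parameter [self] instead of the function itself. *)
let fib_open (self : int -> int) (n : int) : int =
  if n < 2 then n else self (n - 1) + self (n - 2)

(* Plain knot-tying recovers the usual recursive function. *)
let rec fix f x = f (fix f) x

(* Another way to close the recursion: results are cached in a table, so
   that each argument is computed at most once. *)
let memo_fix f =
  let table = Hashtbl.create 16 in
  let rec self x =
    match Hashtbl.find_opt table x with
    | Some y -> y
    | None ->
      let y = f self x in
      Hashtbl.add table x y;
      y
  in
  self
```

The same definition `fib_open` is reused unchanged by both closing strategies, which is the property that lets an interpreter written in this style be handed to a generic solver.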

5.1 The Abstract Exception Monad

A big-step interpreter for a programming language with exceptions can be defined in an elegant manner using the exception monad, which we briefly recall. In the exception monad, a computation is either a success value, or an exception that carries some value (typically of type exception) from the object language.

[Figure j: definition of the exception monad]

In this monad, the \(\boldsymbol{\textbf{raise}}\) function expresses the action of throwing an exception, while the \(\boldsymbol{\textbf{dispatch}}\) function corresponds to the \(\textsf{dispatch}\) construct of our prototype language (§3), and expresses the action of catching an exception.

[Figure k: definitions of the raise and dispatch functions]

The \(\boldsymbol{\textbf{raise}}\) function simply injects its argument into the exception case, whereas the \(\boldsymbol{\textbf{dispatch}}\) function takes two continuations, to handle, respectively, the success case, and the exception case, by performing a case analysis on the monadic value.
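A possible OCaml rendering of the exception monad and of its two operations is sketched below. It is an assumption-based illustration: the type of raised values is left as a parameter `'e` instead of being fixed to the values of the object language, and the names `return`, `bind`, `raise_`, `dispatch` and `safe_div` are hypothetical.

```ocaml
(* A computation is either a success value or a raised value. *)
type ('a, 'e) m = Val of 'a | Exn of 'e

let return (x : 'a) : ('a, 'e) m = Val x

(* Sequencing: an exception short-circuits the continuation. *)
let bind (c : ('a, 'e) m) (k : 'a -> ('b, 'e) m) : ('b, 'e) m =
  match c with
  | Val x -> k x
  | Exn e -> Exn e

(* Throwing: inject the argument into the exception case. *)
let raise_ (e : 'e) : ('a, 'e) m = Exn e

(* Catching: one continuation per case of the monadic value. *)
let dispatch (c : ('a, 'e) m)
    (on_val : 'a -> ('b, 'e) m) (on_exn : 'e -> ('b, 'e) m) : ('b, 'e) m =
  match c with
  | Val x -> on_val x
  | Exn e -> on_exn e

(* Example: a division that raises a string on a zero divisor. *)
let safe_div (x : int) (y : int) : (int, string) m =
  if y = 0 then raise_ "Division_by_zero" else return (x / y)

(* The exception skips the increment and is then handled by returning 0. *)
let example =
  dispatch (bind (safe_div 10 0) (fun q -> return (q + 1)))
    return
    (fun _ -> return 0)
```

Unlike a handler that only covers the exceptional case, `dispatch` also takes the continuation for the success case, mirroring the two branches of the \(\textsf{dispatch}\) construct.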

We can easily define a monad that mimics the behaviour of the exception monad, with the difference that it deals with abstractions of sets of (possibly exceptional) values, instead of mere exceptional values. The construction is based on the observation that \(\wp (\textsf{m}\,\beta ) \) is isomorphic to \(\wp (\beta ) \times \wp (\mathbb {V}) \), that can itself be abstracted into \(\wp (\beta ) \times \wp (\mathbb {A}) \) by using our abstract domain for sets of values. Thus, we define the abstract exception monad, written \(\textsf{m}^{\boldsymbol{\sharp }}\,\beta \), as follows:

figure l

The \(\boldsymbol{\textbf{return}^\sharp }\) operation records its argument as the set of possible values, and asserts that no exception is returned: the set of possible exceptions is \(\bot \). The \(\mathbin {\boldsymbol{>\!\!\!>\!\!=^\sharp }}\) operation retrieves the value part of its monadic argument and passes it to the continuation. The final value is composed of the value part that was produced by the continuation, and of the union of the exceptions that might have been raised by the monadic value and by the evaluation of the continuation. The functions \(\boldsymbol{\textbf{return}^\sharp }\) and \(\mathbin {\boldsymbol{>\!\!\!>\!\!=^\sharp }}\) satisfy the monad laws if \((\bot ,\sqcup )\) is a monoid.

The fact that \(\textsf{m}^{\boldsymbol{\sharp }}\,\beta \) is a monad does not suffice to use it in an abstract interpreter, though. We also need \(\textsf{m}^{\boldsymbol{\sharp }}\,\beta \) to be an abstract domain, i.e., one must decide when two monadic values are included in each other, and how to compute abstract unions, intersections, and widening.

Interestingly, the monad \(\textsf{m}^{\boldsymbol{\sharp }}\,\beta \) acts as an abstract domain as soon as \(\beta \) is an abstract domain: this is the standard cartesian product of abstract domains, where operations are defined pointwise. In practice, we only need to consider the instance \(\textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A} \), i.e., the domain of exceptional abstract values.

The remaining pieces that are needed to use \(\textsf{m}^{\boldsymbol{\sharp }}\,\beta \) in an abstract interpreter are the abstract versions of \(\boldsymbol{\textbf{raise}}\) and \(\boldsymbol{\textbf{dispatch}}\). They are defined as follows:

$$ \begin{array}{@{}l@{}} \boldsymbol{\textbf{raise}^\sharp } :\,\!: \mathbb {A} \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A} \\ \boldsymbol{\textbf{raise}^\sharp }\;\textsf{A} = (\bot , \textsf{A}) \\ \end{array} \qquad \begin{array}{@{}l@{}} \boldsymbol{\textbf{dispatch}^\sharp } :\,\!: \textsf{m}^{\boldsymbol{\sharp }}\,\beta \rightarrow (\beta \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}) \rightarrow (\mathbb {A} \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}) \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A} \\ \boldsymbol{\textbf{dispatch}^\sharp }\;(B, \textsf{A})\;F\;G = F\,B \sqcup G\,\textsf{A} \\ \end{array} $$

The \(\boldsymbol{\textbf{raise}^\sharp }\) operation raises a set of possible exceptions, by recording the abstract value for exceptions in the set of possibly returned exceptions, and by returning the bottom value, since it can never return any value. It is the dual of \(\boldsymbol{\textbf{return}^\sharp }\).

The \(\boldsymbol{\textbf{dispatch}^\sharp }\) function executes the value continuation on the set of possible values, and executes the exception continuation on the set of possible exceptions, and then returns their abstract union in the domain of exceptional values.
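The four abstract operations can be sketched in OCaml over a toy abstract domain. The sketch below assumes that abstract values are finite sets of integers ordered by inclusion, which is far simpler than the paper's domain, and it specialises the monad to the instance where both components range over that domain. All names (`return_s`, `bind_s`, `raise_s`, `dispatch_s`, `join_m`) are hypothetical.

```ocaml
module IntSet = Set.Make (Int)

(* Toy abstract domain: finite sets of integers, ordered by inclusion. *)
type a = IntSet.t
let bot : a = IntSet.empty
let join : a -> a -> a = IntSet.union

(* Abstract monadic value: (possible values, possible exceptions). *)
type m = a * a

(* Pointwise union of the cartesian product. *)
let join_m ((v1, e1) : m) ((v2, e2) : m) : m = (join v1 v2, join e1 e2)

(* No exception can be returned. *)
let return_s (v : a) : m = (v, bot)

(* Run the continuation on the value part and accumulate exceptions.
   This sketch does not test the value part for emptiness. *)
let bind_s ((v, e) : m) (f : a -> m) : m =
  let (v', e') = f v in
  (v', join e e')

(* No value can be returned. *)
let raise_s (e : a) : m = (bot, e)

(* Run both continuations and take the union of their results. *)
let dispatch_s ((v, e) : m) (f : a -> m) (g : a -> m) : m =
  join_m (f v) (g e)

let of_list = IntSet.of_list

(* Values {1, 2} are incremented, exception {7} is caught and returned. *)
let example =
  dispatch_s (of_list [1; 2], of_list [7])
    (fun v -> return_s (IntSet.map (fun x -> x + 1) v))
    (fun e -> return_s e)
```

In `example`, the handler turns the exception part into a value, so the resulting exception component is empty.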

We can easily show that the abstract operations compute over-approximations of their counterpart in the exception monad. Assume the type \(\beta \) is equipped with a concretisation function \(\gamma _{\beta } : \beta \rightarrow \wp (\mathbb {B}) \) for some set \(\mathbb {B}\). Then, we define the concretisation for the abstract monad:

$$ \begin{array}{l} \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta } : \textsf{m}^{\boldsymbol{\sharp }}\,\beta \rightarrow \wp (\textsf{m}\,\mathbb {B}) \\ \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta } (B, \textsf{A}) = \{ \textsf{Success}\, b \mid b \in \gamma _{\beta }(B) \} \cup \{ \textsf{Exception}\, v \mid v \in \gamma (\textsf{A}) \} \end{array} $$

The concretisation specifies that the first component of a monadic value describes the possible success values, and that the second component describes the possible exceptions.

The soundness results for the abstract operations show that they compute over-approximations of their concrete counterparts:

Lemma 6

The following inclusions are satisfied:

  • \(\{ \boldsymbol{\textbf{return}}b \mid b \in \gamma _{\beta }(B) \} \subseteq \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta } (\boldsymbol{\textbf{return}^\sharp }B)\)

  • \(\{ m \mathbin {\boldsymbol{>\!\!\!>\!\!=}}f \mid m \in \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta _1}(M), f \in \gamma _{\beta _1\rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\beta _2}(F) \} \subseteq \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta _2} (M \mathbin {\boldsymbol{>\!\!\!>\!\!=^\sharp }}F)\)

  • \(\{ \boldsymbol{\textbf{raise}}v \mid v \in {\gamma _{}}{(\textsf{A})} \} \subseteq \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}} (\boldsymbol{\textbf{raise}^\sharp }\textsf{A})\)

  • \(\left\{ \boldsymbol{\textbf{dispatch}}\, m \, f\, g \left| \begin{array}{l} m \in \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta _1}(M), \\ f \in \gamma _{\beta _1\rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\beta _2}(F), \\ g \in \gamma _{\mathbb {V} \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\beta _2}(G) \end{array} \right. \right\} \subseteq \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta _2} (\boldsymbol{\textbf{dispatch}^\sharp }M\, F\, G)\)

where \(\gamma _{\beta _1\rightarrow \beta _2} (F) = \{ f \mid \forall X, \forall x \in \gamma _{\beta _1}(X), f\,x \in \gamma _{\beta _2}(F\,X) \}\).
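As an illustration, the first inclusion can be checked by unfolding the definitions. The derivation below is a sketch under two assumptions that are not restated in this section: that \(\boldsymbol{\textbf{return}}\,b = \textsf{Success}\,b\) in the exception monad, and that \(\gamma (\bot ) = \emptyset \).

$$ \begin{array}{rcl} \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta } (\boldsymbol{\textbf{return}^\sharp }\,B) & = & \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\beta } (B, \bot ) \\ & = & \{ \textsf{Success}\, b \mid b \in \gamma _{\beta }(B) \} \cup \{ \textsf{Exception}\, v \mid v \in \gamma (\bot ) \} \\ & = & \{ \textsf{Success}\, b \mid b \in \gamma _{\beta }(B) \} \\ & = & \{ \boldsymbol{\textbf{return}}\,b \mid b \in \gamma _{\beta }(B) \} \end{array} $$

Under these assumptions the inclusion is in fact an equality. The case of \(\boldsymbol{\textbf{raise}^\sharp }\) is symmetric, with the roles of the two components exchanged.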

5.2 A Monadic Abstract Interpreter in Open Recursive Style

In this section, we describe our whole-program static analyser. It infers an over-approximation of the values that a program might compute, and the exceptions that it might raise, with the possible values they carry. Although it analyses programs that can deal with first-class functions, it is not defined as a control-flow analyser [60], but rather as an abstract interpreter that performs a value analysis. The insight is the following: since functions are first-class citizens in the language, a value analysis also infers an approximation of the control flow. A value analysis will indeed compute which functions may be called at every call site.

Our analyser follows the open recursive style, and has the following type:

$$ (\mathbb {T} \rightarrow \mathbb {E} \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}) \rightarrow (\mathbb {T} \rightarrow \mathbb {E} \rightarrow \textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}) $$

It takes as a parameter an analyser, that represents the information that has been discovered so far on the program, and produces an analyser as output, that exploits the input analyser to produce more analysis results, that are possibly less precise. The role of the fixpoint solver is to find a post-fixpoint of this functional. Similar approaches, which leverage fixpoint solvers to define static analysers, have been successfully used in other work on static analysis [4, 22, 50, 64].

Our abstract interpreter is defined in Figure 2, where \(\llbracket t \rrbracket ^{\textsf{eval}}_{\textsf{E}}\) denotes the abstract value of type \(\textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A} \) obtained by analysing the program t under the abstract environment \(\textsf{E} \), and using the analysis function \(\textsf{eval}\) for recursive calls. Importantly, the analyser does not call \(\textsf{eval}\) for every recursive call. Instead, \(\textsf{eval}\) is only used when the analyser cannot be called on a strict sub-term. In practice, this means that \(\textsf{eval}\) is only used to analyse function calls. In every other place, we have the guarantee that the analysis is demanded on a strict sub-term, and a standard recursive call is performed. This strategy saves time in practice, as it lightens the burden of the fixpoint solver, that only needs to find post-fixpoints for function calls rather than for every program point.
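The open recursive style itself can be illustrated independently of the analyser. In the sketch below, a functional receives the function to use for recursive calls as its first parameter, and two different "solvers" close the recursion. Both are deliberately naive and hypothetical: neither iterates towards a post-fixpoint nor performs widening, as a real dynamic fixpoint solver would.

```ocaml
(* Open-recursive functional: recursive calls go through [self]. *)
let fib_open (self : int -> int) (n : int) : int =
  if n < 2 then n else self (n - 1) + self (n - 2)

(* Simplest way to close the recursion: a plain fixpoint combinator. *)
let rec fix f x = f (fix f) x

(* Another strategy for the same functional: cache every result. *)
let memo_fix (f : (int -> int) -> int -> int) : int -> int =
  let table = Hashtbl.create 16 in
  let rec self n =
    match Hashtbl.find_opt table n with
    | Some r -> r
    | None ->
      let r = f self n in
      Hashtbl.add table n r;
      r
  in
  self

let fib_plain = fix fib_open
let fib_memo = memo_fix fib_open
```

The point of the style is that `fib_open` is written once, while the strategy used to resolve recursive calls is chosen by whoever ties the knot.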

Fig. 2. Definition of the abstract interpreter.

To analyse a variable, we return the abstract value found in the environment.

To analyse a construct, we retrieve the abstract values for every argument, and return the corresponding abstract value for that constructor, or \(\bot \) if one of the arguments is \(\bot \), because of the eager semantics.

The analysis of an integer returns this integer injected in the integer domain. The analysis of binary operations on integers retrieves the integer parts of the abstract values for the two arguments, and returns the result of the transfer function from the integer domain for that binary operation.

The analysis of a function mimics the concrete semantics: it returns an abstract closure composed of the code of the function and its abstract environment.

The analysis of function calls is more interesting. If the abstract value for the argument is \(\bot \), then we return \(\bot \), because evaluation is eager. Otherwise, we retrieve all the possible closures for the value at the call position, and analyse their bodies by extending their environments with the abstract value for the argument, and with the abstract closure itself (we are dealing with recursive functions). The final result is the union, at the level of the abstract monad, of the analyses of all the possible function bodies. Because the bodies of the functions that are analysed are not strict sub-terms of the original term \(x\,y\), we perform an external recursive call to the analyser, by using the \(\textsf{eval}\) parameter.

The analysis of let bindings chains the analyses of its two parts, and, because evaluation is eager, checks for emptiness before analysing the second sub-term.

The pattern matching construct is analysed by first analysing the scrutinee, and then analysing each branch of the match independently. For each branch, we retrieve the environment produced by matching the abstract value with the pattern (written \(p \mathbin {\prec \prec ^\sharp }v\)), and then we analyse the code of that branch if the matching was possible. Then, we take the union, at the level of the abstract monad, of the analysis results from each branch. Notably, the exceptions that any branch might raise are reported in the final result. The definition for matching abstract values against patterns is available in the companion research report [34].

Analysing the raise construct is easy: a call to the \(\boldsymbol{\textbf{raise}^\sharp }\) function suffices. Finally, the analysis of dispatch amounts to calling the \(\boldsymbol{\textbf{dispatch}^\sharp }\) function from the abstract monad, on the analysis of the scrutinee, and on two continuations, that will analyse the codes of the two branches, if they are given non-\(\bot \) arguments.

5.3 Soundness of the Abstract Interpreter

We show that the abstract interpreter of Figure 2 is sound, in the sense that it computes an over-approximation of the behaviour of programs.

Definition 8 (Behaviour of programs)

Let S be a set of evaluation environments: \({{\,\textrm{EVAL}\,}}_{S} t = \bigcup _{E \in S} \{ \textsf{Success}\, v \mid {E} \vdash {t} \Downarrow _{\textsf{val}}{v} \} \cup \{ \textsf{Exception}\, e \mid {E} \vdash {t} \Downarrow _{\textsf{exn}}{e} \} \)

The behaviour of a program t is thus a function \({{\,\textrm{EVAL}\,}}\) that takes a set of evaluation environments as input, and produces a set of values, each with a tag that indicates whether it results from normal or from exceptional evaluation.

Then, the soundness of the abstract interpreter follows:

Theorem 1 (Soundness)

Assume \(\textsf{eval}\) is a post-fixpoint, i.e., \(\llbracket t \rrbracket ^{\textsf{eval}}_{\textsf{E}} \sqsubseteq \textsf{eval} \, t \, \textsf{E} \) for every t and \(\textsf{E} \). Then, \({{\,\textrm{EVAL}\,}}_{\gamma (\textsf{E})} t \subseteq \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}} (\llbracket t \rrbracket ^{\textsf{eval}}_{\textsf{E}})\).

Proof

We have to show that for every \(E \in \gamma (\textsf{E})\), \(m \in \{\textsf{val},\textsf{exn}\}\) and \(v \in \mathbb {V} \), if \({E} \vdash {t} \Downarrow _{m} {v}\), then \(r \in \gamma _{\textsf{m}^{\boldsymbol{\sharp }}\,\mathbb {A}}(\llbracket t \rrbracket ^{\textsf{eval}}_{\textsf{E}})\), where \(r = \textsf{Success}\,v\) when \(m = \textsf{val}\), and \(r = \textsf{Exception}\,v\) when \(m = \textsf{exn}\). The proof proceeds by induction on the evaluation judgement, generalising over m and \(\textsf{E} \). The only interesting case is the one for function application, which exploits the induction hypothesis, the post-fixpoint property of \(\textsf{eval}\) and the soundness of abstract inclusion \(\sqsubseteq \). All other cases result from the soundness of the abstract operations and from induction hypotheses. \(\square \)

The soundness theorem assumes that \(\textsf{eval}\) is a post-fixpoint, i.e., \(\llbracket t \rrbracket ^{\textsf{eval}}_{\textsf{E}} \sqsubseteq \textsf{eval} \, t \, \textsf{E} \). This property is ensured by the soundness of the fixpoint solver, that always returns a post-fixpoint. The function \(\textsf{eval}\) is, indeed, the result of the fixpoint solver called on the function \(\lambda \textsf{eval}. \lambda t. \lambda \textsf{E}. \llbracket t \rrbracket ^{\textsf{eval}}_{\textsf{E}}\).

6 An Abstract Interpreter for OCaml Programs

Based on the abstract interpreter of §5, we implemented a static analyser for OCaml programs (version 4.14), that returns a map from top-level identifiers of the program to their abstract values. Our prototype and its test suite (see §7) are available as a companion artefact [35].

We have implemented several optimisations, that are crucial to obtain decent performance. For example, nodes of the analysed AST are indexed by program points using unique integers as identifiers. This enables efficient comparison of sub-terms and allows using efficient data structures like Patricia trees [53]. Moreover, and this is of paramount importance for performance, we perform hash-consing of abstract values and memoise the operations on these abstract values.
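To give an idea of how hash-consing and memoisation interact, here is a minimal sketch on binary trees rather than on abstract values. It is an illustration under simplifying assumptions (a global strong hash table, no weak references), and every name in it is hypothetical: it does not reflect the data structures of the actual prototype.

```ocaml
(* Hash-consed binary trees: structurally equal trees share one node. *)
type tree = { id : int; node : node }
and node = Leaf of int | Pair of tree * tree

(* Keys only mention the identifiers of children, so lookups are shallow. *)
type key = KLeaf of int | KPair of int * int

let table : (key, tree) Hashtbl.t = Hashtbl.create 64
let next_id = ref 0

let hashcons (k : key) (n : node) : tree =
  match Hashtbl.find_opt table k with
  | Some t -> t
  | None ->
    let t = { id = !next_id; node = n } in
    incr next_id;
    Hashtbl.add table k t;
    t

let leaf i = hashcons (KLeaf i) (Leaf i)
let pair l r = hashcons (KPair (l.id, r.id)) (Pair (l, r))

(* A memoised operation, keyed on the unique identifier. *)
let size_cache : (int, int) Hashtbl.t = Hashtbl.create 64

let rec size (t : tree) : int =
  match Hashtbl.find_opt size_cache t.id with
  | Some s -> s
  | None ->
    let s =
      match t.node with
      | Leaf _ -> 1
      | Pair (l, r) -> 1 + size l + size r
    in
    Hashtbl.add size_cache t.id s;
    s
```

With this scheme, equality of two trees reduces to physical equality (or to comparing identifiers), and an operation is computed at most once per distinct tree.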

We present in the next sections some key implementation details that we needed to analyse OCaml programs.

6.1 Refinements With Respect to the Formal Presentation

The abstract interpreter we implemented follows the structure we have presented in §5.2, but implements three more refinements, that we purposely elided to keep the presentation easy to follow. A thorough presentation of these refinements would go beyond the scope of the current paper.

Context sensitivity. Our analyser is context sensitive: we implemented a form of call site sensitivity, that is akin to an abstraction of the call stack. Following [50], we retain full sensitivity until the list of call sites becomes maximal, i.e., when a program point appears more than once in that list, which may indicate a recursive call to some function. In addition, we always remember the last call site. In practice, the list of call sites is an additional parameter to the abstract interpreter. Following [50] again, we use this list of call sites to decide when widening on the environments should be performed: it is performed only when \(\textsf{eval}\) is called on a maximal list of call sites. The same list of call sites is also used to derive dynamic exception names and abstract pointers (see §6.4 and §6.5).

Flow sensitivity. Our abstract interpreter is able to exploit information that is learned when a branch in a \(\textsf {match}\) is taken, or when branching on an arithmetic test. For example, in the program \( \textsf{match}~{(x, y)}~\textsf{with} \; (\textsf{None}, \mathtt {\_}) \Rightarrow x \; \mid \; \mathtt {\_} \Rightarrow t \), our analyser is able to refine the possible environments, by taking into account that \(x = \textsf{None}\) in the first branch, and that this first branch necessarily returns the value \(\textsf{None}\). This is done by performing a backward analysis of the scrutinee (x, y). This backward analysis infers an over-approximation of the environment, knowing that the scrutinee successfully matched against the pattern \((\textsf{None}, \mathtt {\_})\).

Dynamic partitioning. Finally, we have employed a form of dynamic partitioning to avoid conflating some analysis results, which could degrade precision. Based on a notion of similarity on the shapes of abstract values found in environments, we decide whether to conflate contexts or not. The technique is inspired by the silhouettes used in shape analysis [39].

6.2 Transformation of Typed OCaml ASTs

The actual language that our interpreter takes as input is more complex than the one we presented in §3, but undoubtedly simpler than the OCaml AST. The main differences between our intermediate language and the OCaml AST are that we deal with only one construct for pattern matching, and only one construct for exception handling, and that those two constructs implement orthogonal features in our language. This is in contrast with OCaml's own constructs for matching and handling, which conflate pattern matching with exception handling. The transformation into our two constructs is mostly straightforward, and greatly simplifies the job of the static analyser.
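The conflation can be observed in plain OCaml, where a single `match` may contain both value cases and `exception` cases. The sketch below shows such a function and a hand-written variant in which handling and matching are separated. The second function only illustrates the idea of orthogonal constructs: it is not the translation performed by the analyser, which targets the dispatch construct of §3. The names `lookup` and `lookup_split` are ours.

```ocaml
(* A single OCaml [match] can mix value cases and exception cases. *)
let lookup (tbl : (string, int) Hashtbl.t) (k : string) : string =
  match Hashtbl.find tbl k with
  | 0 -> "zero"
  | n -> string_of_int n
  | exception Not_found -> "absent"

(* The same function with the two concerns separated: exception
   handling first, then an ordinary pattern matching on the outcome. *)
let lookup_split (tbl : (string, int) Hashtbl.t) (k : string) : string =
  let outcome = try Some (Hashtbl.find tbl k) with Not_found -> None in
  match outcome with
  | Some 0 -> "zero"
  | Some n -> string_of_int n
  | None -> "absent"
```

Both functions agree on every input, but only the second keeps the exception handler free of value patterns.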

Our intermediate language makes the evaluation order explicit using let bindings. While the evaluation order in OCaml is generally unspecified, we did our best to mimic the choices that the OCaml compiler makes.

We added specific application nodes for OCaml primitives. To ensure they are called with the correct arity, we inserted \(\lambda \)-abstractions when they were partially applied, or additional application nodes when they were given more arguments than expected. We also handled specifically the short-circuiting primitives on boolean expressions && and ||, as they change the evaluation order.

We kept the n-ary application nodes of the OCaml AST (instead of the binary applications from §3), as this is important for the semantics of labelled/optional function arguments. Nevertheless, the transformation from the OCaml AST into our intermediate language needed a lot of care and effort. In particular, missing labelled arguments required the insertion of \(\lambda \)-abstractions, which can be particularly subtle when interacting with optional arguments.

6.3 Pattern Disambiguation

The last major difference between OCaml and our intermediate language is the requirement that pattern matching be exhaustive and non-ambiguous. These properties not only simplify the semantics of our intermediate language, but also facilitate the analysis of programs. Indeed, each branch of the pattern-matching can be analysed independently of the other ones, whereas in OCaml, branches must be considered in order, until one pattern matches the inspected value. The OCaml type-checker still provides warnings to verify the utility of each branch and the exhaustiveness of the overall pattern matching.

Enforcing exhaustive and non-ambiguous pattern matching in OCaml would require the use of cumbersome patterns, and, furthermore, it is not always possible to write such patterns in OCaml. It is, indeed, allowed to match on values whose types may have infinitely many constructors, e.g., arrays, strings, or extensible variant types (see §6.4 for details). To meet these requirements, we extend the language of patterns with a complement \(p \setminus q\) [8]. A value v matches a pattern \(p \setminus q\) if and only if it matches p but not q. In an ordered pattern matching \(\textsf{match}~{t}~\textsf{with}~ p_1 \Rightarrow u_1 \,|\,\cdots \,|\, p_n \Rightarrow u_n\), we can express unambiguously that the value v of the term t matches the ith pattern. It suffices to add that v does not match any of the preceding patterns \(p_j\) with \(j<i\), i.e., \(p_i\setminus \left( \varSigma _{j<i}\, p_j \right) \mathbin {\prec \!\!\prec }v\).
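The effect of complements on an ordered list of patterns can be sketched as follows. The miniature pattern language and the functions `matches` and `disambiguate` are hypothetical and only cover the first step described above (subtracting the preceding patterns). They do not implement the reduction of complements into purely disjunctive patterns.

```ocaml
(* Values are constructor applications c(v1, ..., vn). *)
type value = V of string * value list

type pat =
  | Any                          (* variable or wildcard *)
  | C of string * pat list       (* constructor pattern *)
  | Or of pat * pat              (* disjunction p + q *)
  | Diff of pat * pat            (* complement p \ q *)

let rec matches (p : pat) (v : value) : bool =
  match p, v with
  | Any, _ -> true
  | C (c, ps), V (d, vs) ->
    c = d && List.length ps = List.length vs && List.for_all2 matches ps vs
  | Or (p, q), _ -> matches p v || matches q v
  | Diff (p, q), _ -> matches p v && not (matches q v)

(* Turn ordered patterns p1, ..., pn into pi \ (p1 + ... + p(i-1)). *)
let disambiguate (ps : pat list) : pat list =
  let step (seen, acc) p =
    let p' = match seen with None -> p | Some s -> Diff (p, s) in
    let seen' = match seen with None -> p | Some s -> Or (s, p) in
    (Some seen', p' :: acc)
  in
  List.rev (snd (List.fold_left step (None, []) ps))

let none = V ("None", [])
let some v = V ("Some", [v])

(* Overlapping patterns, as one would write them in an ordered match. *)
let ordered = [C ("Some", [C ("None", [])]); C ("Some", [Any]); Any]

(* Number of disambiguated patterns that match a given value. *)
let count (v : value) : int =
  List.length (List.filter (fun p -> matches p v) (disambiguate ordered))
```

After disambiguation, each value is matched by exactly one pattern, so the branches can be examined in any order.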

The method presented in [8] shows how to solve the disambiguation problem [32]. It relies on the notion of pattern semantics \(\llbracket p \rrbracket \), that is, the set of values matched by a pattern: \(\llbracket p \rrbracket = \{ v\in \mathbb {V} \mid p \mathbin {\prec \!\!\prec }v \}\). The idea is to reduce any pattern p into a purely disjunctive pattern q, i.e., a pattern containing no complements \(\setminus \), while preserving its semantics: \(\llbracket p \rrbracket = \llbracket q \rrbracket \). The reduction relies on rewriting rules that correspond to algebraic laws of set theory: a constructor c behaves like a labelled cartesian product, the disjunction \(+\) like set union, and the complement \(\setminus \) like set difference. Note that the pattern language proposed in §3 conflates the different forms of OCaml constructors (constructor variant, polymorphic variant, records, arrays and tuples) as they behave similarly w.r.t. their semantics.

In order to fully reduce a pattern, the method also relies on the observation that a variable \({x}_{\tau }\) of a variant type \(\tau \) must be matched by a value whose head is a constructor of the type \(\tau \). Therefore, the semantics of this variable \({x}_{\tau }\) can be described as the union of semantics of all constructor instances of \(\tau \): \(\llbracket {x}_{\tau } \rrbracket =\bigcup _{c\in \mathcal {C}_{\tau }} \llbracket c(z_1,\ldots ,z_n) \rrbracket \), where \(\mathcal {C}_{\tau }\) is the finite set of constructors of co-domain \(\tau \). Similarly, the utility [40] approach, implemented in the OCaml compiler, relies on the ability to enumerate all the constructors of a type to provide a non-ambiguous description of the useful patterns. For types that may not be finitely described, the semantic approach can still be used to partially reduce the complements [7]. We keep anti-patterns (patterns of the form \(x\setminus q\) where q contains no complements) when there exists a value v such that \(x\setminus q \mathbin {\prec \!\!\prec }v\).

Finally, to guarantee the exhaustiveness of pattern matching, it suffices to add a rule \(z\setminus (p_1 +\cdots +p_n) \Rightarrow \textsf{raise}~{\mathsf {Match\_failure}}\) when necessary. Again, generating such a non-ambiguous rule, for data types that may not be finitely described, is only possible thanks to pattern complements.

6.4 Dynamic Exceptions

The exception type in OCaml is an extensible variant type: it can be dynamically extended with new variant constructors. This means that new exception constructors are dynamically generated during the execution of programs. Although this section focuses on the exception type, the techniques we present apply to any extensible variant type as well.

To model the dynamic behaviour of type extension, we introduce dynamic constructors, written \(\overline{c}\), that, unlike static constructors c, are dynamically associated to a variant name d during the evaluation. We update the language of §3 and its semantics to support these dynamic constructors (Figure 3).

Fig. 3. Changes to support dynamic exception naming (excerpts).

The \(\mathsf {let~exception}~\overline{c} ~\textsf{of}~ {\tau _1\mathrel {*}\cdots \mathrel {*}\tau _n}~\textsf{in}~{t}\) construct defines the new exception constructor \(\overline{c}\), that is dynamically bound to a fresh variant name in the sub-term t. The exception alias construct \(\mathsf {let~exception}~\overline{b} = \overline{c}~\textsf{in}~{t}\) defines the exception constructor \(\overline{b}\), that is bound in the sub-term t to the variant name of \(\overline{c}\). Constructed values can now have a dynamic variant name as their head constructor.
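The generative behaviour of locally defined exceptions can be observed directly in OCaml. In the sketch below, the hypothetical function `make` declares a local exception and returns a builder and a reader for it. Two calls to `make` yield two constructors that share a source name but not a variant name.

```ocaml
(* Each call evaluates [let exception] afresh, so each call yields a
   distinct constructor, even though the source name is the same. *)
let make () : (int -> exn) * (exn -> int option) =
  let exception Local of int in
  ((fun n -> Local n), (function Local n -> Some n | _ -> None))

let (build1, read1) = make ()
let (_build2, read2) = make ()

(* The reader of the first call recognises values built by the first call. *)
let same_instance = read1 (build1 3)

(* The reader of the second call does not recognise them. *)
let other_instance = read2 (build1 3)
```

This is the situation that the execution state of the semantics accounts for: the pattern `Local n` in each reader refers to the variant name generated by its own call.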

To account for the generative aspect of dynamic constructors, the evaluation rules now carry an execution state \(S \), that contains the set of the already generated variant names. These are akin to the time-stamps from the CFA literature [25, 44], that are used to allocate data in memory locations. In the analysis, we use an over-approximation \(\delta \) of the list of call sites (which we already used in §6.1 to control the widening strategy) to give abstract names \((c,\delta )\) to dynamic constructors.

Finally, as the variant name of an exception constructor is resolved dynamically, the pattern matching relation depends on the evaluation environment \(E \): \(\overline{c}(p_1,\ldots ,p_n) \mathbin {\prec \!\!\prec }d(v_1,\ldots ,v_n)\) if and only if \(E (\overline{c})=d\), and \(p_i \mathbin {\prec \!\!\prec }v_i\) for all \(i\in [1,n]\).

As the exception type is extensible, a finite number of constructor patterns never forms an exhaustive set of patterns for the exception type. Therefore, the utility approach on pattern matching [40] used in OCaml for exhaustiveness checking cannot provide an exhaustive list of non-ambiguous counter-examples: that list is not known statically. In contrast, the disambiguation approach from §6.3 is particularly well suited to such types, by leveraging anti-patterns [7]. Moreover, the equality of two exception constructors \(\overline{b}\) and \(\overline{c}\) of the same arity can only be resolved dynamically. Therefore, there is no way to statically prove, or disprove, the utility of a pattern \(\overline{b}(q_1,\ldots ,q_n)\) against a pattern \(\overline{c}(p_1,\ldots ,p_n)\). On the other hand, in our pattern formalism, we can simply write \(\overline{b}(q_1,\ldots ,q_n) \setminus \overline{c}(p_1,\ldots ,p_n)\) to guarantee the non-ambiguity between the two.

6.5 Mutable Records and Global State

OCaml supports mutable records. While immutable records can be modelled in the programming language of §3 in the form of constructs (an immutable record is a variant with a single case), mutable records require extending the semantics with a global memory heap \(S \) (Figure 4).

Fig. 4. Changes to support mutable records (excerpts).

Heaps are maps from memory locations \(\ell \) to record blocks. Record blocks are structured memory blocks, that contain values for all the registered fields of the record. The standard notion of reference can be modelled as a mutable record with a single field. This is exactly how the type of references is defined in OCaml.

We adapt the big-step semantics in a standard way, so that it takes a heap as input and returns an updated heap as output. The evaluation rules for record creation, access, and update, either query or modify the memory heap as expected.

OCaml features pattern matching on mutable records. We adapt the rules for pattern matching, so that matching on a mutable record first queries the memory heap to retrieve the values for the fields of the record, before matching continues.
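The following short OCaml program illustrates the features discussed above: a mutable record with a single field, pattern matching on it, and the fact that built-in references expose the same record shape through their `contents` field. The type `counter` and the functions `bump` and `describe` are hypothetical examples.

```ocaml
(* A one-field mutable record, mirroring how the standard library
   defines references: type 'a ref = { mutable contents : 'a }. *)
type counter = { mutable count : int }

let bump (c : counter) : unit = c.count <- c.count + 1

(* Matching on a mutable record reads the field at matching time. *)
let describe (c : counter) : string =
  match c with
  | { count = 0 } -> "fresh"
  | { count = _ } -> "used"

let c = { count = 0 }
let before = describe c
let () = bump c
let after = describe c

(* Built-in references can be updated through their record field. *)
let r = ref 41
let () = r.contents <- r.contents + 1
```

The two calls to `describe` inspect the same record block but observe different field values, which is why the matching rules need access to the heap.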

To analyse programs that involve mutable records, we add a new field to abstract values, that contains the possible abstract locations \(\ell ^{\sharp }\) a value might be equal to. Abstract locations denote sets of concrete locations. Similarly to the dynamic extension constructors of §6.4, fresh abstract locations are chosen by following a naming scheme that is based on the abstract call stack.

The abstract interpreter is easily adapted to support global state, by lifting the abstract exception monad to the state monad, where states are abstract heaps. Abstract heaps map abstract locations to abstract record blocks, that themselves map record fields to abstract values. The operations on abstract heaps and the transfer functions on records are standard, and elided from the presentation.

6.6 Modules and Functors

The OCaml language includes an expressive module system [36], that supports hierarchical structures, higher-order functors, and first-class modules. In this section, we give the reader the main insights for the analysis of OCaml modules.

First, we consider an untyped semantics of modules, i.e., we do not propagate type information. In particular, we do not take type abstraction boundaries into account. We are careful to keep track of module coercions, however: signature ascriptions may, indeed, have computational content, as they can remove some module fields. Coercions are automatically applied at functor applications to "reshape" the functor argument. Coercions distribute on functors, contravariantly on their formal arguments, and covariantly on their results.

Embracing further the untyped nature of our approach, we made the choice of having a single class of values, that comprises both values from the core language and values for module structures and functor closures. This simplifies both the concrete semantics (for example, transfers from the module language to the core language and back are no-ops), and the design of the abstract domain. As we sketched in the previous sections, it suffices to add new fields to abstract values to describe the possible structures and functor closures.

We represent structures as unordered records, i.e., maps from field names to values. Functor closures hold the functor code, an environment, and coercions for the argument and the result, that shall be applied when the functor is called.

Importantly, the support of dynamic exceptions (§6.4) was required to support functors, since an exception might be declared in a functor's body: this leads to the creation of a fresh exception every time this functor is instantiated.
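This behaviour can be reproduced with a small OCaml program. The functor `Make` below is a hypothetical example, written as a generative functor for simplicity. Its two instances declare exceptions that have the same name but are distinct at runtime.

```ocaml
(* Every instantiation of the functor creates a fresh exception. *)
module Make () = struct
  exception Failure_in_instance
  let fail () = raise Failure_in_instance
end

module A = Make ()
module B = Make ()

(* The handler for B's exception does not catch the one raised by A. *)
let caught_by =
  try A.fail () with
  | B.Failure_in_instance -> "B"
  | A.Failure_in_instance -> "A"
```

The first handler is skipped even though both constructors originate from the same declaration in the source of the functor.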

The analysis functions for the core language and the module language, of types \(\mathbb {T} \rightarrow \mathbb {E} \rightarrow \mathbb {A} \) and \(\mathbb {M}\rightarrow \mathbb {E} \rightarrow \mathbb {A} \), are mutually recursive. Still, the approach of using a fixpoint solver to define our abstract interpreter remains applicable. The two functions can be transformed into a single function of type \((\mathbb {T} +\mathbb {M})\rightarrow \mathbb {E} \rightarrow \mathbb {A} \), then given to the solver, and split back into two functions. Our untyped approach was again crucial, as we could keep a single type of abstract values, and a single type of abstract environments, which made the previous transformation possible.
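The merge-and-split transformation can be sketched on a toy example. The syntax below (`term`, `modul`), the use of integers in place of abstract values, and the omission of environments are simplifying assumptions made for the sake of a short, runnable illustration. The combinator `fix` stands in for the fixpoint solver.

```ocaml
(* Hypothetical miniature syntax: core terms and module terms. *)
type term = Lit of int | Proj of modul
and modul = Struct of term | Ident of modul

(* The sum type T + M. *)
type either = T of term | M of modul

(* A single open-recursive functional over the sum type: calls across
   the two syntactic classes all go through [self]. *)
let analyse_open (self : either -> int) (x : either) : int =
  match x with
  | T (Lit n) -> n
  | T (Proj m) -> self (M m)
  | M (Struct t) -> self (T t)
  | M (Ident m) -> self (M m)

(* Stand-in for the fixpoint solver. *)
let rec fix f x = f (fix f) x

(* Solve once, then split the result back into two functions. *)
let solved = fix analyse_open
let analyse_term (t : term) : int = solved (T t)
let analyse_module (m : modul) : int = solved (M m)
```

Because both functions return the same type of result, a single solver instance is enough, which mirrors the role played by the single type of abstract values in the text.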

7 Experiments

We tested our prototype analyser for OCaml programs on 290 programs, that range from small, manually written programs, to larger examples extracted from the literature or from the OCaml compiler's test suite. The test programs include some classic functions such as the factorial program from §2, Takeuchi's function, McCarthy's 91 function, fixpoint combinators, programs that compute over Church numerals, transformations of abstract syntax trees for arithmetic expressions or logical formulas, and the algorithm for Knuth-Bendix completion of rewriting systems. The test suite covers a large array of coding styles, e.g., direct style, continuation-passing style, monadic style, or imperative style, and exhibits different language features, e.g., assertions, exception-based control flow, GADTs and non-regular types, polymorphic recursion, second-order polymorphism, etc.

We present in Table 1 a selection of the test results on some key examples. The complete test results are reproducible via the companion artefact [35]. The experimental results are encouraging, both in terms of performance and precision.

Table 1. Experiments: size of the programs, analysis time (with minimisation disabled, and enabled). Programs are sorted by decreasing size.

In terms of precision, our analyser infers the best achievable abstract values on several programs: for McCarthy's 91 function mc91, the result is shown to be greater than or equal to 91; for the skolemisation of logical formulas skolemize, the analyser correctly infers the form of returned terms, i.e., they cannot contain existential quantifiers. For other programs, the analyser only infers an over-approximation: for the red_black_tree program, it correctly infers the general shape of trees, but cannot infer the structural invariant that no red node has red children.

The map_merge example calls the functor of finite maps from the standard library, builds several maps, and calls the function that merges maps on them. This function has the following signature:

figure r

Its first argument specifies what should be done when a key/value pair is found in one of the maps, or in both. This argument is never called for keys that are absent in both maps, i.e., the case where the second and third arguments both denote an absent binding is unreachable. OCaml programmers often write a failing assertion in the corresponding pattern matching branch. The analyser infers that the associated exception is never raised, which means that this branch cannot be reached. The analyser cannot show, however, that every assertion present in the library code is satisfied: in the re-balancing function for pseudo-balanced trees, assertion failures are reported, because the analyser cannot infer that the heights that are recorded in the trees are strictly positive.
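For concreteness, here is a small program in the spirit of this example. It assumes that the function under discussion is `merge` from the standard library's `Map.Make` functor, which is our reading of the text. The function `union_left` and the maps `m1` and `m2` are hypothetical and are not taken from the test suite.

```ocaml
module IntMap = Map.Make (Int)

(* Keep the left binding when present, otherwise the right one. *)
let union_left (a : string IntMap.t) (b : string IntMap.t) : string IntMap.t =
  IntMap.merge
    (fun _key l r ->
       match l, r with
       | Some x, _ -> Some x
       | None, Some y -> Some y
       | None, None -> assert false (* key absent from both maps *))
    a b

let m1 = IntMap.add 1 "one" (IntMap.singleton 2 "two")
let m2 = IntMap.add 3 "three" (IntMap.singleton 2 "deux")

(* Bindings of the merged map, in increasing key order. *)
let merged = IntMap.bindings (union_left m1 m2)
```

The last branch is the one a programmer expects to be dead code. Establishing this statically requires reasoning about how the library traverses both maps.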

In terms of performance, most examples, including some large programs, are analysed in a couple of seconds or less. In contrast, some examples like boyer need approximately one hour for the analysis to terminate. boyer is a tautology checker that is run on a large formula (its definition takes about 1000 lines). This formula, of mutable type, requires the creation of several hundred abstract pointers, which makes operations on abstract heaps very costly. If we reduce context sensitivity to “the last call site”, fewer abstract pointers are created, and the analysis completes in 31 s. This suggests that context sensitivity choices for naming abstract pointers need further investigation.

Our experiments show that the minimisation of abstract values during widening and unions (§4.2) may impact performance positively or negatively. For instance, for AST transformations like skolemize and negative_normal_form, minimisation decreases the analysis time from about 45 min down to a few seconds. For boyer, however, minimisation incurs a heavy cost, as it doubles the analysis time. Further investigations are needed to reduce the cost of minimisation.

8 Related Work

The static detection of uncaught exceptions for ML programs has been the topic of much related work. We only discuss a selection of it, together with some results on the static analysis of functional programs that are also relevant to the current work.

Set Constraints. Several static analyses for functional programs were based on set constraints [21]. The principle is to transform a program into a constraint that features unions, intersections, negations, and a form of conditional constraint. Then, the constraint is simplified and given to a solver, from which the analysis result is obtained. Fähndrich and his coauthors built an exception analysis tool that infers types and effects for SML programs [14, 15]; it relies on the BANE constraint analysis engine and on a mix of set constraints and type constraints.

Type and Effect Systems. Pessaux and Leroy have developed ocamlexc [38, 54, 55], a tool that detects uncaught exceptions in OCaml programs. They use a type and effect system to analyse programs modularly. Their analyser extends unification-based type inference, and makes use of row variables [57] and polymorphism to produce precise types for functions. They type variants structurally using equi-recursive types. Recursion may also occur through the effect annotations on arrow types. They also describe an algorithm to improve the accuracy of their analysis, which uses polymorphic recursion for row variables. The programming language Koka [33] also leverages row variables to type algebraic effects. Recently, de Vilhena and Pottier [62] devised a type system based on row variables for a language that supports the dynamic creation of algebraic effects.

Control-Flow Analyses. Control-flow analyses (CFA) [19, 45, 51, 60, 65] form an important family of analyses for higher-order programs. The goal of CFA is to determine which functions might be called at a call site, and on which arguments. CFA can be expressed as instances of abstract interpretation [44, 46, 47, 50]. CFA can easily be extended to analyse exceptions. Yi developed an abstract interpreter that detects uncaught exceptions in SML [66,67,68]. It implements an analysis that is close to a 0-CFA analysis extended to support exceptions.

Abstract Domains in CFA. Most previous work on CFA shares a common representation for abstract values: although these analyses need to represent some inductively defined sets, they refrain from using a native device to express fixpoints, such as our \(\mu \) constructor. Instead, cyclic definitions are encoded using indirections through abstract pointers that point to an abstract heap. For example, the inductive set of continuations from §2 is expressed as follows in CFA domains:

$$ \begin{array}{l} \left\{ \mathsf {funs:}\, \left\{ (\lambda ^{} x . \, x) \mapsto \{\}; \; (\lambda ^{} x . \, k~(x * i)) \mapsto \{i \mapsto p_{i} ; k \mapsto p_{k}\} \right\} \right\} \\ \text {where:} \quad \begin{array}{l} \hat{h}(p_{i}) = \{ \mathsf {ints:}\, [1,+\infty ] \} \\ \hat{h}(p_{k}) = \left\{ (\lambda ^{} x . \, x) \mapsto \{\}; \; (\lambda ^{} x . \, k~(x * i)) \mapsto \{i \mapsto p_{i} ; k \mapsto p_{k}\} \right\} \end{array} \end{array} $$

In this abstract value, the closures’ environments contain the pointers \(p_i\) and \(p_k\), which are defined in the abstract heap \(\hat{h}\). This abstract heap contains a cycle, since \(p_k\) is used in the definition of the abstract value pointed to by \(p_k\). This is in contrast to our approach, where we make use of \(\mu \) nodes to introduce cycles directly, without referring to a heap. We only use the abstract heap for mutable data. In CFA domains, all data (constructs, closures, etc.) are “abstractly allocated” in the abstract heap, regardless of whether they are mutable or not.

A benefit of the approach with heap indirections is that abstract values have a bounded height, and cycles need no special treatment: the equality of abstract pointers is used to compute on abstract values. While this makes the operations of CFA abstract domains easy to define, using pointer names drastically limits the detection of semantically equivalent values. We argue that our approach detects more semantic inclusions, thereby decreasing the number of iterations of the analyser, at the cost of more complex abstract domain operations.
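The two styles of representation can be contrasted with a small OCaml sketch. Everything in it is hypothetical and heavily simplified: the types `heap_value` and `mu_value`, the use of strings for closures and pointers, and in particular the value `continuations`, which is only our guess at how the same set could be written with a binder; none of it is taken from the analyser's implementation.

```ocaml
(* Hypothetical CFA-style domain: closure environments hold pointer names. *)
type ptr = string

type heap_value =
  | Ints of int * int option                      (* interval; None = +infinity *)
  | Funs of (string * (string * ptr) list) list   (* closure body, environment *)

(* The abstract heap of the example above. *)
let heap : (ptr * heap_value) list =
  let closures =
    Funs [ ("fun x -> x", []);
           ("fun x -> k (x * i)", [ ("i", "p_i"); ("k", "p_k") ]) ]
  in
  [ ("p_i", Ints (1, None)); ("p_k", closures) ]

(* Pointers mentioned directly by the value stored at [p]. *)
let successors (p : ptr) : ptr list =
  match List.assoc_opt p heap with
  | Some (Funs cs) -> List.concat_map (fun (_, env) -> List.map snd env) cs
  | Some (Ints _) | None -> []

(* [on_cycle p] holds when [p] is reachable from itself through the heap. *)
let on_cycle (p : ptr) : bool =
  let rec search visited = function
    | [] -> false
    | q :: rest ->
      if q = p then true
      else if List.mem q visited then search visited rest
      else search (q :: visited) (successors q @ rest)
  in
  search [] (successors p)

(* Hypothetical binder-based domain: the cycle is expressed by [Mu]/[Var],
   and environments hold abstract values directly, without any heap. *)
type mu_value =
  | MInts of int * int option
  | MFuns of (string * (string * mu_value) list) list
  | Mu of string * mu_value
  | Var of string

let continuations : mu_value =
  Mu ("a",
      MFuns [ ("fun x -> x", []);
              ("fun x -> k (x * i)",
               [ ("i", MInts (1, None)); ("k", Var "a") ]) ])
```

In the first style, comparing two values amounts to comparing pointer names, whereas in the second style two structurally different terms may denote the same set, which is where the more complex domain operations come from.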

Tree Grammars. Several analyses for functional languages have been defined using tree grammars. For example, Reynolds [58] defined an analysis for pure first-order LISP using data sets, i.e., tree grammars that denote the possible outputs of function symbols. Extended tree grammars, i.e., grammars with selectors of the form \(X \rightarrow Y.hd\), have been used by Jones and his coauthors to analyse full LISP [28], and, later, strict and lazy \(\lambda \)-calculi [26, 27]. From a \(\lambda \)-term, they produce tree grammars with selectors that denote the possible inputs and outputs of function symbols. Selectors can then be eliminated in order to simplify the grammars. Deterministic tree grammars have been identified as an abstract domain to recast analyses based on set constraints into the abstract interpretation framework [10].

Tree Automata. Generalising string automata, tree automata are an established formalism to represent sets of trees. They have been used to define static analysers for term-rewriting systems (TRSs) [3] and higher-order programs [20]. They have been extended to lattice tree automata to support arbitrary non-relational abstract domains at their leaves [17, 18], and to improve the performance of analysers for TRSs. Recently, tree automata were combined with relational numeric abstract domains [29], to express relations between scalar data contained in trees. Recent work reports on the design of relational domains for algebraic data types [2, 61].

Cyclic Abstract Domains. Type graphs [22] are a form of deterministic tree grammars that are represented as cyclic graphs with no sharing, i.e., trees with cycles. They have been used to analyse Prolog programs. We used a similar graph-based representation as an intermediate form to compute union, intersection and widening. We use, however, a term-based representation with binders as our main representation, as it allows easy and efficient hash-consing and memoisation [13]. Our widening operator (§4.2) is inspired by the one from type graphs.

Mauborgne [41,42,43] studied graph-based abstract domains for sets of trees, and defined ways to have minimal, canonical representations of such abstract values. Using Mauborgne’s structures natively could improve our analyser’s performance, as we could avoid translating back and forth from terms to graphs.

Finally, recursive types [56] were a strong inspiration for the abstract domain of §4. Recursive types have been thoroughly studied in the context of subtyping [1, 16, 31], where polynomial algorithms have been devised to decide inclusion. They proceed by translating types into variants of tree automata that can also deal with the contravariance of arrow types.

Fixpoint Solvers. To the best of our knowledge, Le Charlier and Van Hentenryck [6] were the first to exploit dynamic fixpoint solvers to define static analysers. They used the top-down solver to analyse Prolog programs. The same approach has been followed for the Goblint static analyser for C programs [59, 64], and for the analysis of WebAssembly programs [4]. Recent work introduced combinators to define dynamic fixpoint solvers in a modular manner [30]. Several dynamic fixpoint solvers have been successfully formally verified [24, 63].

9 Conclusive Remarks and Future Work

We have introduced a \(\lambda \)-calculus that features pattern matching primitives and exception handling, in which exceptions are first-class citizens. We have presented a static analysis for this language, in the form of a monadic abstract interpreter, that can be used as an effective static analyser. This analyser detects uncaught exceptions, and provides a description of the values that a program may return. The abstract interpreter relies on a generic abstract domain, that is parameterised over a domain for scalars, and that can represent regular sets of values of our programming language. This is achieved by a fixpoint constructor in the syntax of abstract values, that denotes an inductive set of values.

The abstract interpreter is defined in an open recursive style, where the recursive knot is tied by calling a dynamic fixpoint solver. Importantly, the analyser does not call the solver for every recursive call: it performs standard recursive calls on strict sub-terms, but calls the solver to analyse function calls.
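The division of labour described above can be sketched on a toy, concrete evaluator. This is an illustration under strong assumptions, not the abstract interpreter itself: the language `expr`, the functions `eval` and `run`, and the program `defs` are hypothetical, and a plain memo table stands in for the dynamic fixpoint solver, which would additionally iterate and widen until stabilisation.

```ocaml
(* A tiny hypothetical first-order language. *)
type expr =
  | Int of int
  | Arg                               (* argument of the enclosing function *)
  | Add of expr * expr
  | IfZero of expr * expr * expr
  | Call of string * expr             (* call to a named function *)

(* Open-recursive evaluator: strict sub-terms are handled by ordinary
   recursion on [eval], whereas function calls are delegated to [call]. *)
let rec eval (call : string -> int -> int) (arg : int) (e : expr) : int =
  match e with
  | Int n -> n
  | Arg -> arg
  | Add (a, b) -> eval call arg a + eval call arg b
  | IfZero (c, t, f) ->
    if eval call arg c = 0 then eval call arg t else eval call arg f
  | Call (name, a) -> call name (eval call arg a)

(* Ties the recursive knot. The memo table is only a stand-in for a solver:
   it records one result per (function, argument) pair. *)
let run (defs : (string * expr) list) (name : string) (n : int) : int =
  let table = Hashtbl.create 16 in
  let rec call name n =
    match Hashtbl.find_opt table (name, n) with
    | Some v -> v
    | None ->
      let v = eval call n (List.assoc name defs) in
      Hashtbl.add table (name, n) v;
      v
  in
  call name n

(* sum n = if n = 0 then 0 else n + sum (n - 1) *)
let defs =
  [ ("sum", IfZero (Arg, Int 0, Add (Arg, Call ("sum", Add (Arg, Int (-1)))))) ]
```

Only the `Call` case goes through the table, so the number of table lookups grows with the number of function calls rather than with the size of the terms.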

Based on this approach, we implemented a static analyser for OCaml programs. We presented some extensions of our formalism to support several core features of OCaml, including dynamic generation of exceptions, mutable records, and the module system. Our analyser starts by transforming the OCaml typed AST into a simpler language where evaluation order is explicit. This transformation required a lot of care and demanded a substantial implementation effort. One key aspect of this transformation is the disambiguation of pattern matching, as we chose to work with an exhaustive and non-ambiguous pattern matching primitive in order to simplify the analysis of programs.
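The idea behind disambiguating pattern matching can be illustrated at the source level. The sketch below is hypothetical: the analyser's actual transformation and target primitive are not shown in this section, and the type `t` and the functions `ordered` and `disjoint` are introduced here only for illustration.

```ocaml
type t = A | B | C

(* Source style: later cases overlap with earlier ones, so the meaning of
   the match depends on the textual order of the cases. *)
let ordered (x : t) (y : t) : int =
  match x, y with
  | A, _ -> 0
  | _, A -> 1
  | _, _ -> 2

(* Disambiguated style: the cases are exhaustive and pairwise disjoint, so
   every input matches exactly one case and the order is irrelevant. *)
let disjoint (x : t) (y : t) : int =
  match x, y with
  | A, (A | B | C) -> 0
  | (B | C), A -> 1
  | (B | C), (B | C) -> 2

(* The two functions agree on every pair of inputs. *)
let () =
  let all = [ A; B; C ] in
  List.iter
    (fun x -> List.iter (fun y -> assert (ordered x y = disjoint x y)) all)
    all
```

With disjoint cases, an analysis can treat each branch independently instead of accounting for the inputs already captured by earlier branches.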

Our experiments on 290 OCaml programs show some encouraging results, both in terms of performance and precision. Still, some improvements are needed for the analysis to be applicable to larger code bases. In particular, the minimisation of abstract values requires some more study and fine tuning: while it plays a crucial role to analyse some examples in a reasonable time, it can also severely undermine the analyser’s performance in some other cases.

At the moment, the analyser can deal with whole programs only. To analyse libraries more modularly, we plan to experiment with generating abstract values that over-approximate the inputs of a library’s functions, based on their types. In the near future, we also plan to extend the analyser with OCaml features that are yet to be supported (e.g., arrays, laziness, floats, objects, recursive modules, interactions with the operating system, etc.), most of which will require substantial formalisation and implementation efforts. Recently introduced features, such as algebraic effects and one-shot continuations, are also on our agenda, and are likely to raise interesting challenges.

Finally, we hope that our abstract interpreter can be extended to perform other kinds of static analyses for OCaml programs, such as a purity analysis, or the detection of whether the behaviour of a program might depend on the order of evaluation. We would also like our implementation to serve as a basis for experimenting with recent relational domains for trees and scalars [2, 29, 61], and with relational analyses of functional programs [49].