1 Introduction

Anti-unification problems (a.k.a. generalization problems) consist in finding a least general generalization (lgg) of two or more given expressions. This problem has interesting applications in computer science and software engineering, such as symbolic mathematical computing [21], proof generalization [10], and clone detection [8], among others; an overview is given in [6]. Early proposals to apply generalization for analyzing and improving programs by syntactic manipulations were given by Plotkin [12] and Reynolds [13].

We are interested in the anti-unification problem for languages with binders, such as the lambda-calculus, the pi-calculus, or the more general nominal language [11]. For instance, \(\lambda x. Z\) is a generalization of the lambda-expressions \(\lambda a.app(a,a)\), \(\lambda a.\lambda b. a\), and \(\lambda c.c\). In fact, from \(\lambda x. Z\) one can retrieve any of the three expressions in the set by considering the appropriate instance of Z (where capturing is permitted), modulo renaming of bound variables: \(Z\mapsto app(x,x)\), \(Z\mapsto \lambda b.x\) and \(Z\mapsto x\), respectively.

In the context of languages with recursive let (letrec), techniques for solving anti-unification problems would allow one, for instance, to identify the program scheme \(\texttt{letr}~b.(\lambda x.N); a.(\lambda x.M)~\texttt{in}~ b(y)\) as a generalization of the program [1]

$$\begin{aligned} \texttt{letr}~&even.(\lambda x.\ \textsf{if}\text {-}\textsf{else }\ (x = 0)\ (\texttt{true})\ (odd(x-1)));\\&odd.(\lambda x.\ \textsf{if}\text {-}\textsf{else }\ (x = 0)\ (\texttt{false})\ (even(x-1)))\\&~\texttt{in}~ (even~y) \end{aligned}$$

or even identify both fragments of programs as possible clones [8].

In general, and as illustrated above, reasoning and automated deduction in higher-order languages often require – as a very basic operation – identifying expressions up to \(\alpha \)-equivalence. This means that expressions are identified if they are syntactically equal up to a renaming of bound variables (which represent the binding structure). In addition, one has to bear in mind that the letrec construct also satisfies laws like commutativity and associativity of its environment (e.g. we could permute the environment \(b.(\lambda x. N); a.(\lambda x.M)\) above as \( a.(\lambda x.M); b.(\lambda x. N)\)), which work in combination with the binding primitives (i.e., we may also rename the bindings within the environment, obtaining, e.g., \(c.(\lambda x.M'); d.(\lambda x. N')\)), and letrec environments may also occur nested.

Checking expressions for \(\alpha \)-equivalence is an operation that is often performed on large and complex expressions. Ad-hoc algorithms for checking \(\alpha \)-equivalence of such expressions are worst-case exponential due to searching for all possible permutations and renamings. An approach to handle \(\alpha \)-equivalence in deduction systems is to use nominal techniques [5, 11], where the focus is to ease formula specification and deduction rather than to speed up \(\alpha \)-equivalence checking. In general, checking \(\alpha \)-equivalence in the language extended with letrec using nominal techniques is a GI-hard problem [18]. Here, we follow the nominal approach to handle binding of names and their renaming.

In [17] we proposed a semantic approach to anti-unification based on nominal techniques which uses atom-variables; it significantly improves an existing approach [4] to anti-unification for languages with binders, since it provides a finitary set of least general generalizations. In this work we propose a simplification of this semantic approach for a nominal language extended by the letrec construct, which we call \(\texttt{NLL}_X\).

Our Results. We provide a nominal anti-unification algorithm (AntiUnifLetr) for \(\texttt{NLL}_X\) which preserves the good properties of our semantic approach: it is terminating, sound, and weakly complete (Theorem 2), and it may compute an exponential number of generalizations (Theorem 1). Completeness is achieved after further specialization of the computed generalizations (Theorem 3).

The observation that garbage might be present in letrec expressions (for example, useless bindings in environments), and that it can be avoided by a semantically correct garbage collection algorithm, allows us to apply the results and methods of [18], which show that \(\alpha \)-equivalence checking and related algorithms can be considerably improved for garbage-free expressions. This leads to the design of AntiUnifNoGarbage, an anti-unification algorithm for ground garbage-free expressions that is terminating, runs in polynomial time, and produces one least general generalization, i.e. it is unitary (Theorem 4).

2 Preliminaries

We consider a countably infinite set of atoms \({\mathbb {A}}\) of (concrete) symbols \(a, b\), which we usually denote in a meta-fashion; so we can use the symbols \(a, b\) also with indices (they play the role of the variables of the lambda-calculus). We also consider a set \(\mathcal{F}\) of function symbols with arity \( ar (\cdot )\), and a countably infinite set of expression-variables \({ Var}\) ranged over by X, Y. We will use mappings on atoms from \({\mathbb {A}}\): a swapping (a b) is a bijective function that maps atom a to atom b, atom b to a, and is the identity on all other atoms. We will also use finite permutations \(\pi \) on atoms from \({\mathbb {A}}\), which are compositions of swappings: in fact, every finite permutation \(\pi \) can be represented by a composition of at most \((| dom (\pi )|- 1)\) swappings, where \( dom (\pi ) = \{a \in {\mathbb {A}} \mid \pi (a) \not = a\}\). The identity permutation is denoted Id. The composition \(\pi _1 \circ \pi _2\) and the inverse \(\pi ^{-1}\) can be computed immediately, with complexity polynomial in the size of \( dom (\pi )\).
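
To make these permutation operations concrete, the following Haskell sketch (our own illustration; the types Atom, Swap, Perm and all function names are ours, not taken from the paper) represents a permutation as a composition of swappings and implements its action, composition, inverse, and domain.

```haskell
import Data.List (nub)

type Atom = String
type Swap = (Atom, Atom)   -- the swapping (a b)
type Perm = [Swap]         -- a composition of swappings; the rightmost acts first

-- Action of a single swapping on an atom.
swapAtom :: Swap -> Atom -> Atom
swapAtom (a, b) c
  | c == a    = b
  | c == b    = a
  | otherwise = c

-- Action of a permutation on an atom.
applyPerm :: Perm -> Atom -> Atom
applyPerm p c = foldr swapAtom c p

-- Composition pi1 . pi2 (pi2 acts first) and inverse.
composePerm :: Perm -> Perm -> Perm
composePerm p1 p2 = p1 ++ p2

invertPerm :: Perm -> Perm
invertPerm = reverse           -- every swapping is its own inverse

-- dom(pi): the atoms that pi actually moves.
domPerm :: Perm -> [Atom]
domPerm p = nub [ a | (x, y) <- p, a <- [x, y], applyPerm p a /= a ]
```

For example, applyPerm [("a","b"),("b","d")] "b" yields "d", since the swapping (b d) acts first and (a b) leaves d unchanged.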

Ground Expressions. The syntax of expressions \(\bar{e}\) of the (ground) language \(\texttt{NLL}\) with recursive let is:

$$ \begin{array}{lcl} \bar{e}&:\,\!:=&a \mid \lambda a.\bar{e} \mid (f~\bar{e}_1~\ldots ~\bar{e}_{ ar (f)}) \mid (\texttt{letr}~a_1.\bar{e}_1; \ldots ; a_n.\bar{e}_n~\texttt{in}~\bar{e}) \end{array} $$

Ground expressions are either atoms, abstractions of an atom in an expression, function applications, or letrec expressions. We assume that the binding atoms \(a_1,\ldots ,a_n\) in a letrec-expression \((\texttt{letr}~a_1.\bar{e}_1; \ldots ;\) \(a_n.\bar{e}_n~\texttt{in}~\bar{e})\) are pairwise distinct. Sequences of bindings \(a_1.\bar{e}_1;\ldots ; a_n.\bar{e}_n~\) may be abbreviated as \( env \) (environment). The scope of atom a in \(\lambda a.\bar{e}\) is standard: a has scope \(\bar{e}\). The \(\texttt{letr}\)-construct has a special scoping rule: in \((\texttt{letr}~a_1.\bar{e}_1; \ldots ;a_n.\bar{e}_n~\texttt{in}~\bar{e})\), every atom \(a_i\) that is free in some \(\bar{e}_j\) or in \(\bar{e}\) is bound by the environment \(a_1.\bar{e}_1; \ldots ;a_n.\bar{e}_n\). This defines in \(\texttt{NLL}\) the notions of free atoms \( FA (\bar{e})\), bound atoms \( BA (\bar{e})\) of an expression \(\bar{e}\), and of all atoms \( AT (\bar{e})\) that occur in \(\bar{e}\). For an environment \( env = \{a_1.\bar{e}_1,\ldots ,a_n.\bar{e}_n\}\), we define the set of letrec-atoms as \( LA ( env ) = \{a_1,\ldots ,a_n\}\). We say a is fresh for \(\bar{e}\) iff \(a \not \in FA (\bar{e})\), denoted as \(a\#\bar{e}\).
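
As a concrete reading of these scoping rules, the following sketch (our own encoding, reusing Atom from the snippet above; constructor names are illustrative and the arity discipline of function symbols is not enforced) defines ground \(\texttt{NLL}\) expressions and computes free atoms, with the letrec binders scoping over both the environment bodies and the in-expression.

```haskell
import Data.List (nub, (\\))

-- Ground NLL expressions (a sketch).
data Exp
  = At Atom                  -- atom a
  | Lam Atom Exp             -- \a. e
  | App String [Exp]         -- (f e1 ... e_ar(f))
  | Letr [(Atom, Exp)] Exp   -- letr a1.e1; ...; an.en in e
  deriving (Eq, Show)

-- Free atoms FA: in a letrec, every binding atom ai is bound
-- both in all bodies ej and in the in-expression e.
fa :: Exp -> [Atom]
fa (At a)       = [a]
fa (Lam a e)    = fa e \\ [a]
fa (App _ es)   = nub (concatMap fa es)
fa (Letr env e) = nub (concatMap fa (e : map snd env)) \\ map fst env
```

For instance, fa (Letr [("a", At "b")] (At "a")) is ["b"], since a is bound by the environment.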

Remark 1

The base language \(\texttt{NLL}\) is a lambda calculus extended with function constants and a recursive let constructor \(\texttt{letr}\), and can also be interpreted as an untyped fragment of Haskell [7]. The function application operator in functional languages (implicit in some languages) can be encoded by a binary function app, and the case-construct in its plain form can be encoded as an application.

Example 1

The letrec-expression \((\texttt{letr}~a. cons ~\bar{e}_1~b; b.cons~ \bar{e}_2~ a~\texttt{in}~a)\) represents an infinite list \((cons~\bar{e}_1~(cons~\bar{e}_2~(cons~\bar{e}_1 ~(cons~\bar{e}_2~\ldots ))))\), where \(\bar{e}_1,\bar{e}_2\) are expressions and cons is the usual list constructor taken as a function symbol.
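
Read as (lazy) Haskell, of which the paper views \(\texttt{NLL}\) as an untyped fragment, Example 1 corresponds roughly to the following sketch; the name infList and the use of a where-clause for the recursive environment are ours.

```haskell
-- Two mutually recursive bindings building the infinite alternating list
-- cons e1 (cons e2 (cons e1 ...)) of Example 1.
infList :: a -> a -> [a]
infList e1 e2 = xs
  where
    xs = e1 : ys   -- corresponds to a . cons e1 b
    ys = e2 : xs   -- corresponds to b . cons e2 a
```

For example, take 5 (infList 1 2) evaluates to [1,2,1,2,1].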

Syntactic \(\alpha \)-equivalence on \(\texttt{NLL}\) is defined, following [16], as an extension of the usual \(\alpha \)-equivalence, where in addition the expressions \((\texttt{letr}~a_1.\bar{e}_1; \ldots ; a_n.\bar{e}_n~\texttt{in}~\bar{e})\) and \((\texttt{letr}~a'_1.\bar{e}'_1; \ldots ; a'_n.\bar{e}'_n~\texttt{in}~\bar{e}')\) are \(\alpha \)-equivalent iff they can be made syntactically equal by a correct renaming of their bound atoms, possibly after reordering the environment.

Definition 1

The \(\alpha \)-equivalence \(\sim _\alpha \) on \(\bar{e} \in \texttt{NLL}\) is defined as follows:

  • \(a \sim _\alpha a\) for atoms a.

  • If \(\bar{e}_i \sim _\alpha \bar{e}_i'\) for all i, then \((f ~\bar{e}_1 \ldots \bar{e}_n) \sim _\alpha (f ~\bar{e}_1' \ldots \bar{e}_n')\) for n-ary \(f \in \mathcal{F}\).

  • If \(\bar{e} \sim _\alpha \bar{e}'\), then \(\lambda a.\bar{e} \sim _\alpha \lambda a.\bar{e}'\).

  • If \(a\#\bar{e}'\) and \(\bar{e} \sim _\alpha (a~b)\cdot \bar{e}'\), then \(\lambda a.\bar{e} \sim _\alpha \lambda b.\bar{e}'\).

  • \((\texttt{letr}~a_1.\bar{e}_1; \ldots ; a_n.\bar{e}_n~\texttt{in}~\bar{e}) \sim _\alpha (\texttt{letr}~a_{\rho (1)}.\bar{e}_{\rho (1)}; \ldots ;a_{\rho (n)}.\bar{e}_{\rho (n)} ~\texttt{in}~\bar{e})\) for any permutation \(\rho \) of \(\{1,\ldots ,n\}\).

  • The following holds for a permutation \(\pi \) on atoms \(\{a_1,\ldots , a_n\}\cup \{ a_1', \ldots , a_n'\}\):

    figure a

    where, for \(i = 1,\ldots ,n\): \(a_i\)’s are pairwise distinct, and \(a_i'\)’s are pairwise distinct.

Permutations operate on \(\texttt{NLL}\)-expressions by recursing on their structure. For example, \(\pi \cdot (\texttt{letr}~a_1.\bar{e}_1; \ldots ;a_n.\bar{e}_n~\texttt{in}~\bar{e}) =(\texttt{letr}~\pi \cdot a_1.\pi \cdot \bar{e}_1; \ldots ;\pi \cdot a_n.\pi \cdot \bar{e}_n~\texttt{in}~\pi \cdot \bar{e})\).
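
The recursive action of a permutation on a ground expression can be sketched as follows (reusing Perm, applyPerm and Exp from the snippets above; for \(\texttt{NLL}_X\) one would additionally compose the permutation into suspensions).

```haskell
-- Apply a permutation to a ground expression by recursing over its structure.
permExp :: Perm -> Exp -> Exp
permExp p (At a)       = At (applyPerm p a)
permExp p (Lam a e)    = Lam (applyPerm p a) (permExp p e)
permExp p (App f es)   = App f (map (permExp p) es)
permExp p (Letr env e) =
  Letr [ (applyPerm p a, permExp p b) | (a, b) <- env ] (permExp p e)
```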

General Expressions. The syntax of the nominal higher-order language \(\texttt{NLL}_X\) with letrec and variables is:

$$ \begin{array}{lcl} e,s,t &:\,\!:=& a \mid \pi {\cdot }X \mid \lambda a.e \mid (f~e_1~\ldots ~e_{ ar (f)}) \mid (\texttt{letr}~a_1.e_1; \ldots ; a_n.e_n~\texttt{in}~e) \\ \pi &:\,\!:=& \emptyset \mid (a~b){\cdot }\pi \end{array} $$

General expressions extend \(\texttt{NLL}\) with suspensions, i.e., expressions of the form \(\pi \cdot X\), which denote a variable X (also called a generalization variable) on which a permutation is suspended: \(\pi \) is waiting for some instantiation of X before its action. The basic properties and functions of \(\texttt{NLL}\) such as \( FA (e)\), \( BA (e)\), scope, fresh, etc., extend to \(\texttt{NLL}_X\) as expected. In particular, \( AT (e)\) is extended to suspensions as \( AT (\pi \cdot X)=\{a \mid a\in dom (\pi )\}\). The suspension \( Id {\cdot }X\) is written simply as X. We define \({ Head}(s)\) as the top function symbol of s, which is in \(\{a,f,\lambda ,\texttt{letr}\}\) for non-variable expressions, and \({ Head}(\pi \cdot X) = X\) for suspensions. More generally, for a non-variable expression e, the expression \(\pi {\cdot }e\) denotes the operation of shifting \(\pi \) into the expression, using the additional simplification \(\pi _1 {\cdot } (\pi _2 {\cdot } e) \rightarrow (\pi _1 \circ \pi _2) {\cdot } e\); after the shift, permutations remain only in suspensions. For instance, \((a~c)\cdot (\texttt{letr}~ a.(\lambda b. X)~\texttt{in}~ f(a))\) denotes a renaming of a to c and vice-versa, which is equal to \((\texttt{letr}~ c.(\lambda b. (a~c)\cdot X)~\texttt{in}~ f(c))\).

An \(\texttt{NLL}_X\)-freshness constraint is an expression of the form \(a{{\#}}e\), expressing that a is not free in (or is fresh for) e, where e is an \(\texttt{NLL}_X\)-expression. A conjunction (or set) of freshness constraints is called a freshness context and is written using the notation \(\nabla ,\varDelta \). Every \(\texttt{NLL}_X\)-freshness context can be transformed into a simpler one (flattened form) by applying the rules in Fig. 1 exhaustively, until it consists only of constraints of the form \(a\#X\), which are called atomic, or contains \(\bot \) (fail). An \(\texttt{NLL}_X\)-freshness context \(\nabla \) is consistent if its flattened form does not contain \(\bot \). The definition of \(\alpha \)-equivalence extends to \(\texttt{NLL}_X\) as expected. In the following, \([s]_\alpha \) denotes the equivalence class of the expression s induced by the equivalence relation \(\sim _\alpha \).

Fig. 1. Simplification of freshness constraints in \(\texttt{NLL}_X\)

Lemma 1

Simplification using the rules of Fig. 1 constitutes a polynomial decision algorithm for satisfiability of \(\nabla \): if \(\bot \) occurs in the result, then \(\nabla \) is unsatisfiable; otherwise, it is satisfiable.
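
Lemma 1 can be read algorithmically. The sketch below is our own reconstruction of the standard nominal flattening rules in the spirit of Fig. 1 (the data type ExpX with suspensions and all names are ours, and the rules of Fig. 1 may differ in presentation): a constraint \(a\#e\) is reduced to atomic constraints \(a\#X\), or fails.

```haskell
type Var = String

-- NLL_X expressions: ground syntax plus suspensions pi.X (a sketch).
data ExpX
  = AtX Atom
  | Susp Perm Var
  | LamX Atom ExpX
  | AppX String [ExpX]
  | LetrX [(Atom, ExpX)] ExpX
  deriving (Eq, Show)

-- Flatten a single constraint a # e to atomic constraints; Nothing models bot.
flatten :: Atom -> ExpX -> Maybe [(Atom, Var)]
flatten a (AtX b)
  | a == b    = Nothing                                   -- a # a is inconsistent
  | otherwise = Just []
flatten a (Susp p x)  = Just [(applyPerm (invertPerm p) a, x)]  -- a # pi.X  ~>  pi^-1(a) # X
flatten a (LamX b e)
  | a == b    = Just []                                   -- a is bound by the lambda
  | otherwise = flatten a e
flatten a (AppX _ es) = concat <$> mapM (flatten a) es
flatten a (LetrX env e)
  | a `elem` map fst env = Just []                        -- a is bound by the environment
  | otherwise            = concat <$> mapM (flatten a) (e : map snd env)
```

A freshness context is then consistent iff flatten succeeds on all of its constraints.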

An \(\texttt{NLL}_X\)-substitution \(\rho \) is a finite mapping from generalization variables to \(\texttt{NLL}_X\)-expressions. Substitutions act on expressions homomorphically, and this action extends to freshness constraints and contexts as follows: \((a\#X)\rho \) holds iff \(a\# X\rho \) holds, and \(\nabla \rho =\{a\#e\rho \mid a\#e \in \nabla \}\). We will denote the domain of substitutions by \( dom (\cdot )\). A substitution is ground if it maps (generalization) variables to \(\texttt{NLL}\)-expressions. For a ground substitution \(\rho \), we call \(\nabla \rho \) valid iff \(\nabla \rho \) is consistent.

Permutations and Cycles. A cycle \(\tau \) in \(\mathbb {A}\) is a permutation represented by a sequence of distinct atoms \(a_1,a_2,\ldots ,a_n\), such that \(\tau (a_i) = a_{i+1}\) for \(i =1,\ldots ,n-1\) and \(\tau (a_n) = a_1\). As usual, such a cycle is denoted as \(\tau =(a_1\ a_2 \ \ldots \ a_n)\). Every permutation \(\pi \) has a representation \(\tau _1 \tau _2 \ldots \tau _n\) (which abbreviates \(\tau _1\circ \tau _2\circ \ldots \circ \tau _n\)) where the \(\tau _i\) are pairwise disjoint (primitive) cycles.

The disjoint cycles can be permuted. For instance, the permutation \((a\ b) (b\ d) (c \ e)\) has the cycle presentation \((a\ b \ d) (c \ e)\) which is the same as \( (c \ e) (a\ b\ d)\).
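
A cycle decomposition in the above sense can be sketched as follows, again over the Perm representation of the earlier snippet (trivial one-element cycles are dropped).

```haskell
-- Decompose a permutation into its disjoint (primitive) cycles.
cycles :: Perm -> [[Atom]]
cycles p = go (domPerm p)
  where
    go []         = []
    go (a : rest) = let c = orbit a
                    in c : go (filter (`notElem` c) rest)
    orbit a = a : takeWhile (/= a) (tail (iterate (applyPerm p) a))
```

For the permutation of the example above, cycles [("a","b"),("b","d"),("c","e")] yields [["a","b","d"],["c","e"]].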

2.1 Data-Structures of Anti-unification Algorithms

Anti-unification algorithms will produce as a result expressions that are restricted by a freshness context. These are called expressions-in-context and denoted as \((\nabla ,s)\), where \(\nabla \) is a freshness context and s is an \(\texttt{NLL}_X\)-expression.

The semantics of expressions-in-context follows the idea that the names of atoms used syntactically in an expression are fixed, while atoms occurring in \(\nabla \) but not in s are viewed as existentially quantified: they are treated as arbitrary names of atoms.

Definition 2

An expression-in-context is a pair \((\nabla ,e)\), where e is an expression and \(\nabla \) is a (consistent) freshness context. The semantics of \((\nabla ,e)\) is the set of ground instances of e that satisfy \(\nabla \), i.e.,

$$\llbracket (\nabla ,e) \rrbracket =\{[r]_\alpha \mid \exists \hat{\rho }: \forall a\in AT (e).~a\hat{\rho }=a \text{ and } [r]_\alpha = [e\hat{\rho }]_\alpha \text { and } \nabla \hat{\rho } \text { valid}\}$$

where \(\hat{\rho }\) is a mapping from \({ Var}\cup \mathbb {A}\) to ground expressions such that \(\hat{\rho }|_{\mathbb {A}}\) is a bijection on atoms.

The existential quantification on valid instances of expressions gives additional power to the semantics of expressions-in-context: by considering a as existentially quantified, we obtain that \(\llbracket (\{a \# X\},X) \rrbracket \) is the same as \(\llbracket (\emptyset ,X) \rrbracket \).

Example 2

Consider the expression-in-context \((\{a\#X\},f(X))\). We will argue that \(\llbracket (\{a \# X\},f(X)) \rrbracket = \llbracket (\emptyset ,f(X)) \rrbracket \). First, notice that a does not occur syntactically in f(X) and therefore we can take \(\hat{\rho }\) mapping a to an arbitrary atom that does not break validity of \(\nabla \). In fact:

  • It is obvious that \(\llbracket (\{a\#X\},f(X)) \rrbracket \subseteq \llbracket (\emptyset , f(X)) \rrbracket \), since the left one imposes more restrictions on its elements than the right one.

  • \(\llbracket (\emptyset , f(X)) \rrbracket \subseteq \llbracket (\{a\#X\},f(X)) \rrbracket \): Let \(\hat{\rho }\) be a bijection on atoms that is the identity on the atoms occurring in f(X) (there is none). Then we select \(a\hat{\rho }\) to be an atom not occurring in \(f(X)\hat{\rho }\), which trivially implies that \(a\hat{\rho }\#X\hat{\rho }\) holds.

Our semantics for \((\{a \# X\},X)\) differs from the one in Baumgartner et al. [3], where \(\llbracket (\{a \# X\},X) \rrbracket _B\) is the set of all ground instances of X in which a is not permitted to occur free. This induces the negative effect of properly infinite descending chains of expressions-in-context such as \(\ldots \prec _B (\{a\#X,b\#X\},f(X))\prec _B (\{a\#X\},f(X))\prec _B (\emptyset ,f(X))\), which is eliminated in our approach, since all these expressions-in-context have the same semantics.

Next we define an order relation on expressions-in-context which establishes when one expression-in-context is more general or more specific than another.

Definition 3 (Ordering, Generalization)

  • An expression-in-context \((\varDelta ,r)\) is more specific (or less general) than an expression-in-context \((\nabla ,s)\), denoted \((\nabla ,s) \preceq (\varDelta , r)\), if \(\llbracket (\varDelta ,r) \rrbracket \subseteq \llbracket (\nabla ,s) \rrbracket \). The strict part of \(\preceq \) is denoted \(\prec \). This defines equivalence of two expressions-in-context via their semantics: \((\nabla ,s) \approx (\nabla ',t)\) iff \(\llbracket (\nabla ,s) \rrbracket =\llbracket (\nabla ',t) \rrbracket \).

  • An expression-in-context \((\varDelta ,r)\) is a generalization of \((\nabla ,s)\) and \((\nabla ',t)\), if \((\varDelta ,r) \preceq (\nabla ,s)\) and \((\varDelta ,r) \preceq (\nabla ',t)\).

  • A generalization \((\varDelta ',r')\) of \((\nabla ,s)\) and \((\nabla ',t)\) is the most specific (the least general) one, if for all generalizations \((\varDelta ,r)\) of \((\nabla ,s)\) and \((\nabla ',t)\), we have \((\varDelta ,r) \preceq (\varDelta ',r')\).

For instance, the expression-in-context \((\emptyset , \lambda e. app (e,X))\) is a generalization of \((\emptyset ,\lambda a. app(a,c))\) and \((\emptyset , \lambda b.app(b,Z))\), for a new atom e. It is easy to verify that \((\emptyset , \lambda e.app(e,X)) \preceq (\emptyset , \lambda a.app(a,c))\) and \((\emptyset , \lambda e.app(e,X))\preceq (\emptyset , \lambda b.app(b,Z))\).

3 The Anti-unification Problem for \(\texttt{NLL}_X\)

We are interested in the anti-unification problem for \(\texttt{NLL}_X\):

Given two expressions-in-context \((\nabla ,s)\) and \((\nabla ,t)\),

Find a least general generalization, i.e., another expression-in-context \((\varDelta ,r)\) that satisfies \((\varDelta ,r) \preceq (\nabla ,s)\) and \((\varDelta ,r) \preceq (\nabla ,t)\).

The challenge in treating letrec-expressions in anti-unification algorithms is, on the one hand, their unusual scoping and, on the other hand, the multiple possibilities to formulate the same problem in syntactically different ways.

Remark 2

[Permutations in the generalization of suspensions]. Generalization of suspensions, say \((\emptyset ,\pi _1{\cdot }Z)\) and \( (\emptyset ,\pi _2{\cdot }Z)\), needs some preparation based on properties of permutations: first, we decompose \(\pi _1\) and \(\pi _2\) into their cycle presentations, say \(\pi _1=\mu _1 \ldots \mu _n \) and \(\pi _2=\mu _1' \ldots \mu _m'\); second, we generalize \((\emptyset ,\mu _1\ldots \mu _n \cdot Z)\) and \((\emptyset , \mu _1'\ldots \mu '_m{\cdot }Z)\) as follows: let \(\pi _3\) be the permutation obtained from the set of common cycles of \(\pi _1\) and \(\pi _2\), say \(\pi _1 = \pi _3\pi _1'\) and \(\pi _2 = \pi _3\pi _2'\). Then \(\pi _3\cdot X\) is a generalization of \((\emptyset , \pi _1\cdot Z)\) and \((\emptyset ,\pi _2\cdot Z)\). In the following we denote the common cycles of permutations \(\pi _1\) and \(\pi _2\) by \(\pi _1 \cap \pi _2\). This will be addressed in detail with the specific rule for suspensions in Fig. 2.
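
The common cycles \(\pi _1 \cap \pi _2\) used in Remark 2 can be computed directly from the cycle decomposition sketched earlier; a cycle is shared if it occurs in both decompositions up to rotation (a sketch with our own names).

```haskell
-- Two cycles denote the same permutation iff one is a rotation of the other.
sameCycle :: [Atom] -> [Atom] -> Bool
sameCycle c d = length c == length d && d `elem` rotations c
  where rotations xs = [ drop k xs ++ take k xs | k <- [0 .. length xs - 1] ]

-- pi1 ∩ pi2: the cycles of pi1 that also appear as cycles of pi2.
commonCycles :: Perm -> Perm -> [[Atom]]
commonCycles p1 p2 = [ c | c <- cycles p1, any (sameCycle c) (cycles p2) ]
```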

3.1 The Algorithm AntiUnifLetr and Its Rules

We first define the nominal generalization algorithm AntiUnifLetr that (nondeterministically) computes a single generalization of the input expressions, where the generalization can also be nonlinear in the generalization variables due to merging. We will argue that the algorithm is sound and weakly complete, and one run can be performed in polynomial time.

The data structure of the algorithm AntiUnifLetr is \((\varGamma ,M,\nabla ,L)\) where:

  • \(\varGamma \) is a set of generalization triples of the form \(X : s \triangleq t\), where X is a fresh (generalization-) variable, and s, t are \(\texttt{NLL}_{X}\)-expressions;

  • M is a set of solved generalization triples;

  • \(\nabla \) is a set of freshness constraints, without freshness constraints for the fresh generalization variable for the input generalization triple;

  • L is a substitution represented as a set of bindings; the empty set is []. The result of applying the substitution L on the generalization variable X is denoted as \(X\circ L\).

We call such a tuple a state. The rules of the algorithm AntiUnifLetr, given in Fig. 2, operate on states, where \(\uplus \) denotes disjoint union. Given two \(\texttt{NLL}\) expressions s and t, and a freshness context \(\varDelta \) (possibly empty), to compute generalizations for \((\varDelta , s)\) and \((\varDelta , t)\) we start with the initial state \((\{X:s\triangleq t\}; \emptyset ;\varDelta ; [] )\) (sometimes abbreviated to \((\varDelta , \{X:s\triangleq t\})\)), where X is a fresh generalization variable, and we apply the rules from Fig. 2 and Fig. 4 until no more rule applications are possible and we reach a final state, which has the form \((\emptyset ; M; \nabla ; L)\), where M must be completely merged. We denote a computation from the initial to a final state by \((\varGamma ; \emptyset ; \varDelta ; [] ) \Longrightarrow ^* (\emptyset ; M; \nabla ; L)\).

The output is an expression-in-context obtained from the generated substitution L and the final freshness context \(\nabla \), i.e. the output is \((\nabla , X\circ L)\), also called the result computed by the AntiUnifLetr algorithm. We say the algorithm is complete if every least general generalization (lgg) is found, and weakly complete if every lgg is found up to some set of freshness constraints.

Fig. 2. Rules of the algorithm AntiUnifLetr

Fig. 3. The permutation matching (sub-)algorithm Eqvm

Fig. 4. Rules for letrec of the algorithm AntiUnifLetr

Rules in Fig. 2 are similar to the ones in [3], but without the parameter for the set of atoms occurring in the initial state and throughout the computation; they deal with abstractions, function applications, and suspensions. The subalgorithm Eqvm, defined by the rules in Fig. 3, computes a matching permutation, say \(\pi \), of two expressions-in-context (say \(s\preceq t\) in \(\varPsi \) with context \(\nabla \)), where EqvBiEx(\(\varPi \)) checks whether the set of swappings is injective and then adds a minimal set of mappings such that the result is a bijection, i.e. a permutation (on atoms). The rules in Fig. 4 are new and are described in detail below:

  • Rule \((\texttt{Letraa})\) acts as a decomposition rule for the letr construct and can only be applied if the binding atoms in both environments coincide, respecting the given order.

  • Rule \((\texttt{Letrperm})\) is branching and exhaustively tries to generalize the expressions by considering all permutations of the letr environment.

  • Rule \((\texttt{Letrab})\) deals with renaming of bound names; it consistently swaps the binding atoms of the letr environment with fresh names and propagates the obtained permutation throughout both expressions.

The latter rule exploits the following idea: if \(\lambda a.s\) and \(\lambda b.t\) are \(\alpha \)-equivalent, then one can rename a and b with the same fresh name c and propagate the renaming within s and t and still obtain \(\alpha \)-equivalent expressions.

Example 3

A generalization for the expressions-in-context \((\emptyset , \texttt{letr}\ a.a; b.c ~\texttt{in}~ f(a,b))\) and \( (\emptyset , \texttt{letr}\ b.a; c. c\ \texttt{in}\ f(a,b))\) is computed as follows:

  1. We cannot apply rule \((\texttt{Letraa})\) since the binding atoms in the environments do not correspond to each other. We may rearrange the bindings using \((\texttt{Letrperm})\). Then we apply rule \((\texttt{Letrab})\) for renaming: we choose d, e as fresh atoms and use the renamings (a d)(b e) and (c d)(b e), which leads to the check \(\nabla ' = \{d,e\#(\texttt{letr}~a.a;b.c~\texttt{in}~f(a,b))\} \cup \{d,e\#(\texttt{letr}~c.c;b.a~\texttt{in}~f(a,b))\}\), which holds and evaluates to \(\emptyset \), since the terms are ground. After an application of \((\texttt{Letraa})\), which decomposes the letrec environments:

    figure c
  2. After three applications of \((\texttt{Dec})\), one \((\texttt{Solve})\) and one \((\texttt{Mer})\) we obtain \(( \emptyset ,\{X_2:c\triangleq a\},\emptyset ,\{X\mapsto \texttt{letr}~ d.d;e.X_2~\texttt{in}~ f((c\ d)\cdot X_2,e)\})\). The output generalization is \((\emptyset , \texttt{letr}\ d.d; e.X_2\ \texttt{in}\ f((c\ d)\cdot X_2,e))\).

Another Solution: from \((X: \texttt{letr}\ a.a; b.c\ \texttt{in}\ f(a,b) \triangleq \texttt{letr}\ b.a; c. c\ \texttt{in}\ f(a,b))\) we could have immediately applied the rule \((\texttt{Letrab})\) using \(\pi _1=(a~d)(b~e)\) for the left and \(\pi _2=(b~d) (c~e)\) for the right expression. This finally leads to a generalization of the form \(\texttt{letr}~ d.X_1;e.X_2~\texttt{in}~f(X_3,X_4)\) which is “weaker” (too general) than the one above.

Note that the environment of one of the expressions to be generalized contains garbage: the binding c.c is not used in f(a, b).

Theorem 1

The algorithm AntiUnifLetr is terminating and sound. A single run requires polynomial time. The overall computation requires exponential time and may compute an exponential number of generalizations.

Proof

Soundness and termination can be easily checked by inspection of the rules of Figs. 2, 4 and 3. The number of nondeterministic alternatives is exponential in the worst case, and it is induced by the rule \((\texttt{Letrperm})\). A single run (one branch) can be performed in polynomial time.

Notice that, except for rule \((\texttt{Letrab})\), all the rules of the AntiUnifLetr algorithm preserve the context \(\nabla \). This differs from the approach taken in [3], which might add new freshness constraints with a rule similar to our rule \((\texttt{SolveYY})\), based on a set A of all atoms appearing throughout the computation of a generalization. We show in the next example that this choice of initially preserving the freshness context leads to a weak completeness result, but completeness is regained with a specialization algorithm that will be presented next.

Example 4 (Weak Completeness)

The expressions-in-context \((\emptyset , f(c_1,a))\) and \((\emptyset ,f(c_2,a))\) have the generalization \((\emptyset , f(X_1,a))\) computed by the rules of Fig. 2. However, this is not the lgg since \((\{a\#X_1\}, f(X_1,a))\) is a more specific generalization. In fact, \(f(a,a)\in \llbracket (\emptyset , f(X_1,a)) \rrbracket \), but \(f(a,a)\notin \llbracket (\{a\#X_1\}, f(X_1,a)) \rrbracket \).

Theorem 2 (Weak Completeness)

Given \(\texttt{NLL}_X\) expressions \(e\) and \(e'\), and a freshness context \(\varDelta \). If \((\nabla ', r)\) is a generalization of \((\varDelta ,e)\) and \((\varDelta ,e')\), then there exist a \(\nabla ''\) and a derivation \((\{X:e\triangleq e'\},\emptyset ,\varDelta ,[])\Longrightarrow ^* (\emptyset ,M,\nabla ,\sigma )\) such that \((\nabla \cup \nabla '',X\sigma )\) is a generalization of \((\varDelta ,e)\) and \((\varDelta ,e')\) and \((\nabla ', r) \preceq (\nabla \cup \nabla '',X\sigma )\).

Proof

The proof is by induction on the structure of r.

Example 5

(Cont. Example 4). We remark on another behaviour that can be seen from the execution of AntiUnifLetr: \((\{X{:}f(c_1,a)\triangleq f(c_2,a)\}, \emptyset ,\emptyset ,[])\) reduces to \( (\emptyset , \{X_1{:}c_1\triangleq c_2\},\emptyset ,\{X\mapsto f(X_1,a)\})\). Notice that (i) f(a, a) is clearly not an element of \( \llbracket (\emptyset ,f(c_1,a)) \rrbracket \) nor of \( \llbracket (\emptyset ,f(c_2,a)) \rrbracket \); (ii) the information that \(c_1\) and \(c_2\) were free names in the input problem was “forgotten” by the generalization \(f(X_1,a)\), but it can be retrieved from the solved triple in the final state; (iii) \(a\#c_1\) and \(a\#c_2\) hold trivially.

3.2 From Weak Completeness to Completeness

Given a result \((\nabla ,s)\) of a run of the algorithm AntiUnifLetr, this result is in general only weakly complete, since the expressivity of the language may permit a more specific generalization: the true most specific generalization may have additional freshness constraints, as was shown in Example 4. The problem of specializing the generalizations output by AntiUnifLetr is subtle: a different but related behaviour can be seen in the next example.

Example 6

Consider the expressions-in-context \((\emptyset , f(g(c_1,a),a))\) and \((\emptyset , f(c_2,a))\) as input for AntiUnifLetr. The output generalization is \((\emptyset , f(X_1,a))\), and this is the lgg. In fact, a run of the algorithm would terminate with the final state \((\emptyset ,\{X_1{:}g(c_1,a)\triangleq c_2\},\emptyset ,\{X\mapsto f(X_1,a)\})\).

We can use the information in the solved part of the final state to build the substitutions \(\sigma _1=\{X_1\mapsto g(c_1,a)\}\) and \(\sigma _2=\{X_1\mapsto c_2\}\) that instantiate the generalization \(f(X_1,a)\) back to the input terms. Notice that \(a\#X_1\sigma _1\) is equal to \(a\#g(c_1,a)\) and does not hold. Thus, we cannot add \(\{a\#X_1\}\) as a constraint to the generalization, since \((\{a\#X_1\}, f(X_1,a))\) cannot be instantiated to \(f(g(c_1,a),a)\).

Let \(\gamma =(\emptyset ;M;\nabla ;L)\) be a final state. We define \(AT_f(\gamma )\) as the set of unbound atoms that occur in \(M,\nabla \) or codom(L). We say that a generalization variable X occurs in \(\gamma \) when it occurs in \(\nabla \), or as a subterm in M, or in codom(L).

Definition 4 (Relevant Atoms)

Let \(\gamma =(\emptyset ;M;\nabla ;L)\) be a final state in a run of AntiUnifLetr. Let X be a generalization variable occurring in \(\gamma \). The set of relevant atoms for X, denoted \({RelAtoms}_\gamma (X)\), is defined recursively:

  • If there is no solved triple for X in M, then the relevant atoms are \(RelAtoms_\gamma (X)=AT_f(\gamma ){\setminus }\{a\mid a\#X \in \nabla \}\), i.e., all atoms that are not bound and occur syntactically in the state, except the atoms that are excluded by the freshness constraints in \(\nabla \).

  • If there is a solved triple \(X:s\triangleq t\in M\), then \( RelAtoms _\gamma (X) = RelAtoms _\gamma (s) \cup RelAtoms _\gamma (t)\), where \( RelAtoms _\gamma \) is defined recursively on the structure of expressions:

    • \( RelAtoms _\gamma (a) = \{a\}\), \( RelAtoms _\gamma (f~s_1 \ldots s_n) = \bigcup _i RelAtoms _\gamma (s_i)\);

    • \( RelAtoms _\gamma (\pi {\cdot }s)= \pi {\cdot } RelAtoms _\gamma (s)\);

    • \( RelAtoms _\gamma (\lambda a.s) = RelAtoms _\gamma (s) {\setminus } \{a\}\); and

    • \(RelAtoms_\gamma (\texttt{letr}~ a_1.s_1;\ldots ;a_n.s_n~\texttt{in}~ r)=RelAtoms_\gamma (s_1,\ldots , s_n,r) {\setminus } \{a_1,\ldots ,a_n\}\).

For example, if we take \(M=\{X{:}f(a,b) \triangleq g((a\ c){\cdot }Y),Y{:}f(c,d) \triangleq g(e)\}\) and \(\nabla =\{a \# Y\}\), then the set of relevant atoms for Y is \(\{c,d,e\}\), and for X it is \(\{a,b\} \cup (a~c)\{c,d,e\} = \{a,b,d,e\}\), where it is noteworthy that atom c is missing.
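
A sketch of Definition 4 as a function over final states is given below; we represent the solved triples M as a map from variables to pairs of expressions, \(\nabla \) as a list of atomic constraints, and pass \(AT_f(\gamma )\) explicitly (this encoding is ours and reuses the types and imports of the earlier snippets).

```haskell
import qualified Data.Map as Map
import Data.List (nub, (\\))

-- Relevant atoms of a generalization variable in a final state (Definition 4).
relAtomsVar :: Map.Map Var (ExpX, ExpX)  -- solved triples M
            -> [(Atom, Var)]             -- freshness context nabla
            -> [Atom]                    -- AT_f(gamma)
            -> Var -> [Atom]
relAtomsVar m nabla atf x =
  case Map.lookup x m of
    Nothing     -> atf \\ [ a | (a, y) <- nabla, y == x ]
    Just (s, t) -> nub (relAtoms s ++ relAtoms t)
  where
    relAtoms (AtX a)       = [a]
    relAtoms (AppX _ es)   = nub (concatMap relAtoms es)
    relAtoms (Susp p y)    = map (applyPerm p) (relAtomsVar m nabla atf y)
    relAtoms (LamX a e)    = relAtoms e \\ [a]
    relAtoms (LetrX env e) = nub (concatMap relAtoms (e : map snd env)) \\ map fst env
```

On the example above (M with the two triples for X and Y, \(\nabla =\{a \# Y\}\)), this yields {c,d,e} for Y and {a,b,d,e} for X.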

We formulate a postprocessing algorithm (Algorithm 1) for AntiUnifLetr which is able to compute least general generalizations.

figure d
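
The following sketch shows our reading of such a specialization step, derived from Definition 4 and Example 7; it is not necessarily the exact formulation of Algorithm 1. For every generalization variable of the result and every unbound atom of the final state that is not relevant for it, an atomic freshness constraint is added.

```haskell
-- Post-processing sketch: add a # X whenever atom a occurs (unbound) in the
-- final state but is not a relevant atom of X.
specialize :: Map.Map Var (ExpX, ExpX) -> [(Atom, Var)] -> [Atom] -> [Var] -> [(Atom, Var)]
specialize m nabla atf vars =
  nabla ++ [ (a, x) | x <- vars
                    , a <- atf
                    , a `notElem` relAtomsVar m nabla atf x
                    , (a, x) `notElem` nabla ]
```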

Theorem 3

Adding (Algorithm 1) makes AntiUnifLetr complete.

Note, however, that due to the non-determinism, it is possible that one of the runs generates a generalization that is strictly less specific (i.e. more general) than the result of another run; see Example 3.

Example 7

This example shows the result of generalizing more complex expressions. Consider the generalization problem, and the sequence of generalization steps, where the last step abbreviates several steps.

figure e

Now the resulting lgg can be computed by adding only one freshness constraint: \((\{g\# X_3\}, \lambda e. f(e,X_3,c))\). This holds, since \(d\in { RelAtoms _\gamma (X_3)}\), and hence does not occur in the freshness context. Notice that \(c\#X_3\) is added as a freshness constraint since c occurs in the generalization expression, but \(c\notin RelAtoms_\gamma (X_3)\).

4 Generalization Algorithm Under Semantic Equalities

We use semantic equivalences to specialize and extend our anti-unification algorithm to ground expressions. In particular, we exploit the fact that removal of garbage is semantically correct: it does not alter the meaning of the program. First, we develop a standardization algorithm for garbage-free expressions that helps in comparing the letrec-expressions and computing generalizations in polynomial time. Second, we propose a variation of our anti-unification algorithm called AntiUnifNoGarbage.

\(\texttt{NLL}\)-expressions may contain irrelevant bindings in the letrec environment: for instance, in \((\texttt{letr}~a.\texttt {Nil}; b.b ~\texttt{in}~f(a,a))\), the binding b.b is useless for the expression and will be considered as garbage. Garbage bindings do not contribute to the meaning of functional expressions. It is shown in [18] that \(\alpha \)-equivalence of garbage-free letrec-expressions can be checked in polynomial time, and that, in general, this problem is graph-isomorphism-complete [2, 20].

Definition 5

Let \(\bar{e}\) be an \(\texttt{NLL}\)-expression. We say that \(\bar{e}\) contains garbage iff there is a subexpression \((\texttt{letr}~a_1.\bar{e}_1,\ldots ,a_n.\bar{e}_n~\texttt{in}~\bar{e}')\) of \(\bar{e}\) such that the environment \(a_1.\bar{e}_1,\ldots ,a_n.\bar{e}_n\) can be split into two nonempty sub-environments \(a_{i_1}.\bar{e}_{i_1},\ldots ,a_{i_k}.\bar{e}_{i_k}\) and \(a_{j_1}.\bar{e}_{j_1},\ldots ,a_{j_{k'}}.\bar{e}_{j_{k'}}\) such that the binding atoms \(a_{i_1},\ldots ,a_{i_k}\) do not occur free in \(\texttt{letr}~ a_{j_1}.\bar{e}_{j_1},\ldots ,a_{j_{k'}}.\bar{e}_{j_{k'}}~\texttt{in}~\bar{e}'\). We say that \(\bar{e}\) is garbage-free (or garbage-collected) iff it does not contain garbage.

Making an expression garbage-free may require an iterated removal of garbage, using the garbage removal rewriting rules below:

$$\begin{aligned} (\textbf{gr}1)~&\texttt{letr}~ a_1.e_1;\ldots ;a_n.e_n;b_1.e'_1;\ldots ; b_m.e'_m ~\texttt{in}~ e'_{m+1} \longrightarrow \\&\texttt{letr}~ b_1.e'_1;\ldots ; b_m.e'_m ~\texttt{in}~ e'_{m+1}, \text { if } \textstyle \bigcup _{i=1}^{m+1} FA (e'_i)\cap \{a_1,\ldots , a_n\}=\emptyset \\ (\textbf{gr}2)~&\texttt{letr}~ a_1.e_1;\ldots ;a_n.e_n~\texttt{in}~ e \longrightarrow e, \text { if } FA (e)\cap \{a_1,\ldots , a_n\}=\emptyset \end{aligned}$$
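
A sketch of iterated garbage removal on ground expressions in the spirit of (gr1) and (gr2) is given below, reusing Exp, fa, and the Data.List imports from the earlier snippets; the fixpoint computation keeps exactly the bindings reachable from the free atoms of the in-expression.

```haskell
-- Remove garbage bindings bottom-up: at every letrec keep only the bindings
-- that are (transitively) needed by the in-expression.
gc :: Exp -> Exp
gc (At a)       = At a
gc (Lam a e)    = Lam a (gc e)
gc (App f es)   = App f (map gc es)
gc (Letr env e) =
  let e'   = gc e
      env' = [ (a, gc b) | (a, b) <- env ]
      keep = reach (fa e') env'
  in if null keep then e' else Letr keep e'   -- corresponds to (gr2) if nothing is kept
  where
    reach needed bs =
      let used = [ (a, b) | (a, b) <- bs, a `elem` needed ]
          new  = nub (needed ++ concatMap (fa . snd) used)
      in if length new == length needed then used else reach new bs
```

On the example above, gc turns letr a.Nil; b.b in f(a,a) into letr a.Nil in f(a,a), removing the useless binding b.b.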

We illustrate our ideas on the generalization of garbage-free expressions. Note that the equality of expressions that is used makes a notable difference for the results as well as for the algorithmic steps.

Example 8

Let \(\bar{s} = \texttt{let}~c.a~\texttt{in}~f(g(c))\) and \(\bar{t} = \texttt{let}~d.b~\texttt{in}~f(h(d))\) be two garbage-free ground expressions. A generalization of \(\bar{s}\) and \(\bar{t}\) w.r.t. \(\sim _\alpha \) is \(\bar{s'} = \texttt{let}~ c.X_1~\texttt{in}~f(X_2)\), which is also an lgg. If we allowed more equalities on the expressions, like \(\sim _{gc}\) as a part of the equality, or even an equality \(\sim _{\alpha ,gc,letcp}\) that also allows copying let-bindings, then \(\bar{s}\) would be equivalent to f(g(a)) and \(\bar{t}\) to f(h(b)), which have f(X) as a generalization. The generalization algorithm, however, would be much more complex.

The next step is to standardize the sequence of bindings in garbage-collected expressions, which greatly supports further operations.

Standardization Algorithm. Let \(\texttt{let}~a_1.\bar{e}_1;\ldots ; a_n.\bar{e}_n~\texttt{in}~\bar{e}\) be a garbage-free \(\texttt{NLL}\)-expression. Then rearrange the bindings as follows:

  1. Let \(a_j\) be the atom from \(\{a_1,\ldots ,a_n\}\) that has the earliest occurrence as a free atom in the expression \(\bar{e}\), in its printed string. Then select \(a_j.\bar{e}_j\) as the leftmost binding of the new environment, i.e. \(r_0 = \bar{e}\) and \(r_1 = \texttt{letr}~a_j.\bar{e}_j~\texttt{in}~\bar{e}\).

  2. Iterate this to compute \(r_k\) from \(r_{k-1}=\texttt{letr}~env_{k-1}~\texttt{in}~\bar{e}\) by selecting, among the binding atoms \(a_{j'}\in \{a_1,\ldots , a_n\}\) not yet selected, again the one which occurs free first in the printed string of \(r_{k-1}\), and then add \(a_{j'}.\bar{e}_{j'}\) as the leftmost binding of the letr-environment, obtaining \(r_k=\texttt{letr}~ a_{j'}.\bar{e}_{j'};env_{k-1}~\texttt{in}~\bar{e}\).

These steps are applied iteratively: apply them to the smallest subexpression \(\bar{e}'\) of \(\bar{e}\) which is not yet correctly arranged. The result is the gc-standardized expression \(\bar{e}_{gcst}\) of \(\bar{e}\).

Example 9

Consider the garbage-free expression \(\texttt{let}~ a.app(b,\lambda c.c);b.\lambda d.d~\texttt{in}~ a\), where app is a binary function symbol for denoting the usual application of the lambda calculus. The standardization algorithm returns the gc-standardized expression \(\texttt{let}~ b.\lambda d.d; a.app(b,\lambda c.c)~\texttt{in}~ a\).
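
The rearrangement step can be sketched for a single letrec as follows (our simplification: the "first free occurrence in the printed string" is approximated by the first occurrence of the atom in a left-to-right traversal, and the recursive treatment of subexpressions is omitted; Exp is the type from Section 2).

```haskell
import Data.List (minimumBy, delete, elemIndex)
import Data.Maybe (fromMaybe)
import Data.Ord (comparing)

-- Atoms of an expression in left-to-right (printed) order, with repetitions.
atomsInOrder :: Exp -> [Atom]
atomsInOrder (At a)       = [a]
atomsInOrder (Lam a e)    = a : atomsInOrder e
atomsInOrder (App _ es)   = concatMap atomsInOrder es
atomsInOrder (Letr env e) = concatMap (\(a, b) -> a : atomsInOrder b) env ++ atomsInOrder e

-- Rearrange the bindings of one letrec: repeatedly pick the not-yet-selected
-- binding atom that occurs first in the partial result r_{k-1}, and put its
-- binding leftmost.
standardizeTop :: Exp -> Exp
standardizeTop (Letr env e) = Letr (build env []) e
  where
    build []        acc = acc
    build remaining acc =
      let r    = Letr acc e                                   -- r_{k-1}
          pick = minimumBy (comparing (firstOcc r . fst)) remaining
      in build (delete pick remaining) (pick : acc)
    firstOcc r a = fromMaybe maxBound (elemIndex a (atomsInOrder r))
standardizeTop e = e
```

Applied to the expression of Example 9, this yields exactly the gc-standardized expression shown there.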

Proposition 1

For every garbage-free \(\texttt{NLL}\)-expression \(\bar{e}\), the gc-standardized expression \(\bar{e}'\) of \(\bar{e}\) satisfies \(\bar{e} \sim _\alpha \bar{e}'\), and the sequence of bindings in all its letrec environments is unique, i.e. it has a fixed ordering. The computation can be done in polynomial time.

Proof

Garbage collection is polynomial: after every step the expression becomes smaller, and a single step of detecting a set of redundant bindings is also polynomial. The rearrangement can also be done first for subexpressions of smaller size, and a single rearrangement of the top bindings takes polynomial time.

Fig. 5. Different lengths of letrec-environments in AntiUnifLetr

4.1 Anti-unification of Garbage-Free Expressions

In this and the next subsection on generalization we will use a syntactically fixed ordering of bindings in let environments, and denote this as \(\texttt{letf}\).

AntiUnifLetr is adapted to the ground situation in several aspects: (i) there are no freshness constraints; (ii) expressions are first gc-standardized; (iii) we permit that \(n \ge 2\) expressions are generalized in one step; (iv) in a set of expressions to be generalized, we make all top-level letrec environments the same (minimal) length by adding bindings a.a for fresh atoms a; and (v) we fix the sequence of bindings in a \(\texttt{let}\), indicated by \(\texttt{letf}\).

We remark that an iterated generalization of pairs (i.e., to generalize \(s_1, s_2\) and \(s_3\), one first generalizes \(s_1\) and \(s_2\), and from the result, say r, one repeats the generalization process with r and \(s_3\)) has the disadvantage that from the second step on, after the first rule application, there are generalization variables, so the semantic properties get lost; this means that, e.g., the standardization is no longer usable, and the method no longer works properly in the subsequent generalization steps.

Therefore, for generalizing more than two expressions, the adopted data structure is as follows: a state has the form \((\{X{:} s_1 \triangleq \ldots \triangleq s_n\};M;\nabla ;L)\), and we use generalization tuples of the form \(X{:} s_1 \triangleq \ldots \triangleq s_n\) to denote that X is a variable generalizing the expressions \(s_1,\dots , s_n\). Examples of the modified rules are

figure f

Thus, we adapt the rules of AntiUnifLetr: the algorithm accepts \(n \ge 2\) ground expressions; the permutation rule \((\texttt{Letrperm})\) is inactive due to fixing the ordering of bindings; merging is supported; and the subalgorithms Eqvm and EqvBiEx become almost trivial and are applied to larger tuples. Also the sequence of bindings in lets is fixed. All these adaptations can be done within polynomial complexity.

These explanations suggest the algorithm AntiUnifNoGarbage, for \(n \ge 2\) (ground) arguments, operating on a triple \((\varGamma ,M,L)\). It is defined non-deterministically, but only one run is performed.

Example 10 (Fixed letr bindings)

Generalizing the garbage-collected expressions \(\texttt{let}~a'.a; b'.b; c'.c\) \(\texttt{in}\) \(f(g(a',b',c'))\) and \(\texttt{let}~ a'.b; b'.c; c'.a ~\texttt{in}~f(h(a',b',c'))\) produces \(\texttt{let}~ a'.a; b'.b; c'.c ~\texttt{in}~ f(X)\), since bindings can be rearranged, which requires exponential complexity for trying the rearrangements. If instead we fix the sequence of bindings and generalize, then the algorithm requires only polynomial time for this step: for \(\texttt{letf}~ a'.a; b'.b; c'.c~\texttt{in}~f(g(a',b',c'))\) and \(\texttt{letf}~ a'.b; b'.c; c'.a ~\texttt{in}~f(h(a',b',c'))\), we obtain \(\texttt{letf}~ a'.X_1; b'.X_2; c'.X_3 ~\texttt{in}~ f(X)\).

Theorem 4

Algorithm AntiUnifNoGarbage is sound, terminating and complete. It will compute a single least general generalization in polynomial time.

Proof

(Sketch). The main argument is that if no rule applies, then the result is already a generalization. Second, every applied rule keeps the semantics, i.e., does not lose information. The complexity has two components: one is the preparation of the input, which is polynomial. The second part is the test and computation of every rule, which is polynomial since there are no \(\nabla \)-sets, and the execution of every rule requires polynomial time in the input size. Moreover, the size of the problem is decreased in every step.

4.2 Exploiting Semantic Equalities

Since we focus on applications of the algorithms to (functional) higher-order programming languages, it makes sense to take more semantic equations and properties into account in order to recognize semantic equality of syntactically different expressions, which improves the power of the generalization algorithms.

Since there are various approaches to and definitions of semantics, like variants of contextual equivalence or bisimulation [9, 14, 15, 19], and we want to be consistent with most of them, we only investigate equalities that are correct in a majority of the cases. By “cases” we mean different programming languages permitting \(\texttt{letr}\), but with different operational and equational semantics.

The following equalities, which are semantically correct in languages with letrec, are expressed as rewrite rules and could also be used for further standardization of expressions; we assume that there are no conflicts with variable names.

  1. \(x.f(s_1,\ldots ,s_n) \rightarrow x.f(y_1,\ldots ,y_n); y_1.s_1;\ldots ;y_n.s_n\)

  2. \(\texttt{let}~(x = \texttt{letr}~ env ~\texttt{in}~r); env '~\texttt{in}~ s \rightarrow \texttt{let}~x = r; env ; env ' ~\texttt{in}~ s\)

  3. \(\texttt{let}~ env ~\texttt{in}~ (\texttt{let}~ env ' ~\texttt{in}~ s) \rightarrow \texttt{let}~ env ; env '~\texttt{in}~ s\).

  4. \(f~(\texttt{let}~ env ~ \texttt{in}~ s_1) ~s_2 \rightarrow \texttt{let}~ env ~ \texttt{in}~ (f~s_1 ~s_2)\).

Note that these equalities, if used to standardize expressions, keep the polynomial complexity of generalization of ground expressions.

5 Conclusion and Future Work

We formulated an anti-unification algorithm for expressions in a functional higher-order language with a let constructor that has mutually recursive bindings. We constructed a weakly complete anti-unification algorithm that in the general case is finitary, and which is made complete by a post-processing step. In the worst case, the time for the computation as well as the number of generalizations are exponential.

If the expressions are restricted to be ground and garbage-free, then the problem becomes unitary and the computation is polynomial. These properties make the method more friendly to applications. We also considered modifications of the generalization algorithm for functional programming languages with letr that have a wider coverage, by abstracting from syntactical details and by observing semantic equalities.

Further work is to generalize algorithms to other patterns and to experiment with the generalization method in practice.