1 Introduction

This work describes a new version (0.10.0, released on 2023.04.25) of a previously published Python package, gym-saturation [28]: a collection of OpenAI Gym [6] environments for guiding saturation-style provers (based on the given clause algorithm) with reinforcement learning (RL) algorithms. The new version partly implements the ideas of our project proposal [29]. The main changes from the previous release (0.2.9, released on 2022.02.26) are:

  • guiding two popular provers instead of a single experimental one (Sect. 3)

  • pluggable first-order logic formulae embeddings support (Sect. 4)

  • examples of experiments with different RL algorithms (Sect. 5)

  • following the updated Gymnasium [35] API instead of the outdated OpenAI Gym

gym-saturation works with Python 3.8+. One can install it with pip install gym-saturation or conda install -c conda-forge gym-saturation. Then, provided the Vampire and/or iProver binaries are on PATH, one can use it as any other Gymnasium environment:

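A minimal usage sketch follows; the environment identifier "Vampire-v0" and the random agent that samples only actions allowed by action_mask (see Sect. 3.2) are our illustrative assumptions, and the exact identifiers may differ between gym-saturation versions.

```python
# A minimal sketch, assuming the environment id "Vampire-v0" (there is an
# analogous iProver-based environment) and a random agent restricted to the
# valid actions indicated by action_mask.
import gymnasium as gym
import numpy as np

import gym_saturation  # noqa: F401  (importing the package registers the environments)

env = gym.make("Vampire-v0")
observation, info = env.reset()
terminated, truncated = False, False
while not (terminated or truncated):
    valid_actions = np.flatnonzero(observation["action_mask"])
    action = np.random.choice(valid_actions)  # a random but currently valid given clause
    observation, reward, terminated, truncated, info = env.step(action)
env.close()
```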

2 Related Work

Guiding provers with RL is an active research topic. Recent projects in this domain include TRAIL (Trial Reasoner for AI that Learns) [2], FLoP (Finding Longer Proofs) [37], and lazyCoP [26]. We now compare the new gym-saturation features with these three projects.

Usually, one guides either a new prover created for that purpose (lazyCoP; FLoP builds on fCoP [14], an OCaml rewrite of the older leanCoP [19]) or an experimental patched version of an existing one (TRAIL relies on a modified E [27]). In contrast, gym-saturation works with unmodified stable versions of Vampire [15] and iProver [10].

In addition, known RL-guiding projects are prover-dependent: FLoP could, in principle, work with both fCoP and leanCoP but reported experiments only with fCoP. TRAIL claims to be reasoner-agnostic, but to the best of our knowledge, no one has tried it with anything but the patched E version it uses by default. [26] mentions an anonymous reviewer's suggestion to create a standalone tool usable by other existing systems, but we are not aware of further development in this direction. In contrast, we have tested gym-saturation compatibility with two different provers (Vampire and iProver).

Deep learning models expect their input to be real-valued tensors and not, for example, character strings in the TPTP [32] language. Thus, one always uses a representation (or embedding): a function mapping a (parsed) logic formula to a real vector. In lazyCoP and FLoP, parts of the embedding functions belong to the underlying provers, making them harder to vary and experiment with (e.g., doing so requires Rust or OCaml programming skills). gym-saturation leaves the choice of representation open and supports any mapping from TPTP-formatted strings to real vectors. The version described in this work also provides a couple of default options.

3 Architecture and Implementation Details

3.1 Architecture

gym-saturation is compatible with Gymnasium [35], a maintained fork of the now-outdated OpenAI Gym standard for RL environments, and passes all required environment checks. As a result of our migration to Gymnasium, its maintainers featured gym-saturation in their curated list of third-party environments.

Previously, gym-saturation guided an experimental pure Python prover [28], which proved too slow and was abandoned in favour of existing highly efficient provers: Vampire and iProver.

Although a gym-saturation user communicates with iProver and Vampire in the same manner, under the hood the two provers use different protocols. For Vampire, we rely on the so-called manual (interactive) clause selection mode implemented several years ago for an unrelated task [11]. In this mode, instead of applying selection heuristics, Vampire interrupts the saturation loop and listens on standard input for the number of the next given clause. Independently of this mode, Vampire writes newly inferred clauses to its standard output (controlled by the show_all option). Using the Python package pexpect, we attach to Vampire's standard input and output, pass the action chosen by the agent to the former, and read observations from the latter. In manual clause selection mode, Vampire thus works like a server awaiting a request with an action to which it replies (exactly what an environment typically does).
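
The following is a simplified sketch of this protocol, not the actual gym-saturation code: the command-line flags and the prompt text are placeholders and depend on the Vampire version.

```python
# Simplified sketch of driving Vampire's manual clause selection via standard
# input/output with pexpect. The flags and the prompt string below are
# placeholders, not the exact ones used by gym-saturation.
import pexpect

vampire = pexpect.spawn(
    "vampire --manual_cs on --show_all on problem.p",  # hypothetical invocation
    encoding="utf-8",
)

def read_new_clauses() -> list:
    """Collect clauses printed since the last action, up to the next selection prompt."""
    vampire.expect("Pick a clause:")  # placeholder prompt text
    return vampire.before.splitlines()

try:
    observation = read_new_clauses()
    while True:
        action = 0  # here an RL agent would choose the number of the given clause
        vampire.sendline(str(action))
        observation += read_new_clauses()
except pexpect.EOF:
    pass  # Vampire terminated, e.g. after finding a refutation
```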

iProver recently added support for being guided by external agents. The agent has to be a TCP server satisfying a particular API specification, so iProver behaves as a client that sends requests with observations to that server and awaits replies containing actions. To make this work with gym-saturation, we implemented a relay server: it accepts a long-running TCP connection from a running iProver thread, stores the incoming requests in a thread-safe queue, and pops responses from another such queue filled by the gym-saturation thread. See Fig. 1 for the communication scheme.
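
A schematic version of such a relay is sketched below; the newline-delimited JSON framing and the port number are our illustrative assumptions, not iProver's actual wire format.

```python
# Schematic relay server: iProver connects as a TCP client; its requests go to
# one thread-safe queue, and replies are taken from another queue filled by the
# environment thread.
import json
import queue
import socketserver

requests: queue.Queue = queue.Queue()   # filled by the handler, consumed by the environment
responses: queue.Queue = queue.Queue()  # filled by the environment, consumed by the handler

class RelayHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        for raw_line in self.rfile:             # one long-running connection
            requests.put(json.loads(raw_line))  # store iProver's request (observation)
            reply = responses.get()             # block until the agent supplies an action
            self.wfile.write((json.dumps(reply) + "\n").encode("utf8"))

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("localhost", 8000), RelayHandler) as server:
        server.serve_forever()
```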

Fig. 1. gym-saturation interacting with iProver

3.2 Implementation Details

Clause Class. A clause is a Python data class with the following fields and respective values:

  • literals — a string of clause literals in the TPTP format, e.g. ‘member(X0,bb) | member(X0,b)’

  • label — a string label of a clause, e.g. ‘21’. Some provers (e.g. Vampire) use integer numbers for labelling clauses, but others (e.g. iProver) use an alphanumeric mixture (e.g. ‘c_54’)

  • role — a string description of a clause role in a proof (hypothesis, negated conjecture, axiom, et cetera)

  • inference_rule — a string name of an inference rule used to produce the clause. It includes not only resolution and superposition but also values like ‘axiom’ and ‘input’ (for theorem assumptions)

  • inference_parents — a tuple of clause labels if needed by the inference rule (‘axiom’ doesn’t need any, ‘factoring’ expects only one, ‘resolution’ — two, et cetera)

  • birth_step — an integer step number when the clause appeared in the proof state. Axioms, assumptions, and the negated conjecture have birth step zero.

All these fields except the birth_step (computed by the environment itself) are already available as separate entities (and not parts of TPTP-formatted strings) in iProver and Vampire output.
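
In code, such a clause could be declared roughly as follows; the exact field types and defaults in gym-saturation may differ.

```python
# A sketch of the clause data class matching the fields described above
# (field types and the default values are our assumptions).
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Clause:
    literals: str                            # e.g. "member(X0,bb) | member(X0,b)"
    label: str                               # e.g. "21" (Vampire) or "c_54" (iProver)
    role: str                                # e.g. "axiom", "negated_conjecture"
    inference_rule: str                      # e.g. "resolution", "axiom", "input"
    inference_parents: Tuple[str, ...] = ()  # labels of parent clauses, if any
    birth_step: int = 0                      # 0 for axioms, assumptions, negated conjecture
```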

Environment Class.

Observation is a Python dictionary with several keys:

  • real_obs is a tuple of all clauses (processed and unprocessed). It can be transformed into a tensor representation by so-called observation wrappers. gym-saturation provides several such wrappers for the cases of an external embedding service or a hand-coded feature extraction function (a toy example follows this list)

  • action_mask is a numpy [13] array of size max_clauses (a parameter one can set when instantiating the environment object) with the value 1.0 at index i if and only if the clause with zero-based order number i currently exists and can be chosen as the given clause (e.g. has not been eliminated as redundant). All other values of action_mask are zeros. This array simplifies tensor operations on observation representations.
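
As a toy example of hand-coded feature extraction, one could write a Gymnasium observation wrapper like the following. This is our illustration, not one of the wrappers shipped with gym-saturation, and it assumes the Clause fields listed above.

```python
# A toy hand-coded feature extraction written as a Gymnasium observation wrapper.
# Updating the observation_space accordingly is omitted for brevity.
import gymnasium as gym
import numpy as np

class ToyFeatures(gym.ObservationWrapper):
    """Replace real_obs with a (max_clauses, 2) matrix of simple clause features."""

    def observation(self, observation):
        features = np.zeros((len(observation["action_mask"]), 2))
        for i, clause in enumerate(observation["real_obs"]):
            # two crude features: clause length and number of literals
            features[i] = (len(clause.literals), clause.literals.count("|") + 1)
        return {"action_mask": observation["action_mask"], "real_obs": features}
```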

Limiting the total number of clauses in the proof state serves as a proxy for both the memory limit (each clause needs storage space) and the time limit (a prover has to process every clause it encounters) typical of the CASC [33] competition. One can add a standard Gymnasium time-limit wrapper to limit the number of steps in an episode, as shown below. Setting wall-clock time and RAM limits is not typical for RL research.
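
For example (the environment identifier, the max_clauses keyword argument, and the step limit are illustrative assumptions):

```python
# Limiting the number of environment steps per episode with the standard
# Gymnasium TimeLimit wrapper; the particular values are arbitrary.
import gymnasium as gym
import gym_saturation  # noqa: F401

env = gym.wrappers.TimeLimit(
    gym.make("Vampire-v0", max_clauses=20),
    max_episode_steps=100,
)
```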

Action is the zero-based order number of a clause from real_obs. If the respective action_mask entry is zero, the environment raises an exception during the execution of the step method.

Reward is 1.0 after a step if we found the refutation at this step, and 0.0 otherwise. One can change this behaviour either with Gymnasium reward wrappers (see the sketch below) or by collecting trajectories in a local buffer and post-processing them before feeding them to a trainer.
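
For instance, the standard TransformReward wrapper can reshape the sparse reward; the particular shaping below is only an illustration.

```python
# Illustrative reward shaping: subtract a small per-step penalty so that
# shorter proofs get a higher return.
import gymnasium as gym
import gym_saturation  # noqa: F401

env = gym.wrappers.TransformReward(
    gym.make("Vampire-v0"),        # the environment id is an assumption
    lambda reward: reward - 0.01,  # small negative reward for every step
)
```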

An episode is terminated when the empty clause $false appears in the proof state or when there are no more available actions.

An episode is truncated when there are more than max_clauses clauses in the proof state. Since the state is an (extendable) tuple, we don't raise an exception if a prover generates a few more clauses than the limit.

The info dictionary is empty at every step by default.

The environment supports two standard render modes, ‘human’ and ‘ansi’: the first prints the proof state as a TPTP-formatted string, and the second returns the same string.

Multi-task Environment. The latest gym-saturation follows the Meta-World benchmark [36] style and defines a set_task method with one argument: the full path of a TPTP problem. If one resets an environment without explicitly setting a task in advance, the environment defaults to a simple group theory problem (every idempotent element equals the identity). Having a default task helps us keep compatibility with algorithms not aware of multi-task RL. One can inherit from the gym-saturation environment classes to set a random problem at every reset or to implement any other desirable behaviour.
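
For example, a training loop might set a fresh problem before each episode (the environment id and the problem path below are placeholders):

```python
# Setting a task explicitly before an episode.
import gymnasium as gym
import gym_saturation  # noqa: F401

env = gym.make("Vampire-v0")
env.unwrapped.set_task("/path/to/TPTP/Problems/SET/SET001-1.p")  # placeholder path
observation, info = env.reset()
```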

4 Representation Subsystem

4.1 Existing First-Order Formulae Representations and Related Projects

As mentioned in Sect. 2, to apply any deep reinforcement learning algorithm, one first needs a representation of the environment state in tensor form. There are many known feature engineering procedures: the features can be as simple as clause age and weight [25], or information extracted from a clause syntax tree [18] or from the inference lineage of a clause [30]. Representing logic formulae as such is an active research domain: for example, in [23], the authors proposed more than a dozen different embedding techniques based on formula syntax. Communities other than automated deduction also study first-order formula representations: for example, in [5], the authors use a semantic rather than syntactic representation. One can also notice that first-order logic (FOL) is nothing more than a formal language, so abstract syntax trees of FOL formulae are not, in principle, that different from those of programming language statements. And, of course, encoding models for programming languages exist, such as code2vec [4] for Java, as well as commercially available solutions like GPT-3 [7] generic code embeddings and comparable free models like LLaMA [34].

To make the first step in this direction, we took advantage of existing pre-trained embedding models for programming languages and tried to apply them to a seemingly disconnected domain of automated provers.

4.2 ast2vec and Our Contributions to It

In [20], the authors proposed a particular neural network architecture they call Recursive Tree Grammar Autoencoders (RTG-AE), which encodes abstract syntax trees produced by a programming language parser into real vectors. Being interested in education applications, they also published the pre-trained ast2vec model for Python [21]. To make use of it for our purposes, we furnished several technical improvements to their code (our contribution is freely available):

  • a TorchServe [24] handler for HTTP POST requests for embeddings

  • request caching with the Memcached server [9]

  • a Docker container to start the whole subsystem easily on any operating system

Fig. 2. gym-saturation communication with ast2vec

To integrate the ast2vec server with gym-saturation environments, we added Gymnasium observation wrappers, one of which maps a clause in the TPTP language to a boolean-valued expression in Python (in particular, by replacing logic operation symbols, e.g. = in TPTP becomes == in Python); a simplified sketch of this translation follows. See Fig. 2 for a communication diagram. In principle, since a clause doesn't contain any explicit quantifiers, one can rewrite it as a boolean-valued expression in many programming languages for which pre-trained embeddings might exist.
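
The sketch below is ours and covers only a handful of operators; the actual wrapper handles more of the TPTP syntax.

```python
# A much-simplified translation of a TPTP clause into a Python boolean-valued
# expression; only equality, inequality, disjunction, and negation are handled.
def tptp_to_python(literals: str) -> str:
    """Rewrite a TPTP clause as a Python boolean-valued expression (simplified)."""
    translated = (
        literals.replace("!=", " __NEQ__ ")  # protect TPTP inequality first
        .replace("=", " == ")                # TPTP equality becomes Python equality
        .replace("__NEQ__", "!=")
        .replace("|", " or ")                # disjunction
        .replace("~", " not ")               # negation
    )
    return " ".join(translated.split())      # normalise whitespace

# tptp_to_python("member(X0,bb) | ~ X0 = bb") == "member(X0,bb) or not X0 == bb"
```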

4.3 Latency Considerations

Looking at Fig. 2, one might wonder how efficient such an architecture is. The average response time observed in our experiments was 2 ms (with a 150 ms maximum). A typical natural language processing model which embeds whole texts has a latency from 40 ms to more than 600 ms [17] (depending on the model complexity and the length of the text to embed) when run on a CPU, so there is no reason to believe that ast2vec is too slow. When evaluating a prover, one usually fixes a time limit: for example, 60 s is the default value for Vampire. Being written in C++ with a cornucopia of optimisation tweaks, Vampire can generate around a million clauses during this relatively short timeframe. Thus, to be on par with Vampire, a representation service must have a latency of around 60 \(\upmu \)s (orders of magnitude faster than what we have). There are several ways to lower the latency:

  • run inference in batches (the embedding model must be trained to support this; ast2vec doesn't do it out of the box). The improvement may vary

  • use a GPU. NVIDIA reports around a 20x improvement over CPU [16]. However, adding more GPUs won't be as effective without the batch inference from the previous point

  • request an embedding for a binary object representing an already parsed clause instead of a TPTP string. This avoids repeating the parsing already done by the prover, which might lower the latency substantially; to do this, one would have to patch the underlying prover to return binary objects instead of TPTP strings

  • use RPC (remote procedure call) instead of REST. TorchServe relies on REST with payloads in JSON format, whereas gRPC [12] prefers the binary protobuf format. One rarely expects sub-millisecond latency from REST, while for RPC, 150 \(\upmu \)s is not unusual. This point doesn't make much sense without the previous one

5 Usage Examples

We provide examples of experiments easily possible with gym-saturation as supplementary code to this paper. We don't consider these experiments to be of any scientific significance per se; they serve merely as illustrations and basic usage examples. Tweaking the RL algorithms' meta-parameters and deep neural network architectures is out of the scope of the present system description.

We coded these experiments in the Ray framework, which includes RLlib, a library of popular RL algorithms. Ray is compatible with both the TensorFlow [1] and PyTorch [22] deep learning frameworks, so it doesn't tie a potential gym-saturation user to either one.

In the experiments, we try to solve SET001-1 from the TPTP with max_clauses=20 (no more than twenty clauses in the proof state) when guiding Vampire and with max_clauses=15 when guiding iProver. The difference exists because even a random agent communicating with iProver always manages to solve SET001-1 while generating no more than twenty clauses. We wanted the training to start while keeping the examples as simple as possible, so we chose to tighten the constraints rather than move on to a more complicated problem.

In one experiment, we organise clauses into two priority queues (by age and by weight) and use an action wrapper to map a queue number (0 or 1) to a clause number; a sketch of such a wrapper follows this paragraph. That means we don't implant these queues inside the provers but follow the idiomatic Gymnasium way of extending environments. Of course, Vampire and iProver already have these particular queues as part of their implementation, but our illustration shows that one could use any other priorities instead. This transforms our environment into a semblance of a two-armed bandit, and we use Thompson sampling [3] for training. This experiment reflects ideas similar to those described in [31].
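
The wrapper below is a simplified sketch of this idea, not the supplementary code itself; it assumes the Clause fields from Sect. 3.2 and uses the literal string length as a crude weight proxy.

```python
# Expose a two-action interface on top of a gym-saturation environment:
# action 0 picks the oldest available clause, action 1 the "lightest" one.
import gymnasium as gym
import numpy as np

class AgeWeightWrapper(gym.Wrapper):
    def __init__(self, env: gym.Env):
        super().__init__(env)
        self.action_space = gym.spaces.Discrete(2)
        self._observation = None

    def reset(self, **kwargs):
        self._observation, info = self.env.reset(**kwargs)
        return self._observation, info

    def step(self, action):
        step_result = self.env.step(self._select(action))
        self._observation = step_result[0]
        return step_result

    def _select(self, action: int) -> int:
        valid = np.flatnonzero(self._observation["action_mask"])
        clauses = self._observation["real_obs"]
        if action == 0:  # "age" queue: the earliest-born available clause
            return int(min(valid, key=lambda i: clauses[i].birth_step))
        # "weight" queue: the shortest available clause (string length as a proxy)
        return int(min(valid, key=lambda i: len(clauses[i].literals)))
```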

In another experiment, we use the ast2vec server to get clause embeddings and train the Proximal Policy Optimisation (PPO) algorithm as implemented in Ray RLlib; a schematic setup follows. The default policy network there is a fully connected one, and we used \(256\times 20\) tensors as its input (256 is the embedding size in ast2vec, and 20 is the maximal number of clauses we embed). So, the policy chooses a given clause given the embeddings of all clauses seen up to the current step (including those already chosen or judged to be redundant/subsumed). Such an approach is more similar to [37].
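
The setup below is schematic: "VampireAst2vec-v0" is a hypothetical identifier for a gym-saturation environment already wrapped to emit ast2vec embeddings, and the training parameters are illustrative, not tuned (see the supplementary code for the full experiment).

```python
# Schematic RLlib PPO training loop on an embedding-producing environment.
from ray.rllib.algorithms.ppo import PPOConfig

algorithm = (
    PPOConfig()
    .environment("VampireAst2vec-v0")  # hypothetical, embedding-wrapped environment id
    .training(train_batch_size=256)    # illustrative value, not a tuned one
    .build()
)
for iteration in range(10):
    print(algorithm.train())  # prints the RLlib results dict, including episode rewards
```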

Fig. 3. Episode reward mean vs the total number of steps. The blue line is for a random agent and the orange one for the PPO agent. Both agents guide Vampire (Color figure online)

We provide Fig. 3 as a typical training process chart.

6 Conclusion and Future Work

We contributed a new version of gym-saturation, which remains free and open-source software, easy to install and use, and which assists in setting up experiments for RL research in the automated provers domain. In the new version, we enable anyone interested to conduct experiments with RL algorithms independently of the underlying prover implementation. We also added the possibility of varying representations as external plug-ins for further experimentation. We hope that researchers equipped with such an instrument can focus on more advanced questions, namely how to generate and prioritise training problems to better transfer search patterns learned on simpler theorems to harder ones.

Our experience with adding Vampire and iProver support to gym-saturation shows that working closely with the corresponding prover developers is not mandatory, although it can help immensely. Implementing prover guidance through standard I/O (as in Vampire) seems to be relatively easy, and we hope more provers will add similar functionality in the future to become more ML-friendly. Such provers could then also profit from any other external guidance (see [8] for a different system using the same iProver technical features as we did).

We identify a discerning and computationally efficient representation service as the bottleneck of our approach and envision an upcoming project of creating a universal first-order logic embedding model usable not only by saturation-style provers but also by tableaux-based ones, SMT solvers, semantic reasoners, and beyond.