1 Introduction

Program verification means demonstrating that an implementation exhibits the behaviour required by a specification. But where do specifications come from? Handcrafting specifications does not scale. One solution is to learn them automatically from example runs of a system. This is sometimes referred to as trace analysis. A trace, in this context, is a sequence of events or states captured during the execution of a system. Once captured, traces are often converted into a form more suitable for further processing, such as finite state automata or logical formulae. Converting traces into logical formulae can be done with program synthesis. Program synthesis is an umbrella term for the algorithmic generation of programs (and similar formal objects, like logical formulae) from specifications; see [9, 16] for an overview. Arguably, the most popular logic for representing traces is linear temporal logic (LTL) [30], a modal logic for specifying properties of finite or infinite traces. The LTL learning problem idealises the algorithmic essence of learning specifications from example traces, and is given as follows.

  • Input: Two sets P and N of traces over a fixed alphabet.

  • Output: An LTL formula \(\phi \) that is (i) sound: all traces in P are accepted by \(\phi \), all traces in N are rejected by \(\phi \); (ii) minimal, meaning no strictly smaller sound formula exists.

When we weaken minimality to minimality-up-to-\(\epsilon \), we speak of approximate LTL learning. Both forms of LTL learning are NP-hard [11, 27]. A different and simpler problem is noisy LTL learning, which is permitted to learn unsound formulae, albeit only up to a given error rate.

LTL learning is an active research area in software engineering, formal methods, and artificial intelligence [2, 5, 6, 7, 12, 13, 14, 15, 19, 20, 23, 24, 26, 28, 29, 32, 34, 35, 36]. We refer to [6] for a longer discussion. Many approaches to LTL learning have been explored. One common and natural method uses search-based program synthesis, often paired with templates or sketches, such as parts of formulae, automata, or regular expressions. Another leverages SAT solvers. LTL learning has also been pursued using Bayesian inference or inductive logic programming. Learning specifically tailored small fragments of LTL often yields the best results in practice [31, 32]. Learning from noisy data is investigated in [15, 26, 29]. What all approaches have in common is that they do not scale, and that none have been optimised for GPUs. Traces arising in industrial practice are commonly long (millions of characters), and numerous (millions of traces). Extracting useful information automatically at such scale is currently a major problem: e.g., the state-of-the-art learner in [31, 32] cannot reliably learn formulae greater than size 10. This is less than ideal. Our aim is to change this.

Graphics Processing Units (GPUs) are the work-horses of high-performance computing. The acceleration they provide to applications compatible with their programming paradigm can surpass CPU performance by several orders of magnitude, as notably evidenced by the advancements in deep learning. A significant spectrum of applications, especially within automated reasoning (SAT/SMT solvers and model checkers, for example), has yet to reap the benefits of GPU acceleration. In order for an application to be “GPU-friendly”, it needs to have high parallelism, minimal data-dependent branching, and predictable data movement with substantial data locality [8, 17, 18]. Current automated reasoning algorithms are predominantly branching-intensive and appear sequential in nature, but it is unclear whether they are inherently sequential, or can be adapted to GPUs.

Research question. Can we scale LTL learning to at least 1000 times more traces without sacrificing trace length, learning speed or approximation ratio (cost increase of learned formula over minimum) compared to existing work, by employing suitably adapted algorithms on a GPU?

We answer the RQ in the affirmative by developing the first GPU-accelerated LTL learner. Our work takes inspiration from [33], the first GPU-accelerated minimal regular expression inferencer. Scaling has two core orthogonal dimensions: more traces, and longer traces. We solve one problem [33] left open: scaling to more traces. Our key decision, giving up on learning minimal formulae while remaining sound and complete, enables two principled algorithmic techniques.

  • Divide-and-conquer (D&C). If a learning task has too many traces, split it into smaller specifications, learn those recursively, and combine the learned formulae using logical connectives.

  • Relaxed uniqueness checks (RUCs). Generate-and-test program synthesis often caches synthesis results to avoid recomputation. [33] granted cache admission only after a uniqueness check. We relax uniqueness checking by (pseudo-)randomly rejecting some unique formulae.

In addition, we design novel algorithms and data structures, representing LTL formulae as contiguous matrices of bits. This allows a GPU-friendly implementation of all logical operations with linear memory access and suitable machine instructions, free from data-dependent branching. Both D&C and RUCs may lose minimality and are thus unavailable to [33]. Our benchmarks show that the approximation ratio is typically small.

Contributions. In summary, our contributions are as follows:

  • A new enumeration algorithm for LTL learning, with a branch-free implementation of LTL semantics that is \(O(\log n)\) in trace length (assuming unit cost for logical and shift operations).

  • A CUDA implementation of the algorithm, for benchmarking and inspection.

  • A parameterised benchmark suite useful for evaluating the performance of LTL learners, and a novel methodology for quantifying the loss of minimality induced by approximate LTL learning.

  • Performance benchmarks showing that our implementation is both faster, and can handle orders of magnitude more traces, than existing work.

2 Formal Preliminaries

We write \(\# S\) for the cardinality of set S. \(\mathbb {N}= \{0, 1, 2, ...\}\), [n] stands for \(\{0, 1, ..., n-1\}\) and [m, n] for \(\{m, m+1, ..., n-1\}\). \(\mathbb {B}\) is \(\{0, 1\}\) where 0 is falsity and 1 truth. \(\mathfrak {P}(A)\) is the powerset of A. The characteristic function of a set \(S \subseteq A\) is the function \(\textbf{1}^{A}_{S} : A \rightarrow \mathbb {B}\) which maps \(a \in A\) to 1 iff \(a \in S\). We usually write \(\textbf{1}_{S}\) for \(\textbf{1}^{A}_{S}\). An alphabet is a finite, non-empty set \(\varSigma \), the elements of which are characters. A string of length \(n \in \mathbb {N}\) over \(\varSigma \) is a map \(w : [n] \rightarrow \varSigma \). We write \(|\!|w|\!|\) for n. We often write \(w_{i}\) instead of w(i), and \(v \cdot w\), or just vw, for the concatenation of v and w, \(\epsilon \) for the empty string and \(\varSigma ^{*}\) for all strings over \(\varSigma \). A trace is a string over the powerset alphabet, i.e., an element of \((\mathfrak {P}(\varSigma ))^{*}\). We call \(\varSigma \) the alphabet of the trace and write \(\textsf{traces}(\varSigma )\) for all traces over \(\varSigma \). A word is a trace where each character has cardinality 1. We abbreviate words to the corresponding strings, e.g., \(\langle \{t\}, \{i\}, \{n\} \rangle \) to tin. We say v is a suffix of w if \(w = u v\), and if \(|\!|u|\!| = 1\) then v is an immediate suffix. We write \(\textsf{sc}(S)\) for the suffix-closure of S. S is suffix-closed if \(\textsf{sc}(S) \subseteq S\). \(\textsf{sc}^{+}(S)\) is the non-empty suffix closure of S, i.e., \(\textsf{sc}(S) \setminus \{\epsilon \}\). From now on we will speak of the suffix-closure to mean the non-empty suffix-closure. The Hamming distance between two strings s and t of equal length, written \(\textsf{hamm}(s, t)\), is the number of indices i where \(s(i) \ne t(i)\). We write \( \textsf{Hamm}(s, \delta )\) for the set \(\{t \in \varSigma ^*\ |\ \textsf{hamm}(s, t) = \delta , |\!|s|\!| = |\!|t|\!|\}\).

LTL formulae over \(\varSigma = \{p_1, ..., p_n\}\) are given by the following grammar.

$$ \phi \ {:}{:}{=}\ p \mid \lnot \phi \mid \phi \wedge \phi \mid \phi \vee \phi \mid \textsf{X}\phi \mid \textsf{F}\phi \mid \textsf{G}\phi \mid \phi \mathbin {\textsf{U}}\phi \qquad \text {where}\ p \in \varSigma $$

The subformulae of \(\phi \) are denoted \(\textsf{sf}(\phi )\). We say \(\psi \in \textsf{sf}(\phi )\) is proper if \(\phi \ne \psi \). A formula is in negation normal form (NNF) if all subformulae containing negation are of the form \(\lnot p\). It is \(\mathbin {\textsf{U}}\)-free if no subformula is of the form \(\phi \mathbin {\textsf{U}}\psi \). We write \(\textsf{LTL}(\varSigma )\) for the set of all LTL formulae over \(\varSigma \). We use \(\textsf{true}\) as an abbreviation for \(p \vee \lnot p\) and \(\textsf{false}\) for \(p \wedge \lnot p\). We call \(\textsf{X}, \textsf{F}, \textsf{G}, \mathbin {\textsf{U}}\) the temporal connectives, \(\wedge , \vee , \lnot \) the propositional connectives, p the atomic propositions and, collectively, name them the LTL connectives. Since we learn from finite traces, we interpret LTL over finite traces [10]. The satisfaction relation \(tr, i \models \phi \), where tr is a trace over \(\varSigma \) and \(\phi \in \textsf{LTL}(\varSigma )\), is standard; here are some example clauses: \(tr, i \models \textsf{X}\phi \), if \(tr, i+1 \models \phi \); \(tr, i \models \textsf{F}\phi \), if there is \(i \le j < |\!|tr|\!|\) with \(tr, j \models \phi \); and \(tr, i \models \phi \mathbin {\textsf{U}}\phi '\), if there is \(i \le j < |\!|tr|\!|\) such that: \(tr, k \models \phi \) for all \(i \le k < j\), and \(tr, j \models \phi '\). If \(i \ge |\!|tr|\!|\) then \(tr, i \models \phi \) is always false, and \(tr \models \phi \) is short for \(tr, 0 \models \phi \).

A cost-homomorphism is a map \( \textsf{cost}(\cdot )\) from LTL connectives to positive integers. We extend it to LTL formulae homomorphically: \(\textsf{cost}(\phi \ op\ \psi ) = \textsf{cost}(op) + \textsf{cost}(\phi ) + \textsf{cost}(\psi )\), and likewise for other arities. If \(\textsf{cost}(op) = 1\) for all LTL connectives we speak of uniform cost. So the uniform cost of \(\textsf{true}\) and \(\textsf{false}\) is 4. From now on all cost-homomorphisms will be uniform, except where stated otherwise.

A specification is a pair (P, N) of finite sets of traces such that \(P \cap N = \emptyset \). We call P the positive examples and N the negative examples. We say \(\phi \) satisfies, separates or solves (P, N), denoted \(\phi \models (P, N)\), if for all \(tr \in P\) we have \(tr \models \phi \), and for all \(tr \in N\) we have \(tr \not \models \phi \). A sub-specification of (P, N) is any specification \((P', N')\) such that \(P' \subseteq P\) and \(N' \subseteq N\). Symmetrically, (P, N) is an extension of \((P', N')\). We can now make the LTL learning problem precise:

  • Input: A specification (P, N), and a cost-homomorphism \(\textsf{cost}(\cdot )\).

  • Output: An LTL formula \(\phi \) that is sound, i.e., \(\phi \models (P, N)\), and minimal, i.e., \(\psi \models (P, N)\) implies \(\textsf{cost}(\phi ) \le \textsf{cost}(\psi )\).

Cost-homomorphisms let us influence LTL learning: e.g., by assigning a high cost to a connective, we prevent it from being used in learned formulae. The language at i of \(\phi \), written \(\textsf{lang}(i, \phi )\), is \(\{tr \in \textsf{traces}(\varSigma )\ |\ tr, i \models \phi \}\). We write \(\textsf{lang}(\phi )\) as a shorthand for \(\textsf{lang}(0, \phi )\) and speak of the language of \(\phi \). We say \(\phi \) denotes a language \(S \subseteq \varSigma ^*\), resp., a trace \(tr \in \varSigma ^*\), if \(\textsf{lang}(\phi ) = S\), resp., \(\textsf{lang}(\phi ) = \{tr\}\). We say two formulae \(\phi _1\) and \(\phi _2\) are observationally equivalent, written \(\phi _1 \simeq \phi _2\), if they denote the same language. Let S be a set of traces. Then we write

$$ \phi _1 \simeq \phi _2 \mod S \qquad \text {iff}\qquad \textsf{lang}(\phi _1) \cap S = \textsf{lang}(\phi _2) \cap S $$

and say \(\phi _1\) and \(\phi _2\) are observationally equivalent modulo S. The following related definitions will be useful later. Let (P, N) be a specification. The cardinality of (P, N), denoted \(\# (P, N)\), is \(\# P + \# N\). The size of a set S of traces, denoted \(|\!|S|\!|\), is \(\varSigma _{tr \in S} |\!|tr|\!|\). We extend this to specifications: \(|\!|(P, N)|\!|\) is \(|\!|P|\!| + |\!|N|\!|\). The cost of a specification (P, N), written \(\textsf{cost}(P, N)\), is the uniform cost of a minimal sound formula for (P, N). An extension of (P, N) is conservative if any minimal sound formula for (P, N) is also minimal and sound for the extension. We note a useful fact: if \(\phi \) is a minimal solution for (P, N), and also \(\phi \models (P', N')\), then \((P \cup P', N \cup N')\) is a conservative extension.

Overfitting. It is possible to express a trace tr, respectively a set S of traces, by a formula \(\phi \), in the sense that \(\textsf{lang}(\phi ) = \{tr\}\), resp., \(\textsf{lang}(\phi ) = S\). We define the function \(\textsf{overfit}(\cdot )\) on characters (which are sets of atomic propositions), traces, sets of traces, and specifications as follows.

  • \(\textsf{overfit}(\{a_1, ..., a_k\}) = (\bigwedge _{i} a_i) \wedge \bigwedge _{b \in \varSigma \setminus \{a_1, ..., a_k\}} \lnot b \).

  • \(\textsf{overfit}(\epsilon ) = \lnot \textsf{X}(\textsf{true})\) and \(\textsf{overfit}(a \cdot tr) = \textsf{overfit}(a) \wedge \textsf{X}(\textsf{overfit}(tr))\).

  • \(\textsf{overfit}(S) = \bigvee _{tr \in S} \textsf{overfit}(tr)\)

  • \(\textsf{overfit}(P, N) = \textsf{overfit}(P)\)

The following are immediate from the definitions: (i) for all specifications (P, N): \(\textsf{lang}(\textsf{overfit}(P, N)) = P\), (ii) \(\textsf{overfit}(P, N) \models (P, N)\), and (iii) the cost of overfitting, i.e., \(\textsf{cost}(\textsf{overfit}(P, N))\), is \(O(|\!|P|\!| + \# \varSigma )\). Note that \(\textsf{overfit}(P, N)\) overfits only on P, and (ii) justifies this choice.
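The definitions transliterate directly into code. The following Python sketch is illustrative only: it builds formulae as strings and represents traces as lists of sets of atomic propositions.

    # Illustrative transliteration of the overfit clauses above; formulae are
    # built as strings, traces are lists of sets of atomic propositions.
    def overfit_char(c, sigma):
        lits = sorted(c) + ["!" + b for b in sorted(sigma - c)]
        return "(" + " & ".join(lits) + ")"

    def overfit_trace(tr, sigma):
        if not tr:                                   # overfit(epsilon) = !X(true)
            return "!X(true)"
        return "(%s & X%s)" % (overfit_char(tr[0], sigma),
                               overfit_trace(tr[1:], sigma))

    def overfit(P, N, sigma):                        # overfits on P only
        return " | ".join(overfit_trace(tr, sigma) for tr in P)

For example, overfit([[{"t"}, {"i"}, {"n"}]], [], {"t", "i", "n"}) yields a formula denoting exactly the word tin.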

3 High-Level Structure of the Algorithm

Fig. 1. High-level structure of our algorithm. LC is short for language cache.

Figure 1 shows the two main parts of our algorithm: the divide-and-conquer unit, short D&C-unit, and the enumerator. Currently, only the enumerator is implemented for execution on a GPU. For convenience, our D&C-unit is in Python and runs on a CPU. Implementing the D&C-unit on a GPU poses no technical challenges and would further improve the performance of our implementation.

Given (P, N), the D&C-unit checks if the specification is small enough to be solved by the enumerator directly. If not, the specification is recursively decomposed into smaller sub-specifications. When the recursive invocations return formulae, the D&C-unit combines them into a formula separating (P, N), see §7 for details. For small enough (P, N), the enumerator performs a bottom-up enumeration of LTL formulae by increasing cost, until it finds one that separates (P, N). Like the enumerator in [33], our enumerator uses a language cache to minimise re-computation, but with a novel cache admission policy (RUCs). The language cache is append-only, hence no synchronisation is required for read-access. The key difference from [33], our use of RUCs, is discussed in §6.

The enumerator has three core parameters.

  • T = maximal number of traces in the specification (P, N).

  • L = number of bits usable for representing each trace from (P, N) in memory.

  • W = number of bits (P, N) is hashed to during enumeration.

We write Enum(T, L, W) to emphasise those parameters. Our current implementation hard-codes all parameters as Enum(64, 64, 128), but the abstract algorithm does not depend on this. The choice of \(W = 128\) is a consequence of the current limitations of WarpCore [21, 22], a CUDA library for high-performance hashing of 32 and 64 bit integers. All three parameters heavily affect memory consumption. We chose \(T = 64\) and \(L = 64\) for convenient comparison with existing work in §8. While T, L and W are parameters of the abstract algorithm, the implementation is not parameterised: changing these parameters requires changing parts of the code. Making the implementation fully parametric is conceptually straightforward, but introduces a substantial number of new edge cases, primarily where parameters are not powers of 2, which increases verification effort.

We now sketch the high-level structure of \(\texttt{enum}\), the entry point of the enumerator, taking a specification and a cost-homomorphism as arguments. For ease of presentation, we use LTL formulae as search space. Their representation in the implementation is discussed in §4.

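The entry point is shown here as an illustrative Python-style sketch; helper names such as solves, init_cache and overfit_cost are placeholders rather than the actual implementation, and comments indicate the lines referenced in the explanation below.

    # Illustrative sketch of enum; solves, atoms, init_cache, overfit_cost,
    # overfit and CONNECTIVES are placeholders, not the actual implementation.
    def enum(P, N, cost):
        for p in atoms(P, N):                   # Line 4: try atomic propositions
            if solves(p, P, N):
                return p
        cache = init_cache(P, N)                # Line 5: CMs of all atomic propositions
        for c in range(2, overfit_cost(P, N)):  # bottom-up; uniform cost assumed
            cache[c] = []                       # new empty entry for cost c
            for op in CONNECTIVES:              # Line 8: map over LTL connectives
                phi = handle_op(op, c, cache, P, N, cost)
                if phi is not None:             # a sound formula was found
                    return phi
        return overfit(P, N)                    # completeness by overfitting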

Line 4 checks if the learning problem can be solved with an atomic proposition. If not, Line 5 initialises the global language cache with the representation of atomic propositions, and search starts from the lowest cost upwards. For each cost c a new empty entry is added to the language cache. Line 8 then maps over LTL connectives and calls \(\texttt{handleOp}\) to construct all formulae of cost c using all suitable lower cost entries in the language cache. When no sound formula can be found with cost less than \(\textsf{cost}(\textsf{overfit}(P, N))\), the algorithm terminates, returning \(\textsf{overfit}(P, N)\). This makes our algorithm complete, in the sense of learning a formula for every specification.

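A sketch of \(\texttt{handleOp}\) in the same illustrative style; op.arity and op.apply (the branch-free semantic function of the connective) are placeholder assumptions, and the parallel \(\texttt{for all}\) of the implementation is indicated by a comment.

    from itertools import product

    # Illustrative sketch of handleOp; op.apply stands for the branch-free
    # semantic function, e.g. branchfree_F for F.
    def handle_op(op, c, cache, P, N, cost):
        budget = c - cost[op]
        if op.arity == 1:
            args_list = ((phi,) for phi in cache.get(budget, ()))
        else:
            args_list = ((phi, psi)
                         for c1 in range(1, budget)
                         for phi, psi in product(cache.get(c1, ()),
                                                 cache.get(budget - c1, ())))
        for args in args_list:                  # 'for all': parallel on the GPU
            phi_new = op.apply(*args)
            hit = relaxed_check_and_cache(phi_new, c, cache, P, N)
            if hit is not None:
                return hit
        return None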

The function \(\texttt{handleOp}\) dispatches on LTL connectives, retrieves all previously constructed formulae of suitable cost from the language cache in parallel (we use \(\texttt{for all}\) to indicate parallel execution), calls the appropriate semantic function, detailed in the next section, e.g., \(\mathtt {branchfree\_F}\) for \(\textsf{F}\), to construct \(\mathtt {phi\_new}\), and then sends it to \(\texttt{relaxedCheckAndCache}\) to check if it already solves the learning task, and, if not, for potential caching. This is where most of the parallelism in our implementation arises; the upside of the language cache's rapid growth is the concomitant growth in available parallelism, which effortlessly saturates every conceivable processor.

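And a sketch of \(\texttt{relaxedCheckAndCache}\), again with placeholder helpers:

    # Illustrative sketch of relaxedCheckAndCache; solves and passes_ruc
    # are placeholders.
    def relaxed_check_and_cache(phi_new, c, cache, P, N):
        if solves(phi_new, P, N):      # Line 26: satisfaction check (soundness)
            return phi_new             # terminate with the learned formula
        if passes_ruc(phi_new):        # Line 28: relaxed uniqueness check (§6)
            cache[c].append(phi_new)   # Line 29: parallel, append-only update
        return None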

This last step checks if \(\mathtt {phi\_new}\) satisfies (P, N). If yes, the program terminates with the formula corresponding to \(\mathtt {phi\_new}\). Otherwise, Line 28 conducts a RUC, a relaxed uniqueness check, described in detail in §6, to decide whether to cache \(\mathtt {phi\_new}\) or not. Updating the language cache in Line 29 is done in parallel, and needs little synchronisation, see [33] for details. The satisfaction check in Line 26 guarantees that our algorithm is sound. It also makes it trivial to implement noisy LTL learning: just replace the precise check \(\mathtt {phi\_new} \models (P, N)\) with a check that \(\mathtt {phi\_new}\) gets a suitable fraction of the specification right.

Fig. 2. Data representation in memory (simplified).

4 In-Memory Representation of Search Space

Our enumerator does generate-and-check synthesis. That means we have two problems: (i) minimising the cardinality of the search space, i.e., the representation of LTL formulae during synthesis; (ii) making generation and checking as cheap as possible. For each candidate \(\phi \) checking means evaluating the predicate

$$ \phi \models (P, N) \qquad (\dagger ) $$

LTL formulae, the natural choice of search space, suffer from the redundancies of syntax: every language that is denoted by a formula at all is denoted by infinitely many, e.g., \(\textsf{lang}(\textsf{F}\phi ) = \textsf{lang}(\textsf{F}\textsf{F}\phi )\). Even observational equivalence distinguishes too many formulae: the predicate (\(\dagger \)) checks the language of \(\phi \) only for elements of \(P \cup N\). Formulae modulo \(P \cup N\) contain exactly the right amount of information for (\(\dagger \)), hence minimise the search space. However, the semantics of formulae on \(P \cup N\) is given compositionally in terms of the non-empty suffix-closure of \(P \cup N\), which would have to be recomputed at run-time for each new candidate. Since \(P \cup N\) remains fixed, so does \(\textsf{sc}^{+}(P \cup N)\), and we can avoid such re-computation by using formulae modulo \(\textsf{sc}^{+}(P \cup N)\) as search space. Inspired by [33], we represent formulae \(\phi \) by characteristic functions \(\textbf{1}_{\textsf{lang}(\phi )} : \textsf{sc}^{+}(P \cup N) \rightarrow \mathbb {B}\), which are implemented as contiguous bitvectors in memory, but with a twist. Fix a total order on \(P \cup N\).

  • A characteristic sequence (CS) for \(\phi \) over tr is a bitvector cs such that \(tr, j \models \phi \) iff \(cs(j) = 1\). For Enum(64, 64, 128), CSs are unsigned 64 bit integers.

  • A characteristic matrix (CM) representing \(\phi \) over \(P \cup N\), is a sequence cm of CSs, contiguous in memory, such that, if tr is the ith trace in the order, then cm(i) is the CS for \(\phi \) over tr.

This representation has two interesting properties, not present in [33]: (i) each CS is suffix-contiguous: the trace corresponding to \(cs(j+1)\) is the immediate suffix of that at cs(j); (ii) CMs contain redundancies whenever two traces in \(P \cup N\) share suffixes. Redundancy is the price we pay for suffix-contiguity. Figure 2 visualises our representation in memory.

Logical Operations as Bitwise Operations. Suffix-contiguity enables the efficient representation of logical operations: if \(cs = 10011\) represents \(\phi \) over the word abcaa, e.g., \(\phi \) is the atomic proposition a, then \(\textsf{X}\phi \) is 00110, i.e., cs shifted one to the left. Likewise \(\lnot \phi \) is represented by 01100, i.e., bitwise negation. As we use unsigned 64 bit integers to represent CSs, \(\textsf{X}\) and negation are executed as single machine instructions! In Python-like pseudo-code:

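The following sketch captures this, assuming position 0 of a CS is stored in the most significant bit of a 64-bit word, and a precomputed mask that zeroes the bits at positions beyond the trace's end (both assumptions of the sketch):

    # Sketch: position 0 of a CS occupies the most significant bit of a
    # 64-bit word; mask zeroes all bits at positions beyond the trace's end.
    WORD = (1 << 64) - 1

    def branchfree_X(cs):
        return (cs << 1) & WORD        # X phi: bit at index j+1 moves to index j

    def branchfree_neg(cs, mask):
        return ~cs & mask              # negation: bitwise NOT within the trace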

Conjunction and disjunction are equally efficient. More interesting is \(\textsf{F}\) which becomes the disjunction of shifts by powers of two, i.e., the number of shifts is logarithmic in the length of the trace (a naive implementation of \(\textsf{F}\) is linear). We call this exponential propagation, and believe it to be novel in LTL synthesis:

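In the same style, a sketch for traces of up to 64 positions:

    # Sketch of branchfree_F for L = 64: a disjunction of shifts by powers
    # of two; log2(64) = 6 shift-or steps instead of 63 naive shifts.
    def branchfree_F(cs):
        for j in (1, 2, 4, 8, 16, 32):
            cs |= (cs << j) & WORD     # WORD as in the sketch above
        return cs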

To see why this works, note that \(\textsf{F}\phi \) can be seen as the infinite disjunction \(\phi \vee \textsf{X}\phi \vee \textsf{X}^2 \phi \vee \textsf{X}^3 \phi \vee ...\), where \(\textsf{X}^{n} \phi \) is given by \(\textsf{X}^0 \phi = \phi \) and \(\textsf{X}^{n+1} \phi = \textsf{X}\textsf{X}^n\phi \). Since we work with finite traces, \(tr, i \not \models \phi \) whenever \(i \ge |\!|tr|\!|\). Hence checking \(tr, 0 \models \textsf{F}\phi \) for tr of length n amounts to checking

$$ tr, 0 \models \phi \vee \textsf{X}\phi \vee \textsf{X}^2 \phi \vee ... \vee \textsf{X}^{n-1}\phi $$

The key insight is that the imperative update \(\mathtt {cs \mathrel {|}= cs \ll j}\) propagates the bit stored at \(cs(i+j)\) into cs(i) without removing it from \(cs(i+j)\). Consider the flow of information stored in \(cs(n-1)\). At the start, this information is only at index \(n-1\). This amounts to checking \(tr, 0 \models \textsf{X}^{n-1}\phi \). Thus assigning \(\mathtt {cs\ \mathrel {|}= cs \ll 1}\) puts that information at indices \(n-2, n-1\). This amounts to checking \(tr, 0 \models \textsf{X}^{n-2}\phi \vee \textsf{X}^{n-1}\phi \). Likewise, then assigning \(\mathtt {cs\ \mathrel {|}= cs \ll 2}\) puts that information at indices \(n-4, n-3, n-2, n-1\). This amounts to checking \(tr, 0 \models \textsf{X}^{n-4}\phi \vee \textsf{X}^{n-3}\phi \vee \textsf{X}^{n-2}\phi \vee \textsf{X}^{n-1}\phi \), and so on. In a logarithmic number of steps, we reach \(tr, 0 \models \phi \vee \textsf{X}\phi \vee ... \vee \textsf{X}^{n-1}\phi \). This works uniformly for all positions, not just \(n-1\). In the limit, this saves an exponential amount of work over naive shifting.

We can implement \(\mathbin {\textsf{U}}\) using similar ideas, with the number of bitshifts also logarithmic in trace length. As with \(\textsf{F}\), this works because we can see \(\phi \mathbin {\textsf{U}}\psi \) as an infinite disjunction

$$ \psi \vee (\phi \wedge \textsf{X}\psi ) \vee (\phi \wedge \textsf{X}(\phi \wedge \textsf{X}\psi )) \vee \dots $$

We define (informally) \(\phi \mathbin {\textsf{U}}_{\le p} \psi \) as: \(\phi \) holds until \(\psi \) does, within the next p positions; and \(\textsf{G}_{\le p} \phi \) as: \(\phi \) holds for the next p positions. The additional insight allowing us to implement exponential propagation for \(\mathbin {\textsf{U}}\) is to compute both \(\textsf{G}_{\le 2^i} \phi \) and \(\phi \mathbin {\textsf{U}}_{\le 2^i} \psi \) simultaneously, for increasing values of i.

In addition to saving work, exponential propagation maps directly to machine instructions, and is essentially branch-free code for all LTL connectives, thus maximising the GPU-friendliness of our learner. In contrast, previous learners like Flie [28], Scarlet [32] and Syslite [4] implement the temporal connectives naively, e.g., checking \(tr, i \models \phi \mathbin {\textsf{U}}\psi \) by iterating from i for as long as \(\phi \) holds, stopping as soon as \(\psi \) holds. Likewise, Flie encodes the LTL semantics directly as a propositional formula. For \(\mathbin {\textsf{U}}\) this is quadratic in the length of tr for Flie and Syslite.

5 Correctness and Complexity of the Branch-Free Implementation of Temporal Operators

We reproduce here a slightly more general version (for any length) of the exponentially propagating algorithms for computing \(\textsf{F}\) and \(\mathbin {\textsf{U}}\).

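In Python-like pseudo-code, for traces of length at most L (a sketch consistent with the description in §4; as before, position 0 of a characteristic sequence occupies the most significant of L bits):

    # Sketches of the general algorithms; both run in O(log L) iterations.
    def prop_F(cs, L):
        mask, j = (1 << L) - 1, 1
        while j < L:
            cs = (cs | (cs << j)) & mask
            j <<= 1
        return cs

    def prop_U(cs1, cs2, L):
        # after the iteration for j = 2^i, cs1 represents G_{<= 2^(i+1)} phi1
        # and cs2 represents phi1 U_{<= 2^(i+1)} phi2
        mask, j = (1 << L) - 1, 1
        while j < L:
            cs2 = (cs2 | (cs1 & (cs2 << j))) & mask   # must use the old cs1
            cs1 = (cs1 & (cs1 << j)) & mask
            j <<= 1
        return cs2

For instance, prop_F(0b00001, 5) returns 0b11111: the single position satisfying \(\phi \) propagates to all earlier positions in three shift-or steps.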

They are lifted to CMs pointwise. As a warm-up, let us start with \(\textsf{F}\).

Lemma 1

Let cs be the characteristic sequence for \(\phi \) over a trace of length L. The algorithm above computes the characteristic sequence for \(\textsf{F}\phi \). Assuming bitwise boolean operations and shifts by powers of two have unit costs, the complexity of the algorithm is \(O(\log (L))\).

To ease notation, let us introduce \(\textsf{F}_{\le p}\) with the semantics \(tr, j \models \textsf{F}_{\le p} \phi \) if there is \(j \le k < \min (j + p, |\!|tr|\!|)\) with \(tr, k \models \phi \). The parameter \(p \in \mathbb {N}\) in \( \textsf{F}_{\le p} \phi \) can be read as the number of positions in tr at which \(\phi \) will be evaluated, starting from the current position j in tr.

Proof

Let us write cs for the characteristic sequence for \(\phi \) over tr, and \(cs_i\) for the characteristic sequence after the ith iteration. We write \(\log (x)\) as a shorthand for \(\lfloor \log _{2}(x) \rfloor \). We write \(\textsf{F}cs\) for \(\textsf{F}\phi \), and \(\textsf{F}_{\le p} cs\) for \(\textsf{F}_{\le p}\ \phi \). We show by induction that for all \(i \in [0, \log (L)+1]\), for all tr of length L we have:

$$ \forall j \in [0, L],\ tr, j \models cs_i \Longleftrightarrow tr, j \models \textsf{F}_{\le 2^i} cs. $$

This is clear for \(i = 0\), as it boils down to \(cs_0 = cs\). Assuming it holds for i, by definition \(cs_{i+1}(j) = cs_i(j) \vee (cs_i \ll 2^i)(j) = cs_i(j) \vee cs_i(j + 2^i)\), hence

$$ \begin{array}{lll} tr, j \models cs_{i+1} &{} \Longleftrightarrow &{} tr, j \models cs_i, \text {or}\ tr, j + 2^i \models cs_i \\ &{} \Longleftrightarrow &{} tr, j \models \textsf{F}_{\le 2^i} cs, \text {or}\ tr, j + 2^i \models \textsf{F}_{\le 2^i} cs \\ &{} \Longleftrightarrow &{} tr, j \models \textsf{F}_{\le 2^{i+1}} cs. \end{array} $$

This concludes the induction proof. For \(i = \log (L)\) we obtain

$$ \forall j \in [0, L],\ tr, j \models cs_i \Longleftrightarrow tr, j \models \textsf{F}_{\le L} cs \Longleftrightarrow tr, j \models \textsf{F}cs, $$

since clearly \(\textsf{F}= \textsf{F}_{\le L}\), when restricted to traces not exceeding L in length.    \(\square \)

We now move to \(\mathbin {\textsf{U}}\).

Lemma 2

Let \(cs_1, cs_2\) be the characteristic sequences for \(\phi _1\) and \(\phi _2\), both over traces of length L. The algorithm above computes the characteristic sequence for \(\phi _1 \mathbin {\textsf{U}}\phi _2\). Assuming bitwise boolean operations and shifts by powers of two have unit costs, the complexity of the algorithm is \(O(\log (L))\).

Again to ease notation, let us introduce \(\mathbin {\textsf{U}}_{\le p}\) with the semantics \(tr, j \models \phi _1 \mathbin {\textsf{U}}_{\le p} \phi _2\) if there is \(j \le k < \min (j + p, |\!|tr|\!|)\) such that \(tr, k \models \phi _2\) and for all \(j \le k' < k\) we have \(tr, k' \models \phi _1\). We will also need \(\textsf{G}_{\le p}\), defined with the semantics \(tr, j \models \textsf{G}_{\le p} \phi \) if for all \(j \le k < \min (j + p, |\!|tr|\!|)\) we have \(tr, k \models \phi \).

Proof

Let us write \(cs_{1,i}\) and \(cs_{2,i}\) for the respective characteristic sequences after the ith iteration. We show by induction that for all \(i \in [0, \log (L)+1]\), for all tr of length L, for all \(j \in [0, L]\), we have:

  • \(tr, j \models cs_{1,i} \Longleftrightarrow tr, j \models \textsf{G}_{\le 2^i} cs_1\), and

  • \(tr, j \models cs_{2,i} \Longleftrightarrow tr, j \models cs_1 \mathbin {\textsf{U}}_{\le 2^i} cs_2.\)

This is clear for \(i = 0\), as it boils down to \(cs_{1,0} = cs_1\) and \(cs_{2,0} = cs_2\). Assume it holds for i. Let us start with \(cs_{1,i+1}\): by definition

$$\begin{aligned} cs_{1,i+1}(j) &= cs_{1,i}(j) \wedge (cs_{1,i} \ll 2^i)(j)\\ &= cs_{1,i}(j) \wedge cs_{1,i}(j + 2^i), \end{aligned}$$

hence it is the case that

$$ \begin{array}{lll} tr, j \models cs_{1,i+1} &{} \Longleftrightarrow &{} tr, j \models cs_{1,i}, \text {and}\ tr, j + 2^i \models cs_{1,i} \\ &{} \Longleftrightarrow &{} tr, j \models \textsf{G}_{\le 2^i} cs_1, \text { and }\ tr, j + 2^i \models \textsf{G}_{\le 2^i} cs_1 \\ &{} \Longleftrightarrow &{} tr, j \models \textsf{G}_{\le 2^{i+1}} cs_1. \end{array} $$

Now, by definition \(cs_{2,i+1}(j) = cs_{2,i}(j) \vee (cs_{1,i}(j) \wedge (cs_{2,i} \ll 2^i)(j))\), which is equal to \(cs_{2,i}(j) \vee (cs_{1,i}(j) \wedge cs_{2,i}(j + 2^i))\). Hence

$$ \begin{array}{lll} tr, j \models cs_{2,i+1} &{} \Longleftrightarrow &{} tr, j \models cs_{2,i}, \text {or }\ (tr, j \models cs_{1,i}, \text { and }\ tr, j + 2^i \models cs_{2,i}) \\ &{} \Longleftrightarrow &{} tr, j \models cs_1 \mathbin {\textsf{U}}_{\le 2^i} cs_2, \text {or }\ \\ &{} &{} \left( tr, j \models \textsf{G}_{\le 2^i} cs_1, \text {and }\ tr, j + 2^i \models cs_1 \mathbin {\textsf{U}}_{\le 2^i} cs_2 \right) \\ &{} \Longleftrightarrow &{} tr, j \models cs_1 \mathbin {\textsf{U}}_{\le 2^{i+1}} cs_2. \end{array} $$

This concludes the induction proof. For \(i = \log (L)\) we obtain

$$ \forall j \in [0, L],\ tr, j \models cs_{2,i} \Longleftrightarrow tr, j \models cs_1 \mathbin {\textsf{U}}_{\le L} cs_2 \Longleftrightarrow tr, j \models cs_1 \mathbin {\textsf{U}}cs_2, $$

since clearly \(\mathbin {\textsf{U}}= \mathbin {\textsf{U}}_{\le L}\), when restricted to traces not exceeding L in length.    \(\square \)

6 Relaxed Uniqueness Checks

Our choice of search space, formulae modulo \(\textsf{sc}^{+}(P \cup N)\), while more efficient than bare formulae, still does not prevent the explosive growth of candidates: uniqueness of CMs is not preserved under LTL connectives. [33] recommends storing newly synthesised formulae in a “language cache”, but only if they pass a uniqueness check. Without this cache admission policy, the explosive growth of redundant CMs rapidly swamps the language cache with useless repetition. While uniqueness improves scalability, it just delays the inevitable: there are simply too many unique CMs. Worse: with Enum(64, 64, 128), CMs use up to 32 times more memory than language cache entries in [33]. We reduce the memory consumption of our algorithm by relaxing the strictness of uniqueness checks: we allow false positives (meaning that CMs are falsely classified as being already in the language cache), but not false negatives. We call this new cache admission policy relaxed uniqueness checks (RUCs). False positives mean that less gets cached. False positives are sound: every formula learned in the presence of false positives is separating, but no longer necessarily minimal, since every minimal solution might have some of its subformulae missing from the language cache, and hence cannot be constructed by the enumeration. False positives also do not affect completeness: in the worst case, our algorithm terminates by overfitting.

We implement the RUC using non-cryptographic hashing in several steps.

  • We treat each CM as a big bitvector, i.e., ignore its internal structure. Now there are two possibilities.

    • The CM uses more than 126 bits. Then we hash it to 126 bits using a variant of MuellerHash from WarpCore.

    • Otherwise we leave the CM unchanged (except padding it with 0s to 126 bits where necessary).

  • Only if this 126-bit sequence is unique is it added to the language cache.

If the CM is \(\le 126\) bits, then the RUC is precise, and the enumeration performs a full bottom-up enumeration of CMs, so any learned formula has minimal cost. This becomes useful in benchmarking.

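In illustrative Python-style pseudo-code, with mueller_hash_126 standing in for the WarpCore-based hash and seen for the set of already admitted 126-bit keys (both names are assumptions of the sketch):

    # Illustrative sketch of the cache admission decision.
    def passes_ruc(cm_bits, n_bits, seen):
        if n_bits > 126:
            key = mueller_hash_126(cm_bits)   # collisions => false positives
        else:
            key = cm_bits                     # zero-padded to 126 bits; precise
        if key in seen:
            return False                      # treated as a duplicate, not cached
        seen.add(key)
        return True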

Note that RUCs implemented by hashing amount to a (pseudo-)random cache admission policy. Using RUCs essentially means that hash-collisions (pseudo-)randomly prevent formulae from being subformulae of any learned solution. It is remarkable that this works well in practice; it presumably means that LTL formulae contain sufficient redundancy relative to the probability of hash collisions. We leave a detailed theoretical analysis as future work.

7 Divide & Conquer

The D &C-unit’s job is, recursively, to split specifications until they are small enough to be solved by Enum(T, L, W) in one go, and, afterwards recombine the results. A naive D &C-strategy could split (PN), when needed, into four smaller specifications \((P_i, N_j)\) for \(i, j = 1, 2\), such that P is the disjoint union of \(P_1\) and \(P_2\), and N of \(N_1\) and \(N_2\). Then it learned the \(\phi _{ij}\) recursively from the \((P_i, N_j)\), and finally combine all into

$$\begin{aligned} (\phi _{11} \wedge \phi _{12}) \vee (\phi _{21} \wedge \phi _{22}) \end{aligned}$$

which is sound for (P, N), but is not necessarily minimal. E.g., whenever \(\phi _{11}\) implies \(\phi _{12}\), then \(\phi _{11} \vee (\phi _{21} \wedge \phi _{22})\) is lower cost. Thus it might be tempting to minimise D&C-steps. Alas, the enumerator may run out-of-memory (OOM): the parameters in Enum(T, L, W) are static constraints, pertaining to data structure layout, and do not guarantee successful termination. Let us call the maximal cardinality \(\# (P, N)\) that the D&C-unit sends directly to the enumerator the split window. In order to navigate the trade-offs between avoiding OOM and minimising the approximation ratio, our refined D&C-units below use search to find as large a split window as possible. We write win for the split window parameter. Both implementations split specifications until they fit into the split window, i.e., \(\# (P, N) \le win\), and then invoke the enumerator. The split window is then successively halved, until the enumerator no longer runs OOM but returns a sound formula.

Deterministic Splitting. The idea behind \(\textsf{detSplit}(P, N, win)\) is: if \(\# (P, N) \le win\), we send (P, N) directly to the enumerator. Otherwise, assume P is \(\{p_1, ..., p_n\}\). Then \(P_1 = \{p_1, ..., p_{n/2}\}\) and \(P_2 = \{p_{n/2 + 1}, ..., p_n\}\) are the new positive sets, and likewise for N. (If the specification is given as two lists of traces, this is deterministic.) We then make 4 recursive calls, but remove redundancies in the calls' arguments.

  • \(\phi _{11} = \textsf{detSplit}(P_1, N_1, win)\),

  • \(\phi _{12} = \textsf{detSplit}(P_1, N_2 \cap \textsf{lang}(\phi _{11}), win)\),

  • \(\phi _{21} = \textsf{detSplit}(P_2 \setminus L, N_1, win)\),

  • \(\phi _{22} = \textsf{detSplit}(P_2 \setminus L, N_2 \cap \textsf{lang}(\phi _{21}), win)\),

Here \(L = \textsf{lang}(\phi _{11}) \cap \textsf{lang}(\phi _{12})\). Assuming that none of the 4 recursive calls returns OOM, the resulting formula is \((\phi _{11} \wedge \phi _{12}) \vee (\phi _{21} \wedge \phi _{22})\). Otherwise we recurse with \(\textsf{detSplit}(P, N, win/2)\).
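In illustrative Python (with placeholder helpers: enumerate_gpu returns None on OOM, accepts(phi, tr) decides \(tr \models \phi \), and And/Or build formulae), \(\textsf{detSplit}\) has roughly the following shape; folding the window-halving into the base case is a simplification of the sketch:

    # Illustrative sketch of detSplit; enumerate_gpu, accepts, And and Or
    # are placeholders, and None models an OOM result from the enumerator.
    def det_split(P, N, win):
        if len(P) + len(N) <= win:
            phi = enumerate_gpu(P, N)
            if phi is not None:
                return phi
            return det_split(P, N, win // 2)           # halve the split window
        P1, P2 = P[: len(P) // 2], P[len(P) // 2 :]
        N1, N2 = N[: len(N) // 2], N[len(N) // 2 :]
        phi11 = det_split(P1, N1, win)
        phi12 = det_split(P1, [t for t in N2 if accepts(phi11, t)], win)
        rest = [t for t in P2                          # P2 \ L
                if not (accepts(phi11, t) and accepts(phi12, t))]
        phi21 = det_split(rest, N1, win)
        phi22 = det_split(rest, [t for t in N2 if accepts(phi21, t)], win)
        return Or(And(phi11, phi12), And(phi21, phi22))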

Random Splitting. This variant of the algorithm, written \(\textsf{randSplit}(P, N, win)\), is based on the intuition that often a small number of traces already contains enough information to learn a formula for the whole specification. (E.g., the traces are generated by running the same system multiple times.)

  • \(\phi _{11} = \textsf{aux}(P, N, win)\)

  • \(\phi _{12} = \textsf{randSplit}(P \cap \textsf{lang}(\phi _{11}), N \cap \textsf{lang}(\phi _{11}), win)\)

  • \(\phi _{21} = \textsf{randSplit}(P \setminus \textsf{lang}(\phi _{11}), N \setminus \textsf{lang}(\phi _{11}), win)\)

  • \(\phi _{22} = \textsf{randSplit}(P \setminus \textsf{lang}(\phi _{11}), N \cap \textsf{lang}(\phi _{11}), win)\)

The function \(\textsf{aux}(P, N, win)\) first constructs a sub-specification \((P_{0}, N_{0})\) of (P, N) as follows. It selects two random subsets \(P_{0} \subseteq P\) and \(N_{0} \subseteq N\), such that the cardinality of \((P_{0}, N_{0})\) is as large as possible but does not exceed win; in addition we require the cardinalities of \(P_{0}\) and \(N_{0}\) to be as equal as possible. Then \((P_0, N_0)\) is sent to the enumerator. If that returns OOM, \(\textsf{aux}(P, N, win/2)\) is invoked, then \(\textsf{aux}(P, N, win/4), ...\), until the enumerator successfully learns a formula. Once \(\phi _{11}\) is available, the remaining \(\phi _{ij}\) can be learned in parallel. Finally, we return \((\phi _{11} \wedge \phi _{12}) \vee (\phi _{21} \wedge \phi _{22})\).
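A sketch of \(\textsf{randSplit}\) and \(\textsf{aux}\) in the same style, with the same placeholder helpers; the handling of trivial sub-specifications with TRUE/FALSE is an assumption of the sketch, not taken from the implementation:

    import random

    # Illustrative sketch of randSplit and aux.
    def rand_split(P, N, win):
        if not N:
            return TRUE                        # nothing left to reject
        if not P:
            return FALSE                       # nothing left to accept
        if len(P) + len(N) <= win:
            return enumerate_gpu(P, N)
        phi11 = aux(P, N, win)
        P_in  = [t for t in P if accepts(phi11, t)]
        P_out = [t for t in P if not accepts(phi11, t)]
        N_in  = [t for t in N if accepts(phi11, t)]
        N_out = [t for t in N if not accepts(phi11, t)]
        phi12 = rand_split(P_in, N_in, win)    # these three calls can run in
        phi21 = rand_split(P_out, N_out, win)  # parallel once phi11 is known
        phi22 = rand_split(P_out, N_in, win)
        return Or(And(phi11, phi12), And(phi21, phi22))

    def aux(P, N, win):
        half = min(win // 2, len(P), len(N))   # cardinalities as equal as possible
        P0, N0 = random.sample(P, half), random.sample(N, half)
        phi = enumerate_gpu(P0, N0)            # None models OOM
        return phi if phi is not None else aux(P, N, win // 2)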

Our benchmarks in §8 show that deterministic and random splitting display markedly different behaviour on some benchmarks.

8 Evaluation of Algorithm Performance

This section quantifies the performance of our implementation. We are interested in a comparison with existing LTL learners, but also in assessing the impact of different algorithmic choices on LTL learning performance. We benchmark along the following quantitative dimensions: number of traces the implementation can handle, speed of learning, and cost of inferred formulae. In our evaluation we face several challenges.

  • We are comparing a CUDA program running on a GPU with programs, sometimes written in Python, running on CPUs.

  • We run our benchmarks in Google Colab Pro. It is unclear to what extent Google Colab Pro is virtualised. We observed variations in CPU and GPU running times, for all implementations measured.

  • Existing benchmarks are too easy. They neither force our implementation to learn costly formulae, nor make it terminate later than the measurement threshold of around 0.2 s, the minimal time the Colab GPU takes on any task, including toy programs that do nothing at all on the GPU.

  • Lack of ground-truth: how can we evaluate the price we pay for scale, i.e., the loss of formula minimality guarantees from algorithmic choices, when we do not know what this minimum is?

Hardware and Software Used for Benchmarking. Benchmarks below run on Google Colab Pro. We use Colab Pro because it is a widely used industry standard for running and comparing ML workloads. Colab CPU parameters: Intel Xeon CPU (“cpu family 6, model 79”), running at 2.20GHz, with 51 GB RAM. Colab GPU parameters: Nvidia Tesla V100-SXM2, with System Management Interface 525.105.17, with 16 GB memory. We use Python version 3.10.12, and CUDA version 12.2.140. All our timing measurements are end-to-end (from invocation of \(\textsf{learn}(P, N)\) to its termination), using Python's \(\texttt{time}\) library.

Benchmark Construction. Since existing benchmarks for LTL learning are too easy for our implementation, we develop new ones. A good benchmark should be tunable by a small number of explainable parameters that allow users to dial hardness from trivial to beyond the edge of infeasibility, and any point in between. We now describe how we construct our new benchmarks.

  • By we denote the specifications generated using the following process: uniformly sample \(2\cdot k\) traces from \(\{tr \in \textsf{traces}(\varSigma )\ |\ lo \le |\!|tr|\!| \le hi\}\). Split them into two sets (P, N), each containing k traces.

  • By we mean using the sampler coming with Scarlet [32] to sample specifications (P, N) that are separated by \(\phi \), where \(\phi \) is a formula over the alphabet \(\varSigma \). Both P and N contain k traces each, and for each \( tr \in P \cup N\) we have \( lo \le |\!|tr|\!| \le hi\). The probability distribution Scarlet implements is detailed in [32].

  • By , where \(i, k \in \mathbb {N}\) and \(c \in \{conservative, \lnot conservative\}\), we mean the following process, which we also call extension by sampling.

    1. Generate (P, N) with .

    2. Use our implementation to learn a minimal formula \(\phi \) for (P, N) that is \(\mathbin {\textsf{U}}\)-free and in NNF (for easier comparison with Scarlet, which can handle neither general negation nor \(\mathbin {\textsf{U}}\)).

    3. Next we sample a specification \((P', N')\) from .

    4. The final specification is given as follows:

      • \((P \cup P', N \cup N')\) if \(c = conservative\). Hence the final specification is a conservative extension of (P, N) and its cost is \(\textsf{cost}(P, N)\).

      • \((P', N')\) otherwise.

    Note that the minimal formula required in Step 2 exists because for \(i \le 8\), any (P, N) generated by has the property that \(\# \textsf{sc}^{+}(P \cup N) \le 80 < 126\), so our algorithm uses neither RUCs nor D&C but, by construction, performs an exhaustive bottom-up enumeration that is guaranteed to learn a minimal sound formula.

  • By , with \(l, \delta \in \mathbb {N}\), we mean specifications \( (\{tr\}, \textsf{Hamm}(tr, \delta )) \), where tr is sampled uniformly from all traces of length \(l\) over \(\varSigma \).

Benchmarks from are useful for comparison with existing LTL learners, and to hone in on specific properties of our algorithm. But they don't fully address a core problem of using random traces: they tend to be too easy. One dimension of “too easy” is that specifications (P, N) of random traces often have tiny sound formulae, especially for large alphabets. Hence we use binary alphabets, the hardest case in this context. That alone is not enough to force large formulae. works well in our benchmarking: it generates benchmarks that are hard even for the GPU. We leave a more detailed investigation why as future work. Finally, in order to better understand the effectiveness of MuellerHash in our RUC, we use the following deliberately simple map from CMs to 126 bits.

First-k-percent (FKP). This scheme simply takes the first \(k\%\) of each CS in the CM. All remaining bits are discarded. The percentage k is chosen such that the result is as close as possible to 126 bits, e.g., for a 64*63 bit CM, \(k = 3\).
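One way to realise such a scheme is sketched below; the exact rounding used in the implementation may differ:

    # Illustrative sketch of FKP; cm is a list of CSs (ints of width cs_bits),
    # position 0 in the most significant bit. The kept prefixes of all CSs
    # are concatenated into a single key of roughly 126 bits.
    def fkp(cm, cs_bits, target=126):
        keep = max(1, round(target / len(cm)))  # bits kept per CS
        key = 0
        for cs in cm:
            key = (key << keep) | (cs >> (cs_bits - keep))
        return key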

Comparison with Scarlet. In this section we compare the performance of our implementation against Scarlet [32], in order to better understand how much performance we gain in comparison with a state-of-the-art LTL learner. Our comparison with Scarlet is implicitly also a comparison with Flie [28] and Syslite [4], because [32] already benchmarks Scarlet against them, and finds that Scarlet performs better. We use the following benchmarks in our comparison.

  • All benchmarks from [32], which includes older benchmarks for Flie and Syslite.

  • Two new benchmarks for evaluating scalability to high-cost formulae, and to high-cardinality specifications.

In all cases, we learn \(\mathbin {\textsf{U}}\)-free formulae in NNF for easier comparison with Scarlet. This restriction hobbles our implementation, which can synthesise cheaper formulae in unrestricted LTL.

Table 1. Comparison of Scarlet with our implementation on existing benchmarks. Timeout is 2000 s. On the existing benchmarks our implementation never runs OOM or out-of-time (OOT), while Scarlet runs OOM in 5.9% of benchmarks and OOT in 3.8%. In computing the average speedup we are conservative: we use 2000 s whenever Scarlet runs OOT; if Scarlet runs OOM, we use the time to OOM. The “Lower Cost” column gives the percentages of instances where our implementation learns a formula with lower cost than Scarlet, and likewise for “Equal” and “Higher”. Here and below, “Ave” is short for the arithmetic mean. The column on the right reports the average speedup of our implementation over Scarlet.

Scarlet on Existing Benchmarks. We run our implementation in 12 different modes: D&C by deterministic, resp., random splitting, with two different hash functions (MuellerHash and FKP), and three different split windows (16, 32, and 64). The results are visualised in Table 1. We make the following observations. On existing benchmarks, our implementation usually returns formulae that are roughly of the same cost as Scarlet's. They are typically only larger on benchmarks with a sizeable specification, e.g., 100000 traces, which forces our implementation into D&C, with the concomitant increase in approximation ratio due to the cost of recombination. However, the traces are generated by sampling from trivial formulae (mostly \(\textsf{F}p, \textsf{G}p\) or \(\textsf{G}\lnot p\)), and Scarlet handles those well. [32] defines a parameterised family \(\phi ^n_{seq}\) that can be made arbitrarily big by letting n go to infinity. However, in [32] \(n < 6\), and even on those instances Scarlet runs OOM/OOT, while our implementation handles all of them in a short amount of time. On existing benchmarks, our implementation runs on average at least 46 times faster. We believe that this surprisingly low worst-case average speedup is largely because the existing benchmarks are too easy, and the timing measurements are dominated by GPU startup latency. The comparison on harder benchmarks below bears this out.

Scarlet and High-Cost Specifications. The existing benchmarks can all be solved with small formulae. This makes it difficult to evaluate how our implementation scales when forced to learn high-cost formulae. In order to ameliorate this problem, we create a new benchmark using for \(l = 3, 6, 9, ..., 48\) and \(\delta = 1, 2\). We benchmark with the aforementioned 12 modes. The left of Table 2 summarises the results. This benchmark clearly shows that Scarlet is mostly unable to learn bigger formulae, while our implementation handles them all swiftly.

Table 2. On the left, comparison between Scarlet and our implementation on benchmarks with \(l = 3, 6, 9, ..., 48\) and \(\delta = 1, 2\). Timeout is 2000 s. The reported percentage is the fraction of specifications that were successfully learned. On the right, comparison on benchmarks from for \(k = 3, 4, 5, ..., 17\). All benchmarks were run to conclusion; Scarlet's OOMs occurred between 1980.21 s for \((2^{17}, 2^{17})\), and 16568.7 s for \((2^{13}, 2^{13})\). Recall from §2 that \(\# S\) denotes the cardinality of set S.

Scarlet and High-Cardinality Specifications. The previous benchmark addresses scalability to high-cost specifications. The present comparison with Scarlet seeks to quantify an orthogonal dimension of scalability: high-cardinality specifications. Our benchmark is generated by for \(i = 5\), and \(k = 3, 4, 5, ..., 17\). Using \(i = 5\) ensures getting a few concrete times from Scarlet rather than just OOM/OOT; \(k \le 17\) was chosen as Scarlet's sampler makes benchmark generation too time-consuming otherwise. The choice of parameters also ensures that the cost of each benchmark is moderate \((\le 20)\). This means any difficulty with learning arises from the sheer number of traces. Unlike the previous two benchmarks, we run our implementation only in one configuration: using MuellerHash, and random splitting with window size 64 (the difference between the variants is too small to affect the comparison with Scarlet in a substantial way). The results are also presented on the right of Table 2. This benchmark clearly shows that we can handle specifications at least 2048 times larger, despite Scarlet having approx. 3 times more memory available. Moreover, not only is our implementation much faster and able to handle more traces, it also finds substantially smaller formulae in all cases where a comparison is possible.

Hamming Benchmarks. We have already used Hamming-based specifications in our comparison with Scarlet. Now we abandon existing learners, and delve deeper into the performance of our implementation by having it learn costly formulae. This benchmark is generated using for \(l = 3, 6, 9, ..., 48\) and \(\delta = 1, 2\). As above, the implementation learns \(\mathbin {\textsf{U}}\)-free formulae in NNF. Figure 3 gives a more detailed breakdown of the results. The uniform cost of overfitting on each benchmark is given for comparison:

Fig. 3. Here RS means random-splitting, DS deterministic-splitting. The numbers 16, 32, 64 are the split windows used. Hsh is short for MuellerHash. The x-axis is annotated by \((trLen, \#N)\), giving the length of the single trace in P, and the cardinality \(\# N\) of N. TO denotes timeout. Timeout is 2000 s. On the left, the y-axis gives the ratio \(\frac{\text {cost of learned formula}}{\text {cost of overfitting}}\), the dotted line at 1.0 is the cost of overfitting.

Length of tr:         3   6   9  12  15  18  21  24  27  30  33  36  39  42  45  48
Cost of overfitting: 22  38  55  73  92 111 129 147 164 182 201 219 238 256 274 292

This benchmark shows the following. Hamming benchmarks are hard for our implementation, and sometimes run for \({>}{3}\) min: we successfully force our implementation to synthesise large formulae, and that has an effect on running time. The figures on the left show that random splitting typically leads to smaller formulae in comparison with deterministic splitting, especially for \(\delta = 2\). Indeed, we may be seeing a sub-linear increase in formula cost (relative to the cost of overfitting) for random splitting, while for deterministic splitting, the increase seems to be linear. In contrast, the running time of the implementation seems to be relatively independent of the splitting mechanism. It is also remarkable that the maximal cost we see is only about 3.5 times the cost of overfitting: the algorithm processes P and N, yet overfitting happens only on P, which contains a single trace. Hence the cost of overfitting (292 in the worst case) is not affected by N, which contains up to 4560 elements (of the same length as the sole positive trace).

Table 3. All run times are below the measurement threshold.

Benchmarking RUCs. Our algorithm uses RUCs, a novel cache admission policy, and it is interesting to gain a more quantitative understanding of the effects of (pseudo-)randomly rejecting some CMs. We cannot hope to come to a definitive conclusion here. Instead we simply compare MuellerHash with FKP, which neither distributes values uniformly across the hash space (our 126 bits) to minimise collisions, nor has the avalanche effect where a small change in the input produces a significantly different hash output. This weakness is valuable for benchmarking because it indicates how much a hash function can degrade learning performance. (Note that for the edge case of specifications that can be separated from just the first \(k\%\) alone, FKP should perform better, since it leaves the crucial bits unchanged.) The benchmark data is generated with . Table 3 summarises our measurements. We note the following. The loss in formula cost is roughly constant for each hash: it stabilises to around \(0.2\%\) for MuellerHash, and a little above \(2.5\%\) for FKP. Hence MuellerHash is an order of magnitude better. Nevertheless, even \(2.5\%\) should be irrelevant in practice, and we conjecture that replacing MuellerHash with a cryptographic hash will have only a moderate effect on learning performance. A surprising number of instances run OOM, more so as specification size grows, and more often with MuellerHash than with FKP. We leave a detailed understanding of these phenomena as future work.

Masking. The previous benchmarks suggested that naive hash functions like FKP sometimes work better than expected. Our last benchmark seeks to illuminate this in more detail and asks: can we relate the information loss from hashing to the concomitant increase in formula cost? A precise answer seems to be difficult. We run a small experiment: after hashing CMs of size 64*63 bits to 126 bits with MuellerHash, we add an additional information loss phase: we mask out k bits, i.e., we set them to 0. This destroys all information in the k masked bits. After masking, we run the uniqueness check. We sweep over \(k = 1, ..., 126\) with stride 5 on benchmarks generated with . Figure 4 shows the results. Before running the experiments, the authors expected a gradual increase of cost as more bits are masked out. Instead, we see a phase transition when only approx. 75 to 60 bits remain unmasked: from minimal cost formulae before, to OOM/OOT after, with almost no intermediate stages. Only a tiny number of instances have 1 or 2 further cost levels between these two extremes. We leave an explanation of this surprising behaviour as future work.

Fig. 4. Effects of masking on formula cost. Timeout is 200 s. Colours correspond to different (P, N). The slight ‘wobble’ on all graphs is deliberately introduced for readability, and is not in the data.

9 Conclusion

The present work demonstrates the effectiveness of carefully tailored algorithms and data structures for accelerating LTL learning on GPUs. We close by summarising the reasons why we achieve scale: high degree of parallelism inherent in generate-and-test synthesis; application of divide-and-conquer strategies; relaxed uniqueness checks for (pseudo-)randomly curtailing the search space; and succinct, suffix-contiguous data representation, enabling exponential propagation where LTL connectives map directly to branch-free machine instructions with predictable data movement. All but the last are available to other learning tasks that have suitable operators for recombination of smaller solutions.

LTL and GPUs, a match made in heaven.