
1 Introduction

A rigorous formalization of desired system requirements is indispensable when performing any verification-related task, such as model checking [7], synthesis [6], or runtime verification [20]. Writing formal specifications, however, is an error-prone and time-consuming manual task typically reserved for experts in the field. This paper presents nl2spec, a framework, accompanied by a web-based tool, to facilitate and automate writing formal specifications (in LTL [34] and similar temporal logics). The core contribution is a new methodology to decompose the natural language input into sub-translations by utilizing Large Language Models (LLMs). The nl2spec framework provides an interface to interactively add, edit, and delete these sub-translations instead of attempting to grapple with the entire formalization at once (a feature that is sorely missing in similar work, e.g., [13, 30]).

Figure 1 shows the web-based frontend of nl2spec. As an example, we consider the following system requirement given in natural language: “Globally, grant 0 and grant 1 do not hold at the same time until it is allowed”. The tool automatically translates the natural language specification correctly into the LTL formula G((!((g0 & g1)) U a)). Additionally, the tool generates sub-translations, such as the pair (“do not hold at the same time”, !(g0 & g1)), which help in verifying the correctness of the translation.

Fig. 1. A screenshot of the web-interface for nl2spec.

Consider, however, the following ambiguous example: “a holds until b holds or always a holds”. Human supervision is needed to resolve the ambiguity in the operator precedence. This can be easily achieved with nl2spec by adding or editing a sub-translation using explicit parentheses (see Sect. 4 for more details and examples). To capture such (and other types of) ambiguity in a benchmark data set, we conducted an expert user study specifically asking for challenging translations of natural language sentences to LTL formulas.

The key insight in the design of nl2spec is that the translation process can be decomposed automatically into many sub-translations via LLMs, and this decomposition allows users to easily resolve ambiguous natural language and erroneous translations by interactively modifying sub-translations. The central goal of nl2spec is to keep human supervision minimal and efficient. To this end, all translations are accompanied by a confidence score; alternative suggestions for sub-translations can be chosen via a drop-down menu, and misleading sub-translations can be deleted before the next translation loop. We evaluate the end-to-end translation accuracy of our proposed methodology on the benchmark data set obtained from our expert user study. Note that nl2spec can be adapted to the user’s respective application domain to increase translation quality. As proof of concept, we provide additional examples, including one for STL [31], in the GitHub repository.

nl2spec is agnostic to machine learning models and specific application domains. We will discuss possible parameterizations and inputs of the tool in Sect. 3. We discuss our sub-translation methodology in more detail in Sect. 3.2 and introduce an interactive few-shot prompting scheme for LLMs to generate them. We evaluate the effectiveness of the tool to resolve erroneous formalizations in Sect. 4 on a data set obtained from conducting an expert user study. We discuss limitations of the framework and conclude in Sect. 5. For additional details, please refer to the complete version [8].

2 Background and Related Work

2.1 Natural Language to Linear-Time Temporal Logic

Linear-time Temporal Logic (LTL) [34] is a temporal logic that forms the basis of many practical specification languages, such as the IEEE property specification language (PSL) [22], Signal Temporal Logic (STL) [31], or System Verilog Assertions (SVA) [43]. By focusing on the prototype temporal logic LTL, we keep the nl2spec framework extendable to specification languages in specific application domains. LTL extends propositional logic with the temporal modalities U (until) and X (next). There are several derived operators, such as \(\texttt {F} \varphi \equiv true \texttt {U} \varphi \) and \(\texttt {G} \varphi \equiv \lnot \texttt {F} \lnot \varphi \). \(\texttt {F} \varphi \) states that \(\varphi \) will eventually hold in the future, and \(\texttt {G} \varphi \) states that \(\varphi \) holds globally. Operators can be nested: \(\texttt {G} \texttt {F} \varphi \), for example, states that \(\varphi \) has to occur infinitely often. LTL specifications describe a system’s behavior and its interaction with an environment over time. For example, given processes 0 and 1 and a shared resource, the formula \(\texttt {G} (r_0 \rightarrow \texttt {F} g_0) \wedge \texttt {G} (r_1 \rightarrow \texttt {F} g_1) \wedge \texttt {G} \lnot (g_0 \wedge g_1)\) states that whenever a process requests \((r_i)\) access to the shared resource, it will eventually be granted \((g_i)\). The subformula \(\texttt {G} \lnot (g_0 \wedge g_1)\) ensures that grants are mutually exclusive.
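To make these semantics concrete, the following minimal sketch (not part of nl2spec) evaluates LTL formulas over a *finite* trace. True LTL is defined over infinite traces; interpreting G and U on a finite suffix, as done here, is a common finite-trace approximation (as in LTLf), and the tuple encoding of formulas is our own for illustration:

```python
def holds(phi, trace, i=0):
    """phi: an atomic proposition (str) or a nested tuple such as
    ('G', ('not', ('and', 'g0', 'g1'))); trace: a list of sets of the
    atomic propositions that hold at each time step."""
    if isinstance(phi, str):                         # atomic proposition
        return phi in trace[i]
    op = phi[0]
    if op == 'not':
        return not holds(phi[1], trace, i)
    if op == 'and':
        return holds(phi[1], trace, i) and holds(phi[2], trace, i)
    if op == 'or':
        return holds(phi[1], trace, i) or holds(phi[2], trace, i)
    if op == '->':
        return (not holds(phi[1], trace, i)) or holds(phi[2], trace, i)
    if op == 'X':                                    # next
        return i + 1 < len(trace) and holds(phi[1], trace, i + 1)
    if op == 'F':                                    # finally / eventually
        return any(holds(phi[1], trace, k) for k in range(i, len(trace)))
    if op == 'G':                                    # globally
        return all(holds(phi[1], trace, k) for k in range(i, len(trace)))
    if op == 'U':                                    # until
        return any(holds(phi[2], trace, k)
                   and all(holds(phi[1], trace, j) for j in range(i, k))
                   for k in range(i, len(trace)))
    raise ValueError(f"unknown operator {op!r}")

# Mutual exclusion of grants from the arbiter example: G !(g0 & g1)
mutex = ('G', ('not', ('and', 'g0', 'g1')))
print(holds(mutex, [{'g0'}, {'g1'}, set()]))   # True: never both at once
print(holds(mutex, [{'g0', 'g1'}]))            # False: both hold at step 0
```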

Early work in translating natural language to temporal logics focused on grammar-based approaches that could handle structured natural language [17, 24]. A survey of research before the advent of deep learning is provided in [4]. Other approaches include an interactive method using SMT solving and semantic parsing [15], or structured temporal aspects in grounded robotics [45] and planning [32]. Neural networks have only recently been used to translate into temporal logics, e.g., by training a model for STL from scratch [21], fine-tuning language models [19], or applying GPT-3 [13, 30] in a one-shot fashion, where [13] outputs a restricted set of declare templates [33] that can be translated to a fragment of LTLf [10]. Translating natural language to LTL has been of particular interest to the robotics community (see [16] for an overview), where datasets and application domains are, in contrast to our setting, based on structured natural language. Independent of relying on structured data, all previous tools lack the detection and interactive resolution of the inherent ambiguity of natural language, which is the main contribution of our framework. Related to our approach is recent work [26], where generated code is iteratively refined to match desired outcomes based on human feedback.

2.2 Large Language Models

LLMs are large neural networks, typically consisting of up to 176 billion parameters, pre-trained on massive amounts of data, such as “The Pile” [14]. Examples of LLMs include the GPT [36] and BERT [11] model families, open-source models, such as T5 [38] and Bloom [39], and commercial models, such as Codex [5]. LLMs are Transformers [42], the state-of-the-art neural architecture for natural language processing. Transformers have also shown remarkable performance when applied to classical problems in verification (e.g., [9, 18, 25, 40]), reasoning (e.g., [28, 50]), and the auto-formalization [35] of mathematics and formal specifications (e.g., [19, 21, 49]).

In language modelling, we model the probability of a sequence of tokens in a text [41]. The joint probability of tokens in a text is generally expressed as [39]:

$$\begin{aligned} p(x) = p(x_1,\ldots ,x_T) = \prod ^T_{t=1}p(x_t|x_{<t}) , \end{aligned}$$

where x is the sequence of tokens, \(x_t\) represents the t-th token, and \(x_{<t}\) is the sequence of tokens preceding \(x_t\). We refer to this as an autoregressive language model, which iteratively predicts the probability of the next token. Neural network approaches to language modelling have superseded classical approaches, such as n-grams [41]. Transformers [42] in particular have been shown to be the most effective architecture at the time of writing [1, 23, 36].
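As a toy illustration of this factorization (the conditional probabilities below are invented; a real LLM computes them with a Transformer), the joint probability of a token sequence is simply the product of the per-token conditionals:

```python
import math

# Hypothetical conditional distributions p(next_token | prefix).
cond = {
    ():         {'G': 0.6, 'F': 0.4},
    ('G',):     {'(': 0.9, 'a': 0.1},
    ('G', '('): {'a': 0.7, 'b': 0.3},
}

def joint_log_prob(tokens):
    """Sum of log p(x_t | x_<t) over the sequence, i.e., the log of the
    product in the equation above."""
    logp = 0.0
    for t, tok in enumerate(tokens):
        logp += math.log(cond[tuple(tokens[:t])][tok])
    return logp

print(math.exp(joint_log_prob(['G', '(', 'a'])))  # 0.6 * 0.9 * 0.7 = 0.378
```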

While fine-tuning neural models on a specific translation task remains a valid approach, which has also shown initial success in generalizing to unstructured natural language when translating to LTL [19], a common technique to obtain high performance with a limited amount of labeled data is so-called “few-shot prompting” [3]. The language model is presented with a natural language description of the task, usually accompanied by a few examples that demonstrate the input-output behavior. The framework presented in this paper relies on this technique. We describe the proposed few-shot prompting scheme in detail in Sect. 3.2.
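For illustration, a few-shot prompt for NL-to-LTL translation might look as follows (a schematic example of the general technique, not the prompt used by nl2spec, which is introduced in Sect. 3.2); the model is expected to continue the text after the last “LTL:”:

```text
Translate the following natural language sentences into LTL formulas.

Input: a holds in every step.
LTL: G a

Input: whenever a holds, b eventually holds.
LTL: G (a -> F b)

Input: grant 0 and grant 1 never hold at the same time.
LTL:
```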

Currently implemented in the framework and used in the expert-user study are Codex and Bloom, which showed the best performance during testing.

Codex and GPT-3.5-turbo. Codex [5] is a GPT-3 variant, initially up to 12B parameters in size, that was fine-tuned on code. The initial version of GPT-3 itself was trained on variations of Common Crawl, Webtext-2 [37], two internet-based book corpora, and Wikipedia [3]. The fine-tuning dataset for the vanilla version of Codex was collected in May 2020 from 54 million public software repositories hosted on GitHub, using 159GB of training data. For our experiments, we used the commercial 2022 version of code-davinci-002, which is likely larger (in the 176B range) than the vanilla Codex models. GPT-3.5-turbo is the currently available follow-up model of GPT-3.

Bloom. Bloom [39] is an open-source LLM family available in different sizes of up to 176B parameters, trained on 46 natural languages and 13 programming languages. It was trained on the ROOTS corpus [27], a collection of 498 huggingface [29, 48] datasets consisting of 1.61 terabytes of text. For our experiments, we used the 176B version running on the huggingface inference API.

3 The nl2spec Framework

3.1 Overview

The framework follows a standard frontend-backend implementation. Figure 2 shows an overview of the implementation of nl2spec. Parts of the framework that can be extended for further research or usage in practice are highlighted. The framework is implemented in Python 3 and flask [44], a lightweight WSGI web application framework. For the experiments in this paper, we use the OpenAI library and the huggingface transformers library [47]. We parse the LTL output formulas with a standard LTL parser [12]. The tool can be run either from the command line or with the web-based frontend.

Fig. 2. Overview of the nl2spec framework with a human-in-the-loop: highlighted areas indicate parts of the framework that are effortlessly extendable.

The frontend handles the interaction with a human-in-the-loop. The interface is structured into three views: the “Prompt”, “Sub-translations”, and “Final Result” views (see Fig. 1). The tool takes a natural language sentence, optional sub-translations, the model temperature, and the number of runs as input. It provides sub-translations, a confidence score, alternative sub-translations, and the final formalization as output. The frontend then allows for interactively selecting, editing, deleting, or adding sub-translations. The backend implements the handling of the underlying neural models, the generation of the prompt, and the ambiguity resolution, i.e., computing the confidence scores, including alternative sub-translations, and the interactive few-shot prompting algorithm (cf. Sect. 3.2). The framework is designed to have an easy interface for implementing new models and writing domain-specific prompts. The prompt is a .txt file that can be adjusted to specific domains to increase the quality of translations. To apply the sub-translation refinement methodology, however, the prompt needs to follow our interactive prompting scheme, which we introduce in the next section.
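As a rough sketch of this frontend-backend split (the endpoint name, payload fields, and the backend_translate stub are hypothetical, not nl2spec’s actual API), a flask backend exposing the inputs and outputs described above could look like this:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def backend_translate(nl, given_subs, temperature, runs):
    """Stub standing in for the real backend, which builds the prompt,
    queries the selected LLM `runs` times, and aggregates the votes."""
    return given_subs, {k: 1.0 for k in given_subs}, "G(a -> F b)"

@app.route("/translate", methods=["POST"])
def translate():
    data = request.get_json()
    subs, conf, formula = backend_translate(
        data["natural_language"],
        data.get("sub_translations", {}),  # user-edited (fragment, formula) pairs
        data.get("temperature", 0.2),
        data.get("num_runs", 3),
    )
    return jsonify({"sub_translations": subs,
                    "confidence": conf,
                    "final_formula": formula})

if __name__ == "__main__":
    app.run(debug=True)
```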

3.2 Interactive Few-Shot Prompting

The core of the methodology is the decomposition of the natural language input into sub-translations. We introduce an interactive prompting scheme that generates sub-translations using the underlying neural model and leverages the sub-translations to produce the final translation. Algorithm 1 depicts a high-level overview of the interactive loop. The main idea is to give a human-in-the-loop the option to add, edit, or delete sub-translations and to feed them back into the language models as “Given translations” in the prompt (see Fig. 3). After querying a language model M with the prompt F, model-specific parameters P, and the interactive prompt that is computed in the loop, the model generates a natural language explanation, a dictionary of sub-translations, and the final translation. Notably, the model M can be queried multiple times, as specified by the number of runs r, thereby generating multiple possible sub-translations. The confidence score of each sub-translation is computed by voting over the multiple queries, and by default the sub-translation with the highest confidence score is selected as a given sub-translation in the next iteration. In the frontend, the user may view and select alternative generated sub-translations for each sub-translation via a drop-down menu (see Fig. 1).

Fig. 3. Prompt with minimal domain knowledge of LTL.

Figure 3 shows a generic prompt that illustrates our methodology. The prompting scheme consists of three parts: the specification-language-specific part (lines 1–4), the few-shot examples (lines 5–19), and the interactive prompt including the natural language and sub-translation inputs (not displayed, given as input). The specification-language-specific part leverages “chain-of-thought” prompt engineering to elicit reasoning from large language models [46]. The key to nl2spec, however, is the setup of the few-shot examples. This minimal prompt consists of two few-shot examples (lines 5–12 and 12–19). The end of an example is indicated by the “FINISH” token, which is the stop token for the machine learning models. A few-shot example in nl2spec consists of the natural language input (line 5), a dictionary of given translations, i.e., the sub-translations (line 5), an explanation of the translation in natural language (lines 6–10), an explanation dictionary summarizing the sub-translations, and finally, the final LTL formula.
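To make this structure tangible, a few-shot example of this shape might look as follows (the field labels and wording here are an illustrative sketch reusing the introduction’s example; the actual minimal prompt is shown in Fig. 3 and in the repository):

```text
Natural Language: Globally, grant 0 and grant 1 do not hold at the same time
until it is allowed.
Given translations: {}
Explanation: "Globally" translates to the temporal operator G. "Grant 0 and
grant 1 do not hold at the same time" translates to !(g0 & g1). "Until it is
allowed" translates to the temporal operator U with right-hand side a.
Explanation dictionary: {"globally": "G", "grant 0 and grant 1 do not hold at
the same time": "!(g0 & g1)", "until it is allowed": "U a"}
So the final LTL translation is: G((!(g0 & g1)) U a).FINISH
```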

This prompting scheme elicits sub-translations from the model, which serve as a fine-grained explanation of the formalization. Note that sub-translations provided in the prompt are neither unique nor exhaustive, but provide the context for the language model to generate the correct formalization.

Algorithm 1. The interactive sub-translation refinement loop of nl2spec.
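A simplified Python sketch of this loop follows (function and parameter names are ours, not the tool’s internal API; `model` stands for one sampled LLM completion already parsed into a sub-translation dictionary and a final formula):

```python
from collections import Counter

def translation_loop(model, prompt_template, nl_input, given_subs, runs=3):
    """One iteration of the interactive loop: query the LLM `runs` times,
    aggregate sub-translations by voting, and return the results.
    `model` is assumed to return (sub_translations: dict, final_formula: str)."""
    sub_votes = {}                      # NL fragment -> Counter over formulas
    final_votes = Counter()             # votes over the final formulas
    for _ in range(runs):
        prompt = prompt_template.format(input=nl_input, given=given_subs)
        subs, final = model(prompt)     # one sampled completion
        final_votes[final] += 1
        for fragment, formula in subs.items():
            sub_votes.setdefault(fragment, Counter())[formula] += 1

    # Confidence score = fraction of runs agreeing on a sub-translation;
    # the highest-scoring alternative is selected by default.
    best_subs, confidences = {}, {}
    for fragment, votes in sub_votes.items():
        formula, count = votes.most_common(1)[0]
        best_subs[fragment] = formula
        confidences[fragment] = count / runs
    return best_subs, confidences, final_votes.most_common(1)[0][0]

# A human-in-the-loop may now add, edit, or delete entries of `best_subs`
# before they are fed back as "Given translations" in the next call.
```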

4 Evaluation

In this section, we evaluate our framework and prompting methodology on a data set obtained by conducting an expert user study. To show the general applicability of the framework, we use the minimal prompt that includes only minimal domain knowledge of the specification language (see Fig. 3). This prompt was intentionally written before conducting the expert user study. We limited the prompt to two few-shot examples and provided no few-shot example that includes “given translations”. We use the minimal prompt to focus the evaluation on the effectiveness of our interactive sub-translation refinement methodology in resolving ambiguity and fixing erroneous translations. In practice, one would replace this minimal prompt with domain-specific examples that capture the underlying distribution as closely as possible. As a proof of concept, we elaborate on this in the full version [8].

4.1 Study Setup

To obtain a benchmark dataset of unstructured natural language and its formalizations into LTL, we asked five experts in the field to provide examples that they considered challenging for a neural translation approach. Unlike existing datasets that follow a strict grammatical and syntactic structure, we posed no such restrictions on the study participants. Each natural language specification was restricted to one sentence and to the five atomic propositions a, b, c, d, e. Note that nl2spec is not restricted to a specific set of atomic propositions (cf. Fig. 1); which variable scheme to use can be specified as an initial sub-translation. We elaborate on this in the full version [8]. To ensure unique instances, the experts worked in a shared document, resulting in 36 benchmark instances. We provide three randomly drawn examples for the interested reader:

natural language S: If b holds then, in the next step, c holds until a holds or always c holds
LTL specification \(\psi \): b -> X ((c U a) || G c)

natural language S: If b holds at some point, a has to hold somewhere beforehand
LTL specification \(\psi \): (F b) -> (!b U (a & !b))

natural language S: One of the following aps will hold at all instances: a,b,c
LTL specification \(\psi \): G(a | b | c)

The poor performance of existing methods (cf. Table 1) exemplifies the difficulty of this data set.

4.2 Results

We evaluated our approach using the minimal prompt (if not stated otherwise), with the number of runs set to three and a temperature of 0.2.

Quality of Initial Translation. We analyze the quality of initial translations, i.e., translations obtained before any human interaction. This experiment demonstrates that the initial translations are of high quality, which is important to ensure an efficient workflow. We compared our approach to fine-tuning language models on structured data [19] and to an approach using GPT-3 or Rasa [2] to translate natural language into a restricted set of declare patterns [13] (which could not handle most of the instances in the benchmark data set, even when replacing the atomic propositions with their used entities). The results of evaluating the accuracy of the initial translations on our expert benchmark set are shown in Table 1.

At the time of writing, using Codex in the backend outperforms GPT-3.5-turbo and Bloom on this task by correctly translating \(44.4\%\) of the instances using the minimal prompt. We only count an instance as correctly translated if it matches the intended meaning of the expert; no alternative translation of ambiguous input was accepted. In addition to the experiments using the minimal prompt, we conducted experiments with an augmented prompt containing in-distribution examples, constructed after the user study by randomly drawing four examples from the expert data set (three of these examples had not been solved before; see the GitHub repository or the full version for more details). With this in-distribution prompt (ID), the tool correctly translates 21 instances (with the four drawn examples remaining in the set), i.e., \(58.3\%\).

Table 1. Translation accuracy on the benchmark data set, where B stands for Bloom, C for Codex, and G for GPT-3.5-turbo.

This experiment shows 1) that the initial translation quality is high and that the tool handles unstructured natural language better than previous approaches, and 2) that drawing the few-shot examples in distribution only slightly increased translation quality for this data set. This makes the key contributions of nl2spec, i.e., ambiguity detection and effortless debugging of erroneous formalizations, all the more valuable. Since nl2spec is agnostic to the underlying machine learning models, we expect even better performance in the future with more fine-tuned models.

Teacher-Student Experiment. In this experiment, we generated an initial set of sub-translations with Codex as the underlying neural model. We then ran the tool with Bloom as the backend, taking these sub-translations as input. There were 11 instances that Codex could solve initially but Bloom could not. Of these, Bloom was able to solve 4 instances, i.e., \(36.4\%\), with the sub-translations provided by Codex. The four instances that Bloom was able to solve with the help of Codex were: “It is never the case that a and b hold at the same time.”, “Whenever a is enabled, b is enabled three steps later.”, “If it is the case that every a is eventually followed by a b, then c needs to holds infinitely often.”, and “One of the following aps will hold at all instances: a,b,c”. This demonstrates that our sub-translation methodology is a valid approach: improving the quality of the sub-translations indeed has a positive effect on the quality of the final formalization. This holds true even when using underperforming neural network models. Note that no human supervision was needed in this experiment to improve the formalization quality.

Ambiguity Detection. Out of the 36 instances in the benchmark set, at least 9 contain ambiguous natural language. We observed two classes of ambiguity in particular: 1) ambiguity due to the limits of natural language, e.g., operator precedence, and 2) ambiguity in the semantics of natural language. nl2spec can help in resolving both types of ambiguity. Details for the following examples can be found in the full version [8].

An example of the first type of ambiguity from our dataset is the example mentioned in the introduction: “a holds until b holds or always a holds”, which the expert translated into (a U b) | G a. Running the tool, however, translated this example into (a U (b | G(a))). By editing the sub-translation of “a holds until b holds” to (a U b), i.e., adding explicit parentheses, the tool translates as intended. An example of the second type of ambiguity is the following instance from our data set: “Whenever a holds, b must hold in the next two steps.” The intended meaning of the expert was G (a -> (b | X b)), whereas the tool translated this sentence into G((a -> X(X(b)))). After changing the sub-translation of “b must hold in the next two steps” to b | X b, the tool translates the input as intended.

Fixing Erroneous Translations. Given the inherent ambiguity of natural language and the unstructured nature of the input, the tool’s translation cannot be expected to always be correct on the first try. Verifying and debugging sub-translations, however, is significantly easier than redrafting the complete formula from scratch. Twenty instances of the data set were not correctly translated in the initial attempt using Codex and the minimal prompt in the backend (see Table 1). We were able to extract correct translations for 15 of these instances by performing at most three translation loops (i.e., adding, editing, and removing sub-translations), needing 1.86 translation loops on average. For example, consider the instance “whenever a holds, b holds as well”, which the tool mistakenly translated to G(a & b). By fixing the sub-translation of “b holds as well” to the formula fragment -> b, the sentence is translated as intended. Only the remaining five instances, which contain highly complex natural language requirements such as “once a happened, b won’t happen again”, needed to be translated by hand.

In total, we correctly translated 31 out of 36 instances, i.e., \(86.11\%\), using the nl2spec sub-translation methodology, performing only 1.4 translation loops on average (see Table 1).

5 Conclusion

We presented nl2spec, a framework for translating unstructured natural language to temporal logics. A limitation of this approach is its reliance on computational resources at inference time, which is a general limitation of applying deep learning techniques. Both commercial and open-source models, however, provide easily accessible APIs. Additionally, the quality of initial translations might be influenced by the amount of training data on logics, code, or math that the underlying neural models have seen during pre-training.

At the core of nl2spec lies a methodology to decompose the natural language input into sub-translations, which are mappings of formula fragments to relevant parts of the natural language input. We introduced an interactive prompting scheme that queries LLMs for sub-translations and implemented an interface for users to interactively add, edit, and delete sub-translations, which spares users from manually redrafting the entire formalization to fix erroneous translations. We conducted a user study showing that nl2spec can be efficiently used to interactively formalize unstructured and ambiguous natural language.