
1 Introduction

Temporal logics are indispensable for specifying and verifying the behavior of complex systems. Linear temporal logic (ltl) and its restriction to finite traces (\(\textsc {ltl}_{f}\)) are two especially useful members of the family. ltl, for example, has been widely adopted by the robotics community [4, 5, 10, 29, 37, 42, 45, 48, 60, 70]. \(\textsc {ltl}_{f}\) has applications to runtime verification [64], web-page testing [54], business process modeling [20, 22], process mining [16], planning [13, 24, 25], reinforcement learning [21], and image processing [65]. Furthermore, both logics support good decision procedures [67] and enable program synthesis [2, 3, 7, 11, 49, 56, 62, 71].

These successes all depend, however, on a crucial assumption: that users of the logics can actually write correct specifications. Given a well-formed but incorrect formula, synthesis will output a system that behaves as specified—whether or not that is the desired behavior. It is therefore critical to know the specific misunderstandings that lead to incorrect formulas in order to correct them via tools, logic design, and teaching. That is the focus of this paper.

Contributions and Outline. After a brief introduction to ltl, \(\textsc {ltl}_{f}\), and our pedagogy (Sect. 2), we proceed with the following contributions:

  • We introduce two test instruments (Sect. 3):

    • a finite trace instrument that tests respondents’ understanding of the delta between ltl and \(\textsc {ltl}_{f}\), and

    • an introductory instrument that promotes active learning of ltl.

  • We present a dataset of over 3,000 responses collected from dozens of respondents over the past three years (Sect. 4). The data contains mistakes from beginning, knowledgeable, and expert respondents (Sect. 6).

  • We present a catalog of ltl and \(\textsc {ltl}_{f}\) misconceptions (Sect. 5) that is thoroughly grounded in the data (Sect. 7).

The main results are in Sects. 6 and 7. The paper concludes with threats to validity (Sect. 8), related work (Sect. 9), and a brief discussion (Sect. 10).

2 Background

Fig. 1. Semantics of four ltl and \(\textsc {ltl}_{f}\) operators: G, F, X, U

ltl formulas are interpreted over infinite traces, \(\sigma = s_0 s_1 s_2 \cdots \), where each \(s_i\) is a state that provides valuations for a set of atomic propositions [55]. \(\textsc {ltl}_{f}\) formulas are interpreted over finite traces, \(\sigma _N = s_0 s_1 \cdots s_N\) [69]. While ltl and \(\textsc {ltl}_{f}\) share the same syntax, their semantics differ, as shown in Fig. 1. This figure uses the notation \(\sigma (j)\) to select a suffix of \(\sigma \) starting from position j. For example, \(\sigma (2)\) is equal to \(s_2 \cdots \). An always (G) operator quantifies over all remaining states in the trace, an eventually (F) must find a satisfying suffix before the trace ends, a next (X, aka strong next) constrains the suffix after the current state, and an until (U) must find a satisfying suffix for its right operand and ensure that its left operand holds beforehand. Not pictured is the \(\textsc {ltl}_{f}\) weak next (\(X_W\), omitted to save space), which does not require that a next state exists.
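To make the finite-trace semantics concrete, the following minimal Python evaluator transcribes the clauses of Fig. 1. It is an illustrative sketch, not tooling from the study: the tuple encoding of formulas and the `holds` function are our own conventions. Later sections reuse `holds` in small brute-force checks.

```python
# A minimal LTL_f evaluator transcribing Fig. 1 (illustrative sketch,
# not the study's tooling). A trace is a list of states; each state is
# the set of atoms true in it. Formulas are nested tuples; atoms are strings.

def holds(f, trace, i=0):
    """Does the suffix of `trace` starting at position i satisfy f?"""
    if isinstance(f, str):                       # atomic proposition
        return f in trace[i]
    op, *args = f
    if op == '!':
        return not holds(args[0], trace, i)
    if op == '&':
        return holds(args[0], trace, i) and holds(args[1], trace, i)
    if op == '|':
        return holds(args[0], trace, i) or holds(args[1], trace, i)
    if op == '=>':
        return (not holds(args[0], trace, i)) or holds(args[1], trace, i)
    if op == 'X':    # strong next: a next state must exist
        return i + 1 < len(trace) and holds(args[0], trace, i + 1)
    if op == 'XW':   # weak next: vacuously true in the last state
        return i + 1 >= len(trace) or holds(args[0], trace, i + 1)
    if op == 'G':    # all remaining states
        return all(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == 'F':    # some remaining state, before the trace ends
        return any(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == 'U':    # right operand holds at some k; left holds at i..k-1
        return any(holds(args[1], trace, k) and
                   all(holds(args[0], trace, j) for j in range(i, k))
                   for k in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# F(red) rejects a finite trace with no red states (cf. Fig. 2d in Sect. 3):
print(holds(('F', 'red'), [{'blue'}, {'green'}]))   # False
```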

2.1 ltl\(_{f}\) Example: Concision via Finiteness

Finite prefixes can be expressed within an infinite ltl trace, but doing so may require intricate formulas. To illustrate, consider a busy philosopher sitting in front of a bowl of ice cream. She has a lot of thinking to do, but she must eat the ice cream before it melts. In \(\textsc {ltl}_{f}\), traces are finite. The end of a trace might correspond, e.g., to the termination of a program or the end of a data stream. Ending the trace at the point where the ice cream melts allows for a simple framing of this property:

figure e

By contrast, ltl requires a larger formula with a new variable (m: ice cream has melted) and a gadget to encode a prefix of an infinite trace.

figure f
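The two formulas appear in the figures above. For intuition, one plausible pair of encodings is the following (our illustration, writing \(eat\) for the philosopher eating and \(m\) for the ice cream having melted; the figures give the actual formulas):

\(\textsc {ltl}_{f}\): \(F(eat )\)

\(\textsc {ltl}\): \(F(m) \wedge G(m \Rightarrow G(m)) \wedge (!m ~U~ (eat \wedge !m))\)

Here \(F(m)\) and the until operator form the prefix gadget: the pre-melt prefix is finite, melting is permanent, and eating must occur strictly within the prefix.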

2.2 Toward a Concept Inventory

This paper is part of a larger effort to create a set of concept inventory test instruments for ltl, \(\textsc {ltl}_{f}\), and related logics. Our guiding example is the Force Concept Inventory for teaching physics [39, 40], a multiple choice test in which every incorrect choice is carefully designed to match one specific misconception. Unless test-takers select the wrong choice by mistake, their results strongly suggest which concepts they need to review. We are developing test instruments that use a variety of question types to identify the misconceptions that a temporal logic concept inventory should cover.

In a perfect world, every course subject would come with a concept inventory. However, developing an inventory takes several rounds of careful study (e.g., via think-aloud interviews) to identify misconceptions and reliably pinpoint them among test-takers [1, 63]. One impediment to development is the expert blind spot [51, 52]; namely, that test designers overlook concepts that learners struggle with. Our Spreading X misconception (Sect. 7.6), for example, is an issue that we were blind to.

This paper builds on prior ltl instruments [35, 58] that employed a learner-driven tool called Quizius [59] to reduce the up-front cost of discovering misconceptions. Prior work [35] refined the instruments through three post-Quizius surveys, finding support for some potential misconceptions and discarding others. This paper represents a significant step forward in the iterative development of concept inventories with four additional studies that find misconceptions in ltl and in the unexplored domain of \(\textsc {ltl}_{f}\).

3 Instrument Design

Fig. 2. Example questions

This section describes the design of our study instruments. Complete instruments are in the artifact for this paper [34]. We contribute two instruments: a finite-trace instrument that contrasts \(\textsc {ltl}_{f}\) with ltl and an introductory instrument that assumes only minimal knowledge of ltl. The instruments are based on prior ltl work [35], reusing questions and question types that have proven effective in the past. The questions use simple state spaces with three on/off features such as the 3-color panel in Fig. 2.

The central question types ask about informal-to-formal translations:

  • Describe Formulas  (Fig. 2a): Given an ltl or \(\textsc {ltl}_{f}\) formula, translate it to an English-language description. This task is similar to what a person does when reading a specification and deciding whether it is correct.

  • Write Formulas  (Fig. 2b): Given an English statement, translate it to ltl and/or \(\textsc {ltl}_{f}\) or say that it is inexpressible. This is the key skill for doing formal verification. (“there must be a [informal-to-formal] transition” [26]).

Three other question types address specific goals. One type, Trace Matching, is from prior work [35]. The other two expose differences between ltl and \(\textsc {ltl}_{f}\).

  • Trace Matching  (Fig. 2c): Given a formula and a trace, mark the trace as either satisfying or violating. These questions test for specific, semantic misunderstandings. All traces were either finite or repeated the final state.

  • Explain Mismatches  (Fig. 2d): Given an \(\textsc {ltl}_{f}\) formula and a finite trace that violates the formula, explain the reason for the mismatch. The instructions suggest four potential explanations: (1) only an infinite trace can satisfy the formula; (2) the trace is too long, i.e., the formula accepts no traces of this length; (3) the trace is too short; or (4) trace content mismatch, i.e., the wrong lights are on/off in some states. These questions serve as a tutorial on the mismatches that can arise in a finite-trace setting.

  • Check Equations  (Fig. 2e): Given an equation and a statement of its validity in ltl, determine whether it is valid in \(\textsc {ltl}_{f}\) for non-empty traces. These questions test general ways in which ltl and \(\textsc {ltl}_{f}\) formulas differ.

3.1 ltl\(_{f}\) Instrument

The finite trace instrument is designed for an ltl-aware audience. This instrument has five parts, corresponding to the five question types above but arranged in order of difficulty rather than importance:

figure g

Part 1 functions as an \(\textsc {ltl}_{f}\) primer. It presents five mismatched formulas and traces and asks respondents to think critically about why the two disagree. For example, the trace in Fig. 2d is rejected by the formula \(F(red )\) because it has no red states. Respondents who expect F to accept an empty trace (similar to weak next) may be able to use this example to correct their misconception.

Parts 2, 3, and 4 appear in order of increasing difficulty so that respondents can build confidence as they approach the harder questions. There are six Trace Matching questions, four Describe Formulas questions, and five Write Formulas questions. The translation questions each ask about ltl and \(\textsc {ltl}_{f}\): respondents must provide two formulas (or two descriptions), or write “same” if the second would be identical. One question presents a formula that is insensitive to infiniteness [23], for which “same” is the correct response.

Part 5 presents three equations that are valid in ltl, such as \(!X(a)\,=\,X(!a)\), and one that is invalid in ltl: \(G(F(a))\,=\,F(G(a))\). Respondents must decide whether the equations are valid in \(\textsc {ltl}_{f}\).
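The first of these equations is interesting in the finite-trace setting because strong next fails at the last state, so a one-state trace separates the two sides. A quick check with the `holds` sketch from Sect. 2 (our illustration):

```python
# Assumes the `holds` LTL_f evaluator sketched in Sect. 2.
t = [set()]                           # a single state with a off
print(holds(('!', ('X', 'a')), t))    # True: no next state, so X(a) fails
print(holds(('X', ('!', 'a')), t))    # False: strong X demands a next state
```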

3.2 ltl Instruments

We used two instruments with students: a new introductory instrument, and the ltl instrument from prior work [35]. Both instruments have three parts:

figure h

Part 1 uses lasso traces where the last shown state repeats indefinitely. The state space is a locomotive with three features: engine smoke, a door, and a headlight. Parts 2 and 3 ask for translations to and from ltl.

(1)

The first instrument is intended for students who have no knowledge of temporal logic. It presents nine of the easy-to-answer Trace Matching questions, and only two Describe questions and two Write questions. Some of the trace questions match the same formula with different traces to home in on misconceptions. The translation questions intentionally do not ask about the until operator.

The second instrument is from prior work [35] with minor enhancements. It asks nine Match questions, five Describe questions, and five Write questions.

Table 1. Study contexts, number of respondents, and number of responses

4 Data

We deployed our instruments to four populations over three years. The finite-trace instrument went out to two semesters of students at a public UK university (\(\alpha \)’23, \(\alpha \)’24) and to the attendees of a symposium on \(\textsc {ltl}_{f}\) in artificial intelligence (FTAI—anonymized acronym). The introductory instrument was used in an embedded systems course at a private US university (\(\beta _{1}\), \(\beta _{2}\)). Between 18 and 24 respondents completed each instrument, and each participant contributed dozens of individual responses to the overall dataset. Table 1 provides the details. We hosted each instrument on Qualtrics.

4.1 Student \(\alpha \): 2023 and 2024

Populations \(\alpha \)’23 and \(\alpha \)’24 consisted of students enrolled in an elective course on self-programming agents, which is dedicated to various forms of \(\textsc {ltl}_{f}\) reactive synthesis and planning in the context of autonomous agents. Students can take this course in the final year of a BSc in computer science or during an MSc on Advanced CS. Both \(\alpha \) populations are similar and received comparable instruction, though we remark that the instructor joined the university in 2023. Early in the term, students received a lecture on ltl and completed the ltl instrument from prior work as a homework exercise. Shortly afterward, students received a lecture on \(\textsc {ltl}_{f}\) and completed the finite-trace instrument as homework. The ltl responses were of very high quality (92% correct in \(\alpha \)’23), so we analyze only the \(\textsc {ltl}_{f}\) responses in this paper.

The \(\alpha \)’23 instrument differs from the final, \(\alpha \)’24 instrument in two ways: the Explain Mismatches questions are multiple choice and there are three additional Check Equations questions (which did not lead to interesting incorrect responses). Free response is better for Explain Mismatches because it is less constraining. Respondents struggled when two choices might reasonably apply, and forcing them to choose was not helpful in our search for misconceptions.

4.2 FTAI: 2023

FTAI is our anonymized name for a symposium on finite-trace temporal logics for AI that was held in 2023. The event brought together world-class researchers with deep expertise in temporal logics including \(\textsc {ltl}_{f}\). Seventeen attendees (74%) self-reported AI as among their primary research areas, nine (39%) selected formal methods, and five (21%) selected machine learning. Eleven claimed to be knowledgeable in \(\textsc {ltl}_{f}\) specifically. All but a few attendees participated in person.

On the first day of the symposium, we presented (via Zoom) a brief introduction to our work on logic misconceptions and gave respondents 15 min to fill out the instrument. This introduction did not explain \(\textsc {ltl}_{f}\) semantics and it did not explain our question types; all instructions were in the instrument itself. Ten respondents completed the instrument in the allotted time. Eight respondents finished by the end of the conference. Six others finished later in Spring 2023; these may have been colleagues of symposium attendees, as we encouraged attendees to share the instrument link with their research groups.

Respondents in this study received only a subset of the \(\alpha \)’23 instrument to maximize the completion rate, which explains the relatively low number of responses in Table 1. They completed 3 out of 5 Explain Mismatches questions, 3 of 6 Trace Matching questions, 2 of 4 Describe Formulas questions, 2 of 5 Write Formulas questions, and 5 of 7 Check Equations questions—all selected uniformly at random by Qualtrics.

4.3 Student \(\beta \): 2022

Population \(\beta \) completed two instruments, \(\beta _{1}\) and \(\beta _{2}\), in the context of an elective undergraduate course on embedded systems taught at a private US university. The course has limited time to cover ltl-based model checking, making it critical to teach ltl quickly to students unfamiliar with temporal logic. In 2022, near the end of the semester, we assigned the introductory instrument as homework (\(\beta _{1}\)) without teaching ltl in lecture. Students had several days to read the course textbook [47] and submit. The next lecture featured ltl and assigned the full ltl instrument [35] as homework due the following week (\(\beta _{2}\)).

All homework in embedded systems was graded by participation. Furthermore, students were allowed to drop three homeworks during the term. We know from survey comments that at least two students were planning to drop an ltl homework, but since responses are anonymous and these comments appeared only in complete surveys, there is no reliable way to determine which of these students, if any, actually dropped an ltl homework.

5 Catalog Design

The catalog, or “code book” (in the qualitative analysis sense), is our rubric for temporal logic misconceptions. Figure 3 presents a short overview of the core semantic errors. Its aim is to provide just enough background for readers to understand our results in Sects. 6 and 7. The full catalog in our artifact comes with instructions showing how to apply the labels to new responses [34].

Fig. 3. Brief summary of misconceptions

In addition to the labels in Fig. 3, there are three meta labels: Precedence, RV, and Unlabeled. Precedence applies to responses that are ambiguous due to missing parentheses. RV stands for “Reasonable Variant,” and applies to written formulas that support an unintended reading of an English prompt. Unlabeled is for responses that contain several mistakes or otherwise defy categorization.

The Length and Last labels are new to this work and apply only to \(\textsc {ltl}_{f}\). Cycle G, Implicit Prefix, Trace-Split U, and Spreading X are also new and apply to both \(\textsc {ltl}_{f}\) and ltl. The other labels originate in prior work [35]. We developed the new labels by starting from the prior catalog and applying techniques from grounded theory [33] to discover categories of mistakes. Two authors worked as labelers. First, the labelers independently assessed sample responses using the baseline catalog. Coding happened in small sessions to minimize labeler fatigue. Second, the labelers met to identify patterns among responses that did not fit the current rubric. Third, the labelers used the standard Cohen’s \(\kappa \) score [17] to check agreement. This measure typically ranges from 0 to 1, where a score above 0.8 is considered excellent [61]. The labelers quickly reached a high score, perhaps due to the well-tested baseline catalog. Agreement details for each instrument follow:

  • Finite Trace: \(\kappa = 0.79\) after labeling 26 responses: 14 Write Formulas, 8 Describe Formulas, and 6 Check Equations.

  • Introductory: \(\kappa = 0.83\) after labeling 13 responses: 9 Write Formulas and 4 Describe Formulas.
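For reference, the agreement computation can be reproduced with any standard Cohen's \(\kappa \) implementation. Below is a toy example using sklearn's cohen_kappa_score; the label values and counts are hypothetical, made up for illustration:

```python
# Hypothetical labels for five responses from two labelers.
from sklearn.metrics import cohen_kappa_score

labeler_a = ["ImplicitF", "Length", "BadStateIndex", "Length", "RV"]
labeler_b = ["ImplicitF", "Length", "BadStateIndex", "ImplicitG", "RV"]
print(cohen_kappa_score(labeler_a, labeler_b))   # 0.75; above 0.8 is excellent
```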

6 Results: Incorrect Responses, Specific Errors

Our instruments collected a variety of errors across the four populations. Table 2 presents the totals at a high level. Table rows correspond to question types (with abbreviated names, such as Explain for Explain Mismatches), and table columns name the instrument deployments. Each cell counts the number of incorrect responses (not the number of respondents who contributed these responses) and reports it as a percentage of the total responses for that particular instrument and question type. Be advised that percentages are not comparable across columns because the number of questions in each part varies across deployments; for example, Check Equations has 7 questions in \(\alpha \)’23 and 4 in \(\alpha \)’24.

Table 2. Total incorrect responses

The main takeaway from Table 2 is that every question type attracted some incorrect responses, and some attracted quite a few (over 20%). Trace Matching was the easiest question across the board and Write Formulas was the hardest; even the FTAI respondents submitted a fair number of incorrect formulas. Students in \(\beta _{1}\) submitted many incorrect responses. At a glance, it would seem that the \(\beta _{2}\) responses are only marginally better percentage-wise, but there were nearly twice as many translation questions in the \(\beta _{2}\) instrument and they were more difficult; the small percentage improvement is encouraging.

Table 3. Errors in incorrect responses (one response may match several labels)

Each incorrect response may correspond to zero or more misconceptions in our catalog, depending on why it is incorrect. Table 3 presents the catalog classification of the incorrect responses. The columns are grouped by three question types: Trace Matching, Describe Formulas, and Write Formulas. We discuss the other question types in prose below. Within each question type, columns correspond to deployments. The rows are labels from the catalog. Each cell counts the number of incorrect responses; we use a dash (-) rather than a zero to make the nonzero numbers easier to see.

Every core label has at least some support from the responses, with Bad State Index, Implicit F, and Implicit G being among the most popular. The Weak U label has low numbers, but these came primarily from a Trace Matching question that specifically tests this issue; the fact that even one FTAI participant made this mistake is noteworthy. Issues with trace length constraints (Length) are common in \(\textsc {ltl}_{f}\); see Sect. 7 for examples. Lastly, the low numbers for generic labels (Bad State Quantification and Other Implicit) and for reasonable variants (RV) suggest that the revised catalog is better at pinpointing issues and that the revised instruments are clearer to respondents.

We report some negative findings as well. Two labels, Cycle G and Trace-Split U, have little support overall and warrant targeted testing in the future. Unlabeled is unfortunately common, which suggests a need for interviews to learn the reasoning behind deeply incorrect responses. Some unlabeled responses in Table 3b do, however, have explanations. These are from respondents who were confused about ltl syntax, or who did not attempt the question.

Remaining Question Formats. The finite trace instruments include two question types that are not in Table 3a: Explain Mismatches and Check Equations. The incorrect Explain Mismatches responses are all Unlabeled; most of these are due to the multiple-choice ambiguity noted in Sect. 3.1. The incorrect Check Equations responses cannot be labeled definitively because these questions did not ask respondents to explain their reasoning (Fig. 2e). We merely note that the data suggests issues with Length, Other Implicit, and a weak notion of F. The weak-F responses incorrectly marked \(F(a)\,=\,a \vee X(F(a))\) as invalid in \(\textsc {ltl}_{f}\).
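The expansion law can be spot-checked by brute force over all non-empty traces up to a bound, reusing the `holds` evaluator from Sect. 2 (our illustration); the two sides never disagree, so it is indeed valid in \(\textsc {ltl}_{f}\):

```python
# Assumes the `holds` LTL_f evaluator sketched in Sect. 2.
from itertools import product

lhs, rhs = ('F', 'a'), ('|', 'a', ('X', ('F', 'a')))
print(all(holds(lhs, list(t)) == holds(rhs, list(t))
          for n in range(1, 5)
          for t in product([set(), {'a'}], repeat=n)))   # True: no disagreement
```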

7 Results: Categories of Errors

We turn now to the actual survey responses that support the new categories of errors; namely, the two \(\textsc {ltl}_{f}\) labels and four additional ltl labels. To ground the discussion, the subsections below present actual instrument questions (“Q”) and representative sample responses (“WA” for “wrong answer”). We also discuss how tools might use our findings to provide feedback.

Certain questions appeared only in the finite-trace instruments, and others only in the introductory instruments; these are noted below. Also, to streamline the presentation, we have translated the introductory-instrument responses to use colors instead of locomotive characteristics (compare Fig. 2 and Eq. (1)).

7.1 Length (ltl\(_{f}\) only)

The Length label applies to responses that require too many or too few states. When writing an \(\textsc {ltl}_{f}\) formula, this issue can arise from the use of strong next instead of weak next. Tools might help by reporting the trace length(s) that a formula accepts.

  • Q. Describe the \(\textsc {ltl}_{f}\) formula \({red \,\wedge \,!X(blue )}\).

  • WA. “The first state must be red and the second state must not be blue.”

    This answer implies that a second state must exist, but the formula does not. There are four responses of this sort in the dataset: two in \(\alpha \)’23, one in \(\alpha \)’24, and one in FTAI.

  • Q. Describe the \(\textsc {ltl}_{f}\) formula \(G(red \Rightarrow X(!red ~\wedge ~X(red )))\).

  • WA. “For every state, if there is a red light on, the next state is with the red light off, and the state afterward is with the red light on. The trace must have at least have 3 states.”

    No finite trace with a red light can satisfy this formula, as every red light demands another two states later. There are seven responses of this sort: five in \(\alpha \)’23 and one each in \(\alpha \)’24 and FTAI.

  • Q. Write an \(\textsc {ltl}_{f}\) formula for: Blue is on in the first state, off in the second state, and alternates on/off for the remaining states.

  • WA. \(blue ~\wedge ~G(blue \Rightarrow X_W(!blue ~\wedge ~X_W(blue )))\)

    The prompt requires at least two states, but the formula accepts traces with only one blue state. Interestingly, this formula is correct in ltl using X instead of \(X_W\), which underscores the subtlety of \(\textsc {ltl}_{f}\). Eight \(\alpha \)’23, one \(\alpha \)’24, and zero FTAI responses made this error.
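A brute-force sketch of the length-reporting tool suggested at the start of this subsection, reusing the `holds` evaluator from Sect. 2 (our illustration). Applied to the wrong answer above, it reports that one-state traces are accepted, even though the prompt requires at least two states:

```python
# Assumes the `holds` LTL_f evaluator sketched in Sect. 2.
from itertools import product

def accepted_lengths(f, atoms=('blue',), max_len=5):
    """Trace lengths n <= max_len for which some trace of length n satisfies f."""
    states = [{a for a, on in zip(atoms, bits) if on}
              for bits in product([False, True], repeat=len(atoms))]
    return [n for n in range(1, max_len + 1)
            if any(holds(f, list(t)) for t in product(states, repeat=n))]

wa = ('&', 'blue',
      ('G', ('=>', 'blue', ('XW', ('&', ('!', 'blue'), ('XW', 'blue'))))))
print(accepted_lengths(wa))   # [1, 2, 3, 4, 5] -- length 1 should be impossible
```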

7.2 Last (ltl\(_{f}\) only)

The Last label applies to responses that attempt to encode a final state in infinite-trace ltl instead of saying that the prompt is inexpressible. All such responses stem from one formula-writing question.

  • Q. Write (if possible) an ltl formula for: Green is on in the final state.

  • WA. \(F(G(green ))\)

While this response is correct for \(\textsc {ltl}_{f}\) and is syntactically valid ltl, it is trying to answer an impossible question. There are six responses of this sort: one from \(\alpha \)’23, five from \(\alpha \)’24, and zero from FTAI.

7.3 Cycle G

In ltl and \(\textsc {ltl}_{f}\), the G operator imposes a constraint on every state. Yet, some responses expect G to constrain one state, skip a few states, and reapply later. The skipped states are precisely those captured by occurrences of X within the G operand. A tool might help by highlighting atom constraints at each time index (in the following example, index 2 would show a contradiction).

  • Q. Write an ltl formula for: Blue is on in the first state, off in the second state, and alternates on/off for the remaining states.

  • WA. \(G(blue ~\wedge ~X(!blue ))\)

    This formula is unsatisfiable because it requires blue to be both on and off in the second state. There are four responses of this sort, two from \(\alpha \)’24 and two from FTAI. However, we must caution that these responses came from only two people who made the mistake consistently in ltl and \(\textsc {ltl}_{f}\).
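The unsatisfiability claim is easy to confirm mechanically; a brute-force check up to length 5, reusing the `holds` evaluator from Sect. 2 (our illustration):

```python
# Assumes the `holds` LTL_f evaluator sketched in Sect. 2.
from itertools import product

wa = ('G', ('&', 'blue', ('X', ('!', 'blue'))))
print(any(holds(wa, list(t))
          for n in range(1, 6)
          for t in product([set(), {'blue'}], repeat=n)))   # False: unsatisfiable
```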

7.4 Implicit Prefix

The baseline catalog contains a generic label Other Implicit for responses that accept too many traces but do not fall under a more precise category. One such response from FTAI describes \(G(red \Rightarrow X(!red \,\wedge \,X(red )))\) as “whenever red holds, it also holds two steps later,” leaving the middle state underconstrained.

The Implicit Prefix label narrows the scope of Other Implicit. It applies to responses that correctly describe the suffix of valid traces but leave the prefix underconstrained. It does not apply to the example in the previous paragraph. Tools might help by showing example traces; for instance, traces with early states that satisfy some but not all constraints under an F may be informative.

  • Q. Write an ltl formula for: Red is on exactly once.

  • WA. \(F(red ~\wedge ~X(G(!red )))\)

This formula describes a suffix in which red is on at one state and turns off afterward, but it does not prevent red from turning on before this point. There are 24 responses of this sort: eight each from \(\alpha \)’23 and FTAI, and four each from \(\alpha \)’24 and \(\beta _{2}\). The finite-trace respondents made this mistake consistently in ltl and \(\textsc {ltl}_{f}\), so the total in terms of people is only 14. (A counterexample trace appears in the sketch after this list.)

  • Q. Write an ltl formula for: Green is on for zero or more states, then turns off and remains off in the future.

  • WA. \(G(F(!green ))\)

    Whereas the specification asks for green to stay on until it turns off, the formula allows green to turn on and off before reaching a non-green suffix. There are four responses of this sort in \(\beta _{2}\). This question is not in the finite-trace instruments because it does not contrast ltl and \(\textsc {ltl}_{f}\).
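To illustrate the first wrong answer concretely, the following check exhibits a trace with two red states that the formula nonetheless accepts. It reuses the `holds` evaluator from Sect. 2, evaluating in \(\textsc {ltl}_{f}\), where the same issue arises:

```python
# Assumes the `holds` evaluator sketched in Sect. 2 (LTL_f semantics;
# the mistake carries over from ltl).
wa = ('F', ('&', 'red', ('X', ('G', ('!', 'red')))))
trace = [{'red'}, set(), {'red'}, set()]   # red is on twice, not exactly once
print(holds(wa, trace))   # True -- the prefix before the F is unconstrained
```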

7.5 Trace-Split U

Several responses use F and G in the left operand of an until, as in \(G(red )~U~blue \). These responses are usually incorrect. Some of them would be correct, however, if the left and right operands were interpreted on different parts of the full trace: a prefix on the left and a suffix on the right. (Interpreting on a prefix makes no sense in ltl, but is sensible in \(\textsc {ltl}_{f}\).) The Trace-Split U label captures these responses. Tools can help by reporting such nested operands as a U antipattern.

  • Q. Write an ltl formula for: Blue is on in at least two states.

  • WA. \(F(blue )~U~F(blue )\)

Any trace with one blue state satisfies the formula. There are two responses of this sort from FTAI and zero elsewhere. (See the sketch after this list.)

  • Q. Write an ltl formula for: Green is on for zero or more states, then turns off and remains off in the future.

  • WA. \(G(green )~U~G(!green )\)

    Although a natural-language reading of this formula sounds compelling (always green until always not green), the left G would entail a green light in every state. There are two responses of this sort in \(\beta _{2}\). This question is not in the finite-trace instruments because it does not contrast ltl and \(\textsc {ltl}_{f}\).
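The first wrong answer above can be checked mechanically; reusing the `holds` evaluator from Sect. 2 (our illustration), a trace with a single blue state satisfies it because the until is discharged immediately at position 0:

```python
# Assumes the `holds` evaluator sketched in Sect. 2.
wa = ('U', ('F', 'blue'), ('F', 'blue'))
print(holds(wa, [set(), {'blue'}]))   # True, though only one state is blue
```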

7.6 Spreading X

The X operator targets one specific state whereas G, F, and U quantify over an unknown future. This difference is evidently confusing to beginners, as several of the \(\beta _{1}\) and \(\beta _{2}\) responses expect one X to constrain both the current state and the next state. With nesting, these responses expect a longer interval, e.g., three red states for \(X(X(red ))\). Prior work with novices observed this issue as well [58]. We did not find evidence for it in our earlier studies [35], so perhaps the misconception is easily corrected. Tools can help by reminding users that an n-fold composition of X constrains one state n steps ahead.

  • Q. Describe the ltl formula \(blue \Rightarrow X(X(X(blue )))\).

  • WA. “When the blue light is on, it will stay on for the next 3 states.”

There are three such responses. This question is only in the \(\beta _{2}\) instrument. (A sketch after this list shows that only the state three steps ahead is constrained.)

  • Q. Write an ltl formula for: Red cannot stay on for 3 states in a row.

  • WA. \(G(!X(X(X(red ))))\)

    There are eight such responses in \(\beta _{1}\), and three in \(\beta _{2}\). The finite-trace instrument does not include this question.
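A quick check makes the point that X targets exactly one state, reusing the `holds` evaluator from Sect. 2 (our illustration):

```python
# Assumes the `holds` evaluator sketched in Sect. 2.
f = ('=>', 'blue', ('X', ('X', ('X', 'blue'))))
print(holds(f, [{'blue'}, set(), set(), {'blue'}]))   # True: states 1, 2 are free
```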

8 Threats to Validity

Qualitative coding inherently comes with biases, and our high agreement scores do not prove that these have been excised. To mitigate this issue, our data is available for other researchers to audit. Another threat is that the sets over which we computed agreement are not large.

One author manually classified responses for correctness and may have mislabeled some, despite our auditing. Checking Write Formulas responses in particular might have been automated, but to lower the burden on respondents the survey did not enforce an ltl syntax. Thus, there are variations such as “or” versus “|” and “engine” versus “E” that we had to normalize manually. One response uses “next” (perhaps inspired by PSL weak next [28]) without specifying a strong or weak interpretation. This ambiguity is a threat; fortunately, the response in question is incorrect in the same way with X or \(X_W\). Operator precedence is another avenue for miscommunication; we assume, e.g., weak precedence for implication, but respondents may have had a different meaning in mind.

Regarding external validity, the two \(\alpha \) studies took place at the same institution with the same instructor. The \(\beta \) study used a different institution and student population, and although the results are comparable to \(\alpha \) they may not carry over to other populations, such as learners in industry. FTAI respondents were under time pressure due to the conference, and may have rushed through the more difficult translation questions.

Two question types require fluency in English. Although we did not specifically check for fluency, our respondents seem to meet this bar. Both universities that we worked with conduct all classes in English and expect a high degree of fluency. The FTAI symposium used English as well for all papers and talks. There were no indications of severe language issues in the responses.

Our instruments have limited ecological validity because they ask basic questions about a rudimentary state space. Practical uses of ltl would involve systems with interacting components, and users would have access to verification tools. Performing studies in a realistic setting is an important topic for future work.

9 Related Work

Design tools [15, 57], alternative languages and logics [6, 8, 28, 46, 66], pattern languages [27, 36, 43, 50, 57], natural-language translators [12, 18, 30], and error checkers [9, 14, 41, 44, 54] all seek to improve the usability of temporal logics such as ltl. Yet none of these works study the misunderstandings of humans; at best, they address mistakes that a person might make.

Prior work on the Declare modeling language used think-aloud interviews to discover and validate errors [38]. Our work can help separate general ltl issues from Declare-specific issues. Other related user studies include two comparisons of ltl to similar logics [15, 19], and an interface design study [15]. While these studies target learners, the focus is not directly on logic misconceptions.

Our translation questions are similar to those from Iltis [31, 32], a tool for teaching logic. Iltis might serve as a framework for future studies, though it is aimed toward pedagogy rather than studies of misconceptions.

With the introductory instruments, we considered providing a link to Wickström’s ltl visualizer [54, 68]. We did not, due to concerns that misconceptions about the tool, which has not been validated, would be a confounding factor.

10 Looking Forward

We conducted a first study of \(\textsc {ltl}_{f}\) misconceptions in three populations with well-informed respondents, and studied ltl in two rounds with novices. The data offers insights into mis-specifications with two categories of \(\textsc {ltl}_{f}\)-specific mistakes, four new categories of ltl mistakes, and refined support for categories from prior work [35]. Given the very simple scenarios and formulas that we used, we suspect that many more issues lurk in more complicated settings.

Our work has obvious implications for learners and educators. We have already begun to employ its insights to create a new interactive learning environment called the LTL Tutor: https://www.ltl-tutor.xyz/. We have also had positive experiences in an undergraduate course on logical modeling [53] and in a graduate course on software verification. The instruments work well as an in-class activity followed by group discussion.

This work can also impact the design of future logics. Narrowly, it suggests different operator designs; broadly, it provides a methodology to identify misconceptions in the first place.

Finally, this work also has implications for tools that consume ltl or \(\textsc {ltl}_{f}\). Currently, tools assume that a logical utterance precisely captures the user’s intent, and verify, synthesize, or otherwise manifest exactly what was written. Our work can (and should!) be used to check for the presence of predictable errors, e.g., by checking that users really meant what they wrote (especially if they fall within a misconception category).