
1 Introduction

Formal specification languages are based on mathematical formalisms and are used to describe the expected behaviour of a software component. Formal specifications are increasingly embraced by software engineering professionals, namely in lightweight formal development techniques such as automated synthesis, testing, or monitoring. Moreover, they will possibly become even more relevant as advances in large language models push programming activities to higher levels of abstraction [29].

Alloy [12, 13] is a formal specification language that allows the automatic analysis of software design models with rich structure and behaviour. Due to its high level of abstraction, flexibility, and simplicity, Alloy is often used in introductory formal methods courses. Yet, studies show that novices, and even experienced professionals, struggle with understanding and writing Alloy specifications [17]. The Alloy4Fun [16] web platform was developed in this educational context to ease the sharing of specification challenges with auto-grading, supporting instructors in classes and allowing students to study autonomously. Intelligent tutoring systems (ITS) for programming have long relied on automated feedback to support students in large classes and outside the classroom. Alloy4Fun, like regular Alloy, is solver-based and provides feedback for incorrect specifications as graphical counter-examples. This is a popular feature among Alloy practitioners and could, in principle, act as hints that help students progress towards solving a challenge when learning autonomously. However, studies show that visual counter-examples have mixed results with novices [7, 8]. In fact, a recent user study [6] with different kinds of manually encoded hints concluded that only next-step hints, which highlight faults in incorrect specifications and provide tips on how to fix them, improved the immediate performance of the participants without jeopardizing learning retention.

Next-step hints are one of the most common feedback approaches in ITSs for programming [21]. A possible approach to generate such hints is through automated repair techniques: after repairing a faulty program into a correct one, a next-step hint can be obtained by comparing the two. One such technique has been proposed for Alloy [4], but it is only effective when students are already close to a correct specification, and the quality of the generated hints is unclear. An alternative approach is to rely on historical student submission data for the generation of hints, guiding the student towards paths that previously led to successful submissions. The expectation is that more understandable hints can be generated by mimicking successful peer behaviour.

This work proposes the first history-based hint-generation technique for Alloy, and presents its implementation as an extension to Alloy4Fun. Alloy4Fun was also designed to support research on formal methods education, and thus every interaction with the tool is anonymously recorded and made available to the instructors [16]. Based on this collected data, the proposed extension creates a directed graph encoding all attempts by previous students. Then, upon a hint request, it finds a path between the student submission and a solution using a customizable policy, and generates a next-step hint based on this path. The developers of Alloy4Fun maintain a publicly available dataset [15] of student attempts collected from their classes over the years. We relied on this dataset to evaluate our technique both for performance (effectiveness and efficiency) and for the quality of the hints (based on the opinions of experts on teaching Alloy). It achieved better results than state-of-the-art repair-based tools. Furthermore, it can generate timely feedback, which is especially important in the educational context since students might easily feel frustrated if hints take too long to generate.

The remainder of the paper is structured as follows. Section 2 provides a short introduction to Alloy education, and Sect. 3 describes techniques for hint generation and Alloy repair. Section 4 presents our solution and its implementation, which is evaluated in Sect. 5. Section 6 presents conclusions and future work.

2 Teaching Alloy with Alloy4Fun

Fig. 1. Social network model with specification challenges

The Alloy language is based on temporal relational logic, but for simplicity we restrict this presentation to the static subset of the language. Structure in an Alloy model is introduced through the declaration of signatures and fields, which can be restricted by multiplicity constraints and organized hierarchically. The upper part of Fig. 1 depicts the structure of a social network system, a simplified version of an exercise in the Alloy4Fun dataset [15]. A signature models users, with two binary fields relating each user with the set of users being followed and the set of posted photos, respectively. A second signature extends the one for users, denoting a subset of them. The signature for photos has a field that relates each photo to exactly one day, the day when it was posted; advertisements are a particular kind of photo, introduced by a sub-signature.

When validating a system design, one would impose additional restrictions over this model using temporal relational logic through facts. To promote maintainability, reusable formulas and expressions can be introduced through predicates and functions, respectively. Then run and check commands would be defined to animate the model or verify desirable properties, respectively. Commands are automatically executed by the Alloy Analyzer within a given bound for the universe. When teaching Alloy, a typical kind of challenge presented to students is to encode some of these logical constraints.
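
Since Fig. 1 is rendered only as an image, the following Alloy sketch illustrates the kind of declarations and commands involved; the signature and field names are illustrative assumptions and not necessarily those used in the actual exercise:

  sig User {
    follows : set User,  -- users being followed
    posts   : set Photo  -- photos posted by this user
  }
  sig Influencer extends User {}
  sig Photo {
    date : one Day       -- each photo is posted in exactly one day
  }
  sig Ad extends Photo {}
  sig Day {}

  -- an additional restriction imposed through a fact
  fact { all u : User | u not in u.follows }

  -- animate the model within a bound of 3 atoms per signature
  run {} for 3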

Fig. 2. Incorrect submission to a challenge in Alloy4Fun

With this in mind, Alloy4Fun introduced the concept of model secret, allowing such challenges to be auto-graded [16]. Instructors write an oracle as a secret predicate and then use the Analyzer to check whether a student submission is equivalent to it. Two examples are shown at the bottom of Fig. 1. The student is asked to write, in an initially empty predicate, the constraint “every photo is posted by one user”. Hidden from the student through a secret annotation, a second predicate specifies a possible solution: for every photo, there is exactly one user related with it through the posting field. A check command simply tests whether the student specification and the oracle are equivalent (with at most 3 atoms in each signature). Being a semantic test, a correct submission can be syntactically different from the oracle. A single Alloy4Fun model (which we call an exercise) can contain multiple challenges; the one in Fig. 1 has 2.
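
A minimal sketch of how such a challenge could be encoded, reusing the illustrative names above; in Alloy4Fun, secret paragraphs are marked with a special comment annotation, written here as //SECRET:

  -- challenge: "every photo is posted by one user" (empty predicate to be filled in)
  pred everyPhotoPosted {

  }

  //SECRET
  pred everyPhotoPostedOracle {
    all p : Photo | one posts.p   -- exactly one user is related to p through posts
  }

  //SECRET
  assert sameAsOracle { everyPhotoPosted iff everyPhotoPostedOracle }
  //SECRET
  check sameAsOracle for 3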

If the property stated by a check command does not hold, the Analyzer (and Alloy4Fun) returns a graph-shaped counter-example where the equivalence fails. The user can navigate through alternative counter-examples and customize the visualization for better comprehension. As an example, Fig. 2 shows the student view of the exercise from Fig. 1 (i.e., secrets are hidden), where the student submitted an incorrect attempt to one of the challenges and a counter-example was returned. In principle, counter-examples are helpful when debugging specifications, but studies show they are not the most adequate feedback for novice users [6].

Alloy4Fun collects anonymous data from all user interactions. Whenever a student executes a command, it stores information such as the full model, the selected command and its outcome, and the identifier of the model it derived from. The resulting derivation tree allows the reconstruction of student paths by identifying sequential attempts to the same challenge. The already mentioned dataset [15] collects this data for various editions of formal methods courses at the Universities of Minho and Porto, Portugal, between the Fall of 2019 and the Spring of 2023, totalling about \(100\,000\) models.

3 Automatic Hint Generation

Next-step hints. Although next-step hints are a popular kind of feedback in ITSs, there are some concerns that such hints may be counter-productive, namely due to hint abuse and avoidance [1], or because they show students ‘how’ to fix a problem rather than ‘why’ it is a problem [18]. Nonetheless, studies [10, 14, 25, 26] suggest that next-step hints have no impact on long-term learning retention but often improve immediate performance, enabling students to learn more efficiently. A recent study on Alloy reached similar conclusions [6]. Moreover, there is an indication that, when accompanied by prompts for self-explanation, such hints may improve learning retention [20], although these results could not be replicated [19].

There are several techniques to automatically generate a next-step hint from an incorrect submission [21]: searching for steps that take the student closer to a reference solution, using previous successful submissions by peers, identifying known patterns in the incorrect submission, or trying to repair the submission so that it passes an oracle. Repair-based approaches have been proposed for Alloy, which we discuss below. However, these are often affected by scalability issues, and it is unclear how to select high-quality hints from alternative repair suggestions. In contrast, data-driven approaches do not suffer from performance issues and may generate more intuitive hints, since they are based on historical submissions. The tradeoff is that they may be ineffective in large solution spaces or in assignments with small historical logs. We are not aware of such techniques for specification ITSs, so we discuss them in the context of programming ITSs next.

Data-driven hint generation. The first data-driven hint generation approach was proposed in the context of a logic-proof tutoring system [2]. It has since been adapted to platforms for programming [11, 23, 28], although not for specifications, as far as we are aware. The main idea behind these approaches is to use historical student submissions to build a graph of all traversed solution paths. Each node in the graph is the AST of a submitted attempt in a student path, and each transition registers the sequence of edit actions that leads from one submission to the next. To build the hint graph, all student paths are combined into a single graph by matching identical submissions, keeping the popularity of each state and/or transition, and marking correct submissions as goal states. When a student asks for a hint, if the current state is present in the hint graph, the system calculates the optimal path towards a correct solution and generates a hint from it. In [2] Markov Decision Processes (MDP) were used to calculate the optimal path, but various other policies have since been proposed [22, 24]. Studies have used expert input to evaluate the quality of the hints resulting from different policies [22, 24].

The main challenge for this kind of approach is the size of the solution space. Besides the obvious issue of assignments with little historical data, the solution space of expressive programming languages is so large that hits in the graph may be unlikely even with substantial historical data. Several approaches have been explored to address this, such as creating intermediate states [28], using program outputs rather than the actual AST as graph states [11], or employing canonicalization techniques to group semantically equivalent ASTs in the same graph state [27].

Automated Alloy Repair. Automated program repair techniques generate fixes for programs that fail to pass a certain oracle. In education, this oracle can be written by the instructor, as either a reference solution or a suite of tests, and then used to generate hints that fix student submissions. Some automated repair techniques have been proposed for Alloy specifications [3, 4, 30, 31].

ARepair [30] was the first repair technique for Alloy, using test cases as the oracle. This makes it prone to overfitting, generating fixes that pass the tests but still break the expected properties. Moreover, Alloy models are typically not accompanied by test cases. In contrast, BeAFix [3] uses check commands as oracles. This is more natural in Alloy (and in Alloy4Fun challenges), since models are typically accompanied by commands defining expected properties. Unfortunately, the pruning techniques proposed to improve its performance rely on multiple commands and suspicious locations, and are not effective for simple Alloy4Fun specification challenges. TAR [4] was developed for the educational context and integrated into Alloy4Fun. It is focused on producing timely feedback to avoid student frustration (and on supporting the temporal aspects of Alloy 6). Its pruning technique evaluates previously seen counter-examples to avoid costly calls to the solver. It was shown to considerably outperform ARepair and BeAFix within a 1-minute timeout, but it is infeasible for specifications far from a correct solution. ATR [31] is another technique to repair Alloy 4 specifications with commands as oracles. Although developed independently from TAR, it also uses counter-examples (and the closest valid instances) to avoid calls to the Analyzer. ATR was shown to outperform ARepair and BeAFix in repair rate, and to be more efficient than BeAFix.

4 Hints from Historical Alloy Data

The proposed technique adapts existing data-driven hint generation techniques for programming. Using Alloy4Fun historical data, it creates a graph that captures students’ progress when solving a challenge, which is then used to generate hints for future students. This section describes the technique and its implementation, whose overview is presented in Fig. 3.

Fig. 3. Overview of the approach when submissions are present in historical data

4.1 Hint Graph Construction

To generate hints, our approach relies on a graph of student submissions for each specification challenge, created from an Alloy4Fun dataset. These graphs are created offline and can be rebuilt from time to time as new data is collected. Each node in the graph is a normalized formula previously submitted by a student, labelled as correct or incorrect, and each edge represents a transition between two submissions. Each formula is unique in the graph, so identical submissions are merged, and the frequencies of nodes and transitions are registered to be used in the pathfinding step. Formula comparison is performed at the AST level, so syntactically incorrect entries in the dataset are disregarded. As seen in Sect. 2, an Alloy4Fun exercise may contain multiple challenges, so the derivation tree must be split per challenge. The Alloy command called by each entry identifies the corresponding target challenge. To exactly identify the student submission and avoid considering the oracle as part of the graph state, we assume that each challenge command calls an initially empty predicate to be filled by the student, as exemplified in Fig. 1; the formula for each node is extracted from the content of that predicate. When extracting submissions to a certain challenge and removing syntactically invalid formulas, the pointers to parent submissions must be updated accordingly to preserve the student paths.
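
As a rough illustration of this construction, the Java sketch below merges normalized student paths into a per-challenge graph. The types and method names are placeholders; the actual implementation works over the Alloy Analyzer AST and the Alloy4Fun dataset schema.

  import java.util.*;

  // Sketch of per-challenge hint-graph construction from student paths.
  final class HintGraph {
      static final class Node {
          final String formula;          // normalized formula (canonical form)
          boolean correct;               // goal state?
          int frequency;                 // submissions merged into this node
          final Map<String, Integer> out = new HashMap<>(); // target formula -> transition count
          Node(String formula) { this.formula = formula; }
      }

      final Map<String, Node> nodes = new HashMap<>();

      Node state(String canonicalFormula, boolean correct) {
          Node n = nodes.computeIfAbsent(canonicalFormula, Node::new);
          n.frequency++;
          n.correct |= correct;          // mark correct submissions as goal states
          return n;
      }

      // A path is the sequence of (already normalized) submissions of one student
      // to one challenge, ordered by derivation; verdicts mark the correct ones.
      void addPath(List<String> formulas, List<Boolean> verdicts) {
          Node prev = null;
          for (int i = 0; i < formulas.size(); i++) {
              Node cur = state(formulas.get(i), verdicts.get(i));
              if (prev != null && !prev.formula.equals(cur.formula))
                  prev.out.merge(cur.formula, 1, Integer::sum);
              prev = cur;
          }
      }
  }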

For improved efficacy (i.e., the probability of a submission having a match in the graph), we apply a few of the canonicalizations specified in [27] that are sensible in the Alloy context, such as sorting the operands of commutative operations and normalizing the direction of comparisons. Additionally, since quantified variables in Alloy cannot be inlined, we apply variable anonymization. The same transformations are applied to submissions whenever a hint is requested. Note that we do not want to abuse canonicalization and end up with hints for a formula that differs too much from the concrete student submission. So, for example, we do not propagate negations using De Morgan's laws.
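
The sketch below conveys the flavour of these canonicalizations over a toy expression type, not the actual Alloy AST; the operator set and quantifier encoding are simplifying assumptions.

  import java.util.*;

  // Toy canonicalization: sort operands of commutative operators and rename
  // quantified variables to a canonical form (v0, v1, ...).
  final class Canonicalizer {
      record Expr(String op, List<Expr> children) {
          static Expr leaf(String name) { return new Expr(name, List.of()); }
      }
      private static final Set<String> COMMUTATIVE = Set.of("and", "or", "+", "&", "=");

      static Expr canonicalize(Expr e, Map<String, String> renaming) {
          if (e.op().equals("all") || e.op().equals("some")) {   // first child is the bound variable
              String var = e.children().get(0).op();
              Map<String, String> inner = new HashMap<>(renaming);
              inner.put(var, "v" + renaming.size());
              List<Expr> cs = new ArrayList<>();
              cs.add(Expr.leaf(inner.get(var)));
              for (Expr c : e.children().subList(1, e.children().size()))
                  cs.add(canonicalize(c, inner));
              return new Expr(e.op(), cs);
          }
          List<Expr> cs = new ArrayList<>();
          for (Expr c : e.children()) cs.add(canonicalize(c, renaming));
          if (COMMUTATIVE.contains(e.op()))
              cs.sort(Comparator.comparing(Expr::toString));     // canonical operand order
          return new Expr(renaming.getOrDefault(e.op(), e.op()), cs); // rename bound variable uses
      }
  }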

Fig. 4. A sample derivation tree with 3 paths for the exercise in Fig. 1

Fig. 5. Hint graphs resulting from the derivations in Fig. 4

To illustrate this process, consider the derivation tree in Fig. 4, which could have been collected from the exercise in Fig. 1 (signature and field names abbreviated). It contains 3 paths, with interleaved correct and incorrect attempts to both challenges. The target challenge in each state is the one not greyed out; green and red nodes represent correct and incorrect submissions, respectively; and the blue node is the root model shared by the instructor. This results in the two graphs in Fig. 5, with node and transition frequencies indicated by line thickness. Notice the normalization before merging, in this case just the renaming of quantified variables. Notice also that there may be more than one semantically equivalent valid solution per challenge.

4.2 Finding the Optimal Next State

The hint generation algorithm runs on demand when a student requests a hint. After locating the student's submission in the hint graph of the target challenge (the current state), the algorithm searches for the optimal path, according to the defined criterion, from it to any correct formula (the goal state). The first edge of this path indicates the transition the student should make to progress toward the goal; its target is the next state, which will be used to create the hint.

As discussed in Sect. 3, several criteria have been proposed to define the optimal path. Our goal was to keep the pathfinding process as general as possible, so we allow the instructor to define the desired policy. This is done through the definition of a weight function on the edges of the graph, built from a set of available attributes. These attributes may be data-driven, namely the (relative) popularity of the edge in the source state and the popularity of the source and target states, but also syntactic, namely the complexity of the edge transformation and of the source and target formulas. The complexity of a state is given by the size of the respective AST. For the complexity of an edge, recall that a transition between states may encompass several actions between two successive submissions from the student. We measure the complexity of an edge as the tree edit distance (TED) between the two states, calculated using the state-of-the-art APTED algorithm.
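
The following sketch illustrates how such a policy could be expressed as a weight function over the available edge attributes. The attribute and policy names are illustrative assumptions that only loosely follow the notation used in the evaluation (Sect. 5).

  // Illustrative edge attributes and configurable weight policies.
  final class EdgeAttributes {
      double edgeFrequency;      // how often this transition was taken
      double relEdgeFrequency;   // frequency relative to all outgoing edges of the source
      double sourceFrequency;    // popularity of the source state
      double targetFrequency;    // popularity of the target state
      double edgeComplexity;     // tree edit distance (TED) between source and target
      double sourceComplexity;   // AST size of the source formula
      double targetComplexity;   // AST size of the target formula
  }

  @FunctionalInterface
  interface WeightPolicy {
      double weight(EdgeAttributes a);
  }

  class Policies {
      // minimize the complexity of each step
      static final WeightPolicy CMP_E = a -> a.edgeComplexity;
      // prefer transitions that were popular among previous students
      static final WeightPolicy FRQ_E = a -> 1.0 / (1.0 + a.edgeFrequency);
      // a combined policy mixing syntactic and data-driven attributes
      static final WeightPolicy MIXED = a -> a.edgeComplexity / (1.0 + a.edgeFrequency);
  }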

Given the weight function on edges, the optimal path is calculated through a simple shortest path algorithm for weighted graphs.
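
A minimal sketch of this step, assuming edges have already been assigned weights by the chosen policy: for every state, it computes the first edge of the cheapest path to any correct (goal) state by running Dijkstra's algorithm backwards over incoming edges. All names are illustrative.

  import java.util.*;

  // Sketch: optimal next state for every node of the weighted hint graph.
  final class NextStepIndex {
      record WeightedEdge(String from, String to, double weight) {}

      static Map<String, String> computeNextStates(
              Map<String, List<WeightedEdge>> incoming,   // target state -> incoming edges
              Set<String> goals) {                        // correct (goal) states
          Map<String, Double> dist = new HashMap<>();
          Map<String, String> next = new HashMap<>();     // state -> optimal next state
          Comparator<String> byDist =
                  Comparator.comparingDouble(s -> dist.getOrDefault(s, Double.MAX_VALUE));
          PriorityQueue<String> queue = new PriorityQueue<>(byDist);
          for (String g : goals) { dist.put(g, 0.0); queue.add(g); }
          while (!queue.isEmpty()) {
              String state = queue.poll();
              for (WeightedEdge e : incoming.getOrDefault(state, List.of())) { // e.to() == state
                  double candidate = dist.get(state) + e.weight();
                  if (candidate < dist.getOrDefault(e.from(), Double.MAX_VALUE)) {
                      dist.put(e.from(), candidate);
                      next.put(e.from(), state);          // best known next step from e.from()
                      queue.remove(e.from());             // decrease-key by remove + re-insert
                      queue.add(e.from());
                  }
              }
          }
          return next;
      }
  }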

4.3 Hint Message Generation

Fig. 6. Example of AST edit operations.

The next-step hint is generated from the optimal path. We consider two aspects to create the hint message: how far the student is from the optimal solution, based on the TED between the current and the goal states; and the sequence of edit operations between the current and the next states. To calculate this sequence, we use an implementation of GumTree [9], which computes a mapping between AST nodes and uses the Chawathe et al. [5] algorithm to derive the edit sequence. The result is a sequence of operations that insert, delete, or move nodes, or update a node's label. Since there may be dependencies between these edit operations, we currently select the first operation of the sequence for the hint. To translate an edit operation into a hint, we use a message template for each operation type. The messages try to simulate what a teacher would say to a struggling student, and contain placeholders for operator-specific information that can be tailored for the Alloy language.
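
For illustration, the sketch below shows one possible way to turn the first edit operation into a message from per-operation templates. The wording, placeholders, and distance-based encouragement prefix are assumptions, not the exact templates used by the tool.

  import java.util.Map;

  // Illustrative message templates per edit-operation type.
  final class HintMessages {
      enum EditOp { INSERT, DELETE, MOVE, UPDATE }

      private static final Map<EditOp, String> TEMPLATES = Map.of(
          EditOp.INSERT, "Something seems to be missing. Try adding a %s (%s).",
          EditOp.DELETE, "It seems like you have unnecessary information in your expression. "
                       + "Try simplifying your expression by deleting the %s (%s).",
          EditOp.MOVE,   "The %s (%s) looks right, but it seems to be in the wrong place.",
          EditOp.UPDATE, "Try replacing the %s (%s) with something else."
      );

      /** distanceToGoal: TED between the current and the goal states. */
      static String render(EditOp firstOp, String nodeKind, String nodeLabel, int distanceToGoal) {
          String prefix = distanceToGoal <= 2 ? "You are almost there! " : "Keep going! ";
          return prefix + String.format(TEMPLATES.get(firstOp), nodeKind, nodeLabel);
      }
  }

For instance, render(EditOp.DELETE, "difference operator", "-", 4) yields a message similar to the example below.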

Consider, e.g., the transformation in Fig. 6, which turns an incorrect submission into a correct one. It requires 4 operations: moving a node up, deleting two nodes, and updating the label of another, resulting in a TED of 4. The resulting hint message looks like this: “Keep going! It seems like you have unnecessary information in your expression. Try simplifying your expression by deleting the difference operator (-).”.

4.4 Handling Missing Hits

A purely data-driven approach fails for formulas absent from the historical data. To improve efficacy, one can construct a path from a previously unseen state to one already in the graph. To this purpose, we enhance our data-driven approach with a mutation-based component. Whenever a requested formula does not exist in the graph, we generate variants according to a set of mutators. If a variant happens to already exist in the graph, a temporary edge from the current state to that variant is added with popularity 0, thus connecting the previously unseen formula to the graph and enabling the pathfinding procedure. These mutators, which may comprise multiple edit actions, represent typical high-level transformations applied to a formula. In particular, we rely on the mutators proposed by TAR [4], which were specifically designed for the Alloy language. Currently, this process is restricted to a single mutation to avoid reaching a path too distinct from the student submission.
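
A rough sketch of this fallback, with placeholder types and a first-match strategy that may differ in detail from the actual implementation (the real mutators operate on the Alloy AST and are those of TAR):

  import java.util.*;

  // Single-mutation fallback: generate variants of an unseen formula and attach
  // it to the graph if any variant is already a known state.
  final class MutationFallback {
      interface Mutator { List<String> variants(String canonicalFormula); }

      static Optional<String> attachToGraph(String unseen,
                                            Set<String> knownStates,
                                            List<Mutator> mutators,
                                            Map<String, Map<String, Integer>> edges) {
          for (Mutator m : mutators) {
              for (String variant : m.variants(unseen)) {       // depth 1 only
                  if (knownStates.contains(variant)) {
                      // temporary edge with popularity 0, enabling pathfinding
                      edges.computeIfAbsent(unseen, k -> new HashMap<>()).put(variant, 0);
                      return Optional.of(variant);
                  }
              }
          }
          return Optional.empty();
      }
  }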

4.5 Deployment in Alloy4Fun

Fig. 7. Hint provided for an incorrect submission to a challenge in the extended Alloy4Fun

The proposed approach was implemented as a REST service, and the Alloy4Fun platform was extended to use this service to automatically provide hints for challenge attempts. A new button was added to the interface that allows users to request a hint when an incorrect specification is submitted to a challenge. If the tool is able to generate a hint, it highlights a location in the editor and provides an explanatory message. This is shown in Fig. 7 for the example used in Sect. 4.3.

The service was implemented in Java, to take advantage of the Alloy Analyzer parser and AST, using the Quarkus framework. The hint graphs are stored in a new collection in the MongoDB database of Alloy4Fun. The weight function that determines the policy is provided through a JSON file that defines an arithmetic expression over the complexity and frequency attributes presented in Sect. 4.2.
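
For illustration, such a policy file could look like the following; the key names and expression syntax are assumptions, not the tool's actual configuration format:

  {
    "policy": "combined",
    "weight": "edgeComplexity / (1 + edgeFrequency)"
  }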

Although optimal paths could be calculated live from the graph whenever a hint is requested, in practice, to make hint generation as fast as possible, we pre-compute the optimal next state for every state of the graph offline. When a hint is requested, it is just a matter of fetching the next state from the graph.

Table 1. Statistics for the considered exercises
Table 2. Quantitative evaluation results, all times in seconds
Table 3. Incorrect specifications selected for the questionnaire

5 Evaluation

We evaluate the proposed hint generation technique quantitatively—addressing its effectiveness and efficiency—and qualitatively—comparing the generated hints with those suggested by experts. Specifically, we aim to answer the following research questions:

  • RQ1 How effective is the tool when a hint is requested, i.e., how often can it generate a hint?

  • RQ2 How efficient is it in the various steps of the process, i.e., how long does it take to construct the graph and to generate a hint?

  • RQ3 How does it compare with repair-based approaches?

  • RQ4 What is the quality of the generated hints, and what is the impact of the specified policy?

Table 4. Most popular answers by expert Alloy tutors

5.1 Quantitative Evaluation

For the quantitative evaluation, we applied our technique to the Alloy4Fun dataset [15], which contains data for multiple exercises (each with multiple challenges): about \(66\,000\) syntactically correct student submissions to 12 different exercises, collected over 4 years. Table 1 shows the number of challenges per exercise (Challs.) and the aggregated statistics. The dataset was split into a training subset to construct the graphs and a testing subset to evaluate the performance. We randomly split full paths in the dataset 70%/30% (rather than splitting individual submissions, since our approach is based on previously traversed paths). Each entry in the testing subset was then issued as a hint request to the purely data-driven technique, to the version that employs mutations for formulas absent from the graph, and to the existing repair-based approach TAR with a maximum search depth of 2. All tests were performed on a commodity Intel Core i5-13600KF with 32 GB of RAM. The timeout for requests was set to 1 minute, since timely feedback is critical in the educational context. Table 2 summarizes the results.

Regarding RQ1, Table 2 shows the hit rate (i.e., the percentage of specifications for which the tool was able to return a hint) for the purely data-driven and the mutation-enhanced versions. The hit rate of the former ranges from 19% to 56%, with an overall average of 39%. Interestingly, the exercises with the highest hit rates are not among those with the largest number of specifications in the historical log, which is possibly connected to the complexity of the challenges. Nonetheless, this hit rate will only increase as more submissions are collected for the exercises. Activating the mutation component for missed requests considerably increases the hit rate, to an average of 57%.

For RQ2, we start with the graph construction step. Table 2 aggregates the results for each exercise, namely the number of unique formulas resulting in graph states, and the time to construct the graphs (\(T_G\)) and to compute the optimal next states (\(T_P\)). The selected weight function did not affect the performance significantly (the shown values are for minimizing transition complexity). Results show that the whole process takes a few minutes for the exercises with more submissions, which is reasonable since this construction is expected to be performed sporadically offline. Regarding the hint generation step, Table 2 also shows the average time to generate a hint for both approaches (\(T_H\)). For the data-driven approach, this time is negligible for all exercises (recall that we pre-calculate the optimal next state offline). When enhanced with mutations, there is an expected increase in time, although still below 1 s on average. This makes the technique feasible for answering live hint requests.

Regarding RQ3, Table 2 also shows the hit rate and the time to retrieve a hint for TAR. Its hit rate is less predictable, ranging from 9% to 87%, with an average of 30%, well below our approach. Interestingly, the number of formulas for which both our data-driven approach and TAR can generate hints (Cmn.) is very small, suggesting that these approaches are complementary. As expected, TAR takes considerably longer to generate a hint, with an average of 27 s, since it is search-based and calls the solver to validate potential solutions.

5.2 Qualitative Evaluation

To evaluate the quality of the generated hints (RQ4), we asked experienced Alloy instructors how they would suggest a next-step hint for a set of incorrect specifications. For each of the two challenges from Fig. 1, we selected 3 frequently submitted incorrect specifications, shown in Table 3. We created a questionnaire that asked for hints in the shape of a target location and an edit operation (insertion, removal, or update). We sent the questionnaire to 12 Alloy instructors unrelated to this work, and received 8 replies. We observed that, except for one case (I1a), the experts did not all select the same next-step hint, highlighting the difficulty of automatically generating hints. Table 4 shows the most popular answers by the experts, both by location only and by the whole hint (i.e., location plus edit operation).

Our approach allows policies to be customized through weight functions. To compare the answers of the experts with the results of our approach, we designed a few simple weight functions, some considering only the complexity of nodes (\(Cmp_N\)) and edges (\(Cmp_E\)), and others only the frequency of nodes (\(Frq_N\)) and edges (\(Frq_E\)). We also considered a couple of policies that combine these syntactic and data-driven attributes. For this evaluation, we do not consider the mutation-enhanced version of the technique, as we intend to evaluate the quality of the data-driven approach. For each policy, we counted for how many of the 6 incorrect specifications the generated hint i) was selected by any expert, and ii) was among the most popular answers by the experts. We considered both matches on the identified location only and matches on the whole hint. Table 5 shows the results.

Interestingly, the results show that looking only at the complexity of the edges (TED) yields hints closer to the experts' than the purely data-driven policies. However, the best results are obtained when considering both kinds of attributes simultaneously: with \(Cmp_E\) and \(Frq_E\) combined, every generated hint was also suggested by some expert, and it was often one of the most popular.

Table 5. Matches between hints generated by policies and expert hints

6 Conclusion

This paper presented the first data-driven hint generation technique for ITSs for learning formal specifications, namely for the Alloy language, and its implementation in the Alloy4Fun platform. The data-driven technique is complemented with a mutation-based component to handle absences in the historical data. Our evaluation shows that our approach outperforms an existing repair-based technique, and that with the right policy the generated hints can emulate those provided by experts.

Our expert questionnaire included an open question, in which most experts suggested feedback in forms other than next-step hints, such as explaining the issue with the incorrect specification. Some studies suggest that next-step hints accompanied by self-explanations can improve learning [20], but studies also find that hints explaining issues are not well received by novices [6]. Further studies are needed on how to implement these effectively. On the other hand, the quantitative evaluation showed a small overlap between the cases successfully handled by the data-driven and the repair-based approaches, suggesting that hybrid approaches may be worth exploring.