1 Introduction

Effective requirements management ensures that software projects deliver products that meet customer needs and fulfill business goals [1, 17]. In agile software projects, requirements, typically in the form of user stories, are elicited and recorded in the product backlog early in the project for planning and prioritization. They are discussed and refined with the customers before and during implementation. The quality of user stories [1, 6, 7, 8, 10] directly influences the development cycle's velocity and the fulfillment of customer expectations. However, ensuring that user stories are complete, consistent, unambiguous, and testable, i.e. that they are good user stories, presents challenges.

As agile methodologies emphasize rapid iteration and adaptability, the potential of large language models (LLMs) to enhance user story analysis is becoming increasingly significant. The advanced natural language processing capabilities of LLMs offer promising potential for automating the improvement of user story quality. By refining and generating user stories, LLMs can provide substantive assistance to product owners, developers, test engineers, and other stakeholders in requirements management.

This study investigates the feasibility of leveraging LLM agents to automate the enhancement of user story quality within agile software development settings. We implement an Autonomous LLM-based Agent System (ALAS) for user story quality improvement and report on its initial deployment in a mobile delivery project at Austrian Post Group IT. By assessing the impact of ALAS on user story quality in six agile teams within the organization, our findings contribute to the emerging discussion around AI's role in agile software development, demonstrating a proof of concept for LLMs' potential to address complex industry challenges.

2 User Story Quality

User stories are short, abstract descriptions that define high-level requirements [5]. They serve mainly as anchors for future discussion and refinement in the software development process. A widely accepted template for user stories is: "As a [role], I want [requirement] so that [benefit]". This template captures the core elements: the intended user (role), the desired system functionality (requirement), and, optionally, the underlying rationale (benefit). Additionally, every user story should be accompanied by a set of acceptance criteria (AC) that outline the detailed conditions a user story must meet to be considered complete and acceptable, including functional behavior, business rules, and quality aspects to be tested. The AC make a user story more concrete and less ambiguous [5].

Writing good user stories is essential in software projects, as they convey the needs and perspectives of users and guide the development team in implementing the expected functionalities. Beyond general guidelines for quality in requirements engineering, such as ISO/IEC/IEEE 29148-2011 [1] and the IREB guidelines [8], various frameworks define criteria for assessing the quality of user stories. For example, the INVEST framework [3] is widely used in agile projects as a guideline for ensuring user story quality. It includes the attributes independent, negotiable, valuable, estimable, small, and testable. These attributes align with industry standards [1, 8], helping to ensure that user stories are concise, clear, and achievable, and that they contribute to the success of software development projects and positive user experiences.

Despite the widespread adoption of user stories and the available criteria for good user stories, methods for assessing and enhancing their quality are still relatively limited. Recent research has increasingly focused on leveraging LLMs to assist in requirements engineering tasks [11]. For example, White et al. [16] introduced a catalog of prompt patterns for stakeholders to interactively evaluate the completeness and accuracy of software requirements. Ronanki et al. [14] conducted a comparative analysis between ChatGPT-generated requirements and those specified by requirements experts from both academia and industry. The results revealed that LLM-generated requirements, while abstract, are consistently understandable. This indicates the potential of LLMs, like ChatGPT, to automate various tasks through their NLP capabilities. While interest in applying LLMs to engineering tasks is growing, research on their industrial implementation and performance evaluation remains limited. This gap motivates our study, which aims to connect the theoretical potential of LLMs with practical application and to gather feedback from industry professionals.

3 Implementing an Autonomous LLM-Based Agent System (ALAS)

An agent-based system's strength lies in the AI agents' ability to communicate and execute tasks, thereby facilitating the automation of software development tasks. Implementing such a system is a pivotal step in harnessing LLMs to assist with requirements management. Our Autonomous LLM-based Agent System (ALAS) was designed to automate AI agents' collaboration across various requirements management scenarios. The implementation includes two phases: task preparation and task conduction. An example of the two phases is illustrated in Fig. 2 in Sect. 4.

The task preparation phase aims at formulating prompts, enabling agents to understand their roles and expected contributions to the task. These prompts define the actions every agent is expected to perform at each step. There are two categories of prompts: initial prompts and follow-up prompts, as follows.

$$\begin{aligned} \text{Initial Prompt}_{i} &= \text{Profile}_{i} + \text{Task} + \text{Context} + \text{Subtask}_{i}, \quad (1 \le i \le k)\\ \text{Follow-up Prompt}_{i} &= \text{Subtask}_{i} + \text{Response}_{i-1}, \quad (i > k) \end{aligned}$$

where \(\text{Profile}_{i}\) is Agent \(i\)'s profile; Task is the task to complete; Context is the background information in which the task is situated; \(\text{Subtask}_{i}\) is subtask \(i\); and \(\text{Response}_{i-1}\) is the response produced after completing \(\text{Subtask}_{i-1}\).

The initial prompts (\(\text{Prompt}_{i}\), \(1 \le i \le k\)) are created by concatenating strings that describe an agent's profile, the overall task, its context, and the agent's first subtask, as indicated by the "+" in the equations. This ensures that the k participating agents understand both their individual roles and their expected contributions to the overall task. Once the initial prompts have familiarized the agents with the task and their roles, the follow-up prompts (\(\text{Prompt}_{i}\), \(i > k\)) are dynamically constructed by combining the specific subtask with the output from the previous subtask. This maintains continuity and coherence as the agents progress towards completing the overall task.
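To make this construction concrete, the following is a minimal sketch in Python of how the two prompt types defined above could be assembled by plain string concatenation; the function names and placeholder texts are illustrative and not taken from the ALAS implementation.

```python
# Minimal sketch (not the actual ALAS code) of assembling the two prompt types.

def initial_prompt(profile: str, task: str, context: str, subtask: str) -> str:
    """Initial Prompt_i = Profile_i + Task + Context + Subtask_i, for 1 <= i <= k."""
    return "\n\n".join([profile, task, context, subtask])


def follow_up_prompt(subtask: str, previous_response: str) -> str:
    """Follow-up Prompt_i = Subtask_i + Response_{i-1}, for i > k."""
    return "\n\n".join([subtask, previous_response])


# Hypothetical usage for the first agent (placeholder texts, not real project data):
prompt_1 = initial_prompt(
    profile="You are the product owner of a mobile delivery application ...",
    task="Improve the quality of the user story below.",
    context="MVP description and product vision statement ...",
    subtask="Subtask 1: Review the user story and identify quality issues.",
)
```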

Formulating and optimizing prompts so that agents communicate effectively and produce the desired output is an iterative process. Various prompt patterns and techniques can be applied, such as the persona pattern [16] to create a \(\text{Profile}_{i}\) for each agent, k-shot prompting [4] to provide instructions or examples of the desired output, AI planning [15] to generate a task breakdown and assign each \(\text{Subtask}_{i}\) to the responsible agent, and the fact checklist pattern [16] to verify the output.

In the task conduction phase, agents dynamically collaborate, using the prompts to guide their actions and execute subtasks. This is an iterative and incremental process, similar to the way agile teams work in software projects. Agents tackle subtasks sequentially by following the structured prompts. Embedding the previous response in the current prompt ensures that each agent's response is relevant and builds upon the previous work. This iterative collaboration resembles the daily stand-ups and sprint reviews in Scrum, where each team member's work is informed by the overall sprint progress. At the same time, the prompt structure ensures that the task evolves dynamically with each agent's previous response, reflecting the adaptive and responsive nature of an agile project, where plans and tasks are continuously refined based on ongoing feedback and developments. The final output is generated incrementally from the agents' responses.
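The sketch below illustrates this conduction loop under the prompt scheme from the task preparation phase. It assumes a plan of (agent name, subtask) pairs and an ask(agent_name, prompt) helper that queries the named agent's GPT model; these names are illustrative rather than part of the ALAS implementation.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative sketch of the task conduction loop; not the actual ALAS code.
def run_task(
    plan: List[Tuple[str, str]],        # ordered (agent_name, subtask) pairs
    initial_prompts: Dict[str, str],    # agent_name -> Profile + Task + Context + Subtask
    ask: Callable[[str, str], str],     # queries the named agent's GPT model
    k: int,                             # number of participating agents
) -> str:
    response = ""
    for i, (agent_name, subtask) in enumerate(plan, start=1):
        if i <= k:
            prompt = initial_prompts[agent_name]  # first k prompts introduce role, task, context
        else:
            prompt = subtask + "\n\n" + response  # later prompts embed the previous response
        response = ask(agent_name, prompt)        # each agent builds on the previous work
    return response                               # final output is produced incrementally
```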

4 Experiments

Following the implementation of ALAS, we evaluated its effectiveness in improving user story quality within agile teams at Austrian Post Group IT. The company has multiple teams working synchronously across numerous systems and applications orchestrated within Agile Release Trains [12]. User stories play an important role in planning and prioritizing the implementation of these systems, facilitating communication and collaboration across diverse teams. High-quality user stories are essential for successful development projects. Recognizing this criticality, we assessed the impact of ALAS on user story quality improvements.

4.1 Setting up Experiments

The experimental setup, i.e. the task preparation phase, is an iterative process of creating and refining prompts that describe the task and its context, define the agents' profiles, and plan the subtasks.

Task and Context of Task. The task aimed to enhance the quality of user stories for a mobile delivery project, ensuring they meet organizational standards and align with business objectives. Example user stories are available in the questionnaire (see footnote 1) used in the evaluation. These user stories required enhancement in clarity, completeness, correctness, consistency, and relevance to the application's overall functionalities. To provide the contextual background for the task, we added two documents: a minimum viable product (MVP) document that details the basic features of the mobile delivery application, serving as a blueprint to guide agents in refining user stories in a way that resonates with the core product features; and a product vision statement, structured using the NABC (Needs, Approach, Benefit, and Competition) value proposition template [2]. The latter provides a strategic overview of the application, addressing the client's needs, the proposed solution, client benefits, and unique value propositions. Together, these two documents equip agents with the necessary technical and strategic context to align their efforts with the project's goals.
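As a simple illustration, the Context component can be a concatenation of these two documents, as in the sketch below; the file names are hypothetical placeholders, since the actual project documents are internal.

```python
from pathlib import Path

# Hypothetical file names standing in for the internal project documents.
mvp = Path("mvp_mobile_delivery.md").read_text(encoding="utf-8")
vision = Path("product_vision_nabc.md").read_text(encoding="utf-8")

context = (
    "Minimum viable product (MVP) description:\n" + mvp
    + "\n\nProduct vision (NABC: Needs, Approach, Benefit, Competition):\n" + vision
)
```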

Agent Profiles. To set up the experiment, Austrian Post Group IT identified two main focus roles: the product owner (PO) and the requirements engineer (RE). This led to the creation of two distinct agent profiles. Agent PO understands the vision of the project. It is responsible for managing the product backlog and prioritizing user stories based on business value and customer needs. This agent ensures user stories align with the overall product strategy and objectives. Agent RE concentrates on the quality of user stories. It ensures that the user story description is unambiguous and the acceptance criteria are measurable, which is crucial for verifying that the story fulfills its objectives upon implementation.

The agent profiles are designed to reflect the actual functions of POs and REs in agile teams within the company. They were developed through an iterative process to ensure that agents not only understand their specific tasks but also execute them with a high level of expertise and in a manner conducive to collaborative software development. The profiles include role definition and expectations, key responsibilities, practical tips, and tone adjustment. An excerpt from the Agent PO's profile is shown in Fig. 1.

Fig. 1. An example excerpted from the PO profile

Subtasks. After specifying the task and identifying the participating agents, our next step involved detailing the sequence of interactions between these agents. To achieve this, we used an AI plan pattern [15] to generate a comprehensive list of key steps and subtasks for task completion, as well as to identify the responsible agents. This plan was further reviewed and refined by a Scrum master and a PO from the agile teams, ensuring that it aligns with the company's agile framework, common practice for requirements management, and project objectives. Figure 2 visualizes the structured conversation flow between the two agents and their subtasks in the task conduction phase, i.e. the collaborative and iterative interaction between agents during the user story improvement process.

Fig. 2. AI plan illustrated in the task conduction phase
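In the shape consumed by the conduction loop sketched in Sect. 3, such a plan can be represented as an ordered list of (agent, subtask) pairs. The steps below are a hypothetical approximation of the plan in Fig. 2, not the exact subtasks used in the experiment.

```python
# Hypothetical approximation of the Fig. 2 plan; the actual subtasks were
# generated with the AI plan pattern and refined by a Scrum master and a PO.
plan = [
    ("PO", "Subtask 1: Review the original user story against the product vision and MVP."),
    ("RE", "Subtask 2: Analyze the description and acceptance criteria for ambiguity and testability."),
    ("PO", "Subtask 3: Revise the user story to reflect the identified issues and the business value."),
    ("RE", "Subtask 4: Validate the revised story and acceptance criteria against the quality criteria."),
]
```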

4.2 Evaluation

Once the experiment was set up, ALAS was deployed to improve the quality of user stories for the mobile delivery application. To evaluate its effectiveness and identify opportunities for refining our approach, we designed a questionnaire based on the INVEST framework [3]. Table 1 shows the statements to be rated and the corresponding characteristics of good user stories from the INVEST framework. Considering the time required for participants to complete the survey, the questionnaire (see footnote 2) includes only six user stories: two originals and two ALAS-generated improved versions of each. One improved version was generated using the gpt-3.5-turbo-16k model and the other using the gpt-4-1106-preview model. Participants rated the user stories against the statements on a Likert scale from 1 to 5, where 1 indicates strong disagreement and 5 indicates strong agreement. Additionally, the questionnaire includes two open-ended questions per user story to collect participants' feedback on specific improvements, concerns, and suggestions for further improvement. Finally, participants provided an overall satisfaction rating and recommended the stories most suitable for the project.

Table 1. Statements and the corresponding characteristics of good user stories from the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, and Testable)

5 Results

Our survey collected 12 responses from six agile teams at Austrian Post Group IT, involving two POs, four developers, a test manager, a Scrum master, a requirements analyst, two testers, and a train coach. Notably, 10 out of 11 participants have been working at the company for over two years, and 9 have more than five years of experience in agile projects. Their expertise provided a solid foundation for evaluating the user stories in the survey. The participants dedicated an average of 33 minutes to completing the questionnaire.

The survey participants reported concerns about User Stories 1 and 2 (US1 and US2), criticizing both for ambiguity, particularly in the AC, which failed to describe the conditions for evaluating whether a story is complete. In addition, the business value in these stories remained vague, and specific error-handling scenarios in US1 were not adequately addressed.

The improved versions, US1(v.1) and US2(v.1), generated by the GPT-3.5-turbo agents, exhibited improvements in clarity, comprehensibility, and narrative flow, making the user stories more coherent. However, two participants noted that the new user story titles were overly creative and that the descriptions in the AC should be more detailed. In addition, concerns remained in the AC about scenarios such as the multiple printer connections identified in US1.

The improvements generated by the GPT-4 model, i.e. US1(v.2) and US2(v.2), were praised for their comprehensive content and clearer expression of business value. Specifically, the AC for US1(v.2) were improved, clearly resolving the ambiguous printer connection issues in US1 and US1(v.1). However, the added detail and clarity resulted in longer and more complex user stories, which can undermine their practical applicability: six survey participants noted concerns about user story descriptions being too long.

Table 2. Average ratings (1–5 Scale) of overall satisfaction and quality of user stories.

Table 2 summarizes the average ratings for overall satisfaction and the quality aspects of the user stories. Both US1(v.1) and US1(v.2) scored an average overall satisfaction of 4, while US2(v.2) scored 3.71, higher than US2(v.1). This preference is confirmed by 7 participants choosing US1(v.2) and US2(v.2) for the project. However, despite their merit in sufficient description (S4), both GPT-4 versions were rated lower on simplicity, brevity, and appropriate level of detail (S1, S2, and S3), particularly struggling with their size, with average scores of 3 and 3.17, respectively. Notably, US2(v.2) received the most disagreement regarding its size, with 5 participants marking "Disagree". This may affect the user story's comprehensibility (S1). US1(v.2) also received a slightly lower rating for technical achievability (S5) compared to US1(v.1). These results highlight concerns over the increased length and complexity of the user stories generated by the GPT-4 model, which noticeably affected satisfaction with these user stories, a sentiment corroborated by the participants' open-ended feedback.

6 Discussion

Our experiments with ALAS have demonstrated promising results in enhancing user story quality, particularly in terms of clarity, specificity, and business value articulation. This is evident from the increased overall satisfaction ratings and the textual feedback by survey participants.

Despite these enhancements, the agents' ability to learn from context, while impressive, reveals a gap in aligning with project-specific contexts and requirements. Feedback from one developer highlighted that US1(v.2) included an authentication process that, while relevant to the story, "seems to be out of scope of the US1". Similar comments came from another developer. These observations imply that certain requirement quality aspects might be missing or inadequately defined in the prompts. Therefore, careful prompt crafting and rigorous review by human experts, such as the PO, are crucial. For ALAS to be effective in specific tasks, involving the PO and domain experts in the task preparation phase is vital to tailor the prompts for optimal outcomes. Moreover, a quality analyst agent could be added to monitor the scope, level of detail, and relevance of the story description, simulating agile project practices.

In examining the parameters governing GPT models, particularly the Temperature parameter that controls the creativity of the output, we observe a double-edged sword. While a higher value encourages novel and diverse content generation, it also increases the risk of AI hallucination [13], which can lead to plausible yet inaccurate or irrelevant outputs. In our experiments, we set Temperature to the medium value of 1. However, this still poses a challenge in maintaining factual accuracy, emphasizing the need to incorporate techniques such as retrieval-augmented generation (RAG) [9] to mitigate the risk of irrelevant content generation.
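For reference, the sketch below shows where this parameter is set when calling the model, here using the OpenAI Python SDK (v1.x) chat completions API; the prompt content is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "<prompt assembled as described in Sect. 3>"  # placeholder, not a real prompt

completion = client.chat.completions.create(
    model="gpt-4-1106-preview",                      # model used for US1(v.2) and US2(v.2)
    messages=[{"role": "user", "content": prompt}],
    temperature=1,                                   # medium creativity; higher values raise hallucination risk
)
response_text = completion.choices[0].message.content
```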

7 Conclusion

In this study, we presented ALAS, which integrates GPT models as agents to enhance requirement quality in agile software development. The initial findings showed that ALAS improves user story clarity, comprehensibility, and alignment with business objectives. However, the findings also highlighted the indispensable role of human intelligence, particularly that of the PO in software projects, in monitoring the improvements to the stories and ensuring the integrity of automatically produced outputs. This study contributes a proof of concept for AI-assisted user story quality improvement. Although the evaluation is limited to two user stories, it marks a promising step towards bridging the gap between AI capabilities and human expertise in software development.