1 Introduction

Requirements are crucial in product development as they serve as a basis for communication among stakeholders, teams, and companies (Loucopoulos 2005; Gericke and Blessing 2012). Specification sheets are commonly used to capture project requirements, aiming for a complete, unambiguous, and contradiction-free system definition (DIN 69901-5:2009-01 2009; Bender and Gericke 2021). However, these sheets often contain thousands of natural language requirements derived from interdisciplinary collaboration. Because of the complexity of modern systems and the need for distributed, concurrent development, the requirements are typically organized and managed in separate documents for different levels of the product (VDI 2221 Blatt 1:2019-11 2019; Göhlich and Fay 2021). As a result, errors, including contradictions, are frequently found in these documents. Currently, there is a lack of comprehensive automated tools for detecting contradictions and performing quality analysis on industrial specification documents.

Product development is a complex and dynamic process that requires proper management and understanding of the requirements. Multiple studies have demonstrated the critical importance of high-quality requirements for the success of development projects. Requirements quality research centers on individual characteristics of requirements, such as completeness, complexity, ambiguity, and consistency (Montgomery et al. 2022; IEEE/ISO/IEC 29148-2018 2018). This paper focuses on inconsistencies in the form of contradictions, which can cause misunderstandings and product defects.

Natural Language Inference (NLI) is a common problem in the field of natural language processing (NLP), in which the objective is to determine the nature of the relationship between a premise and a hypothesis, both represented as sentences (Jang et al. 2020). In recent years, NLP has become more sophisticated, with machine learning models capable of handling complex tasks such as question answering, text extraction, and sentence generation. It has been shown that extending NLI to manage sentence relationships could have significant implications for scientific text analysis during product development (Ritter et al. 2008).

We focus on formal requirements written in natural language that adhere to specific formulation guidelines, particularly single-sentence requirements that follow the principle of atomicity. For example, employing standard sentence templates for requirements can ensure adherence to these guidelines. Such templates are widely used across different industries and sectors involved in hardware, software, or system development, especially at lower system levels (Dick 2017; Sophist GmbH 2016; Wiegers and Beatty 2013; Robertson and Robertson 2013). This approach facilitates consistency and minimizes errors arising from linguistic nuances.

Various solutions have been proposed to tackle this challenge, including knowledge graphs and thesauri. Knowledge graphs are formal representations of concepts, ideas, or objects about a specific domain and all the relationships between those concepts. Thesauri group together words with similar (synonyms) and opposite (antonyms) meanings (Ahmad et al. 2020). While these solutions can be helpful in identifying contradictions in requirements, they have limitations in their practicality, as we will discuss further in Sect. 3.

Building upon our foundational work (Gärtner et al. 2022) which introduced a taxonomy for detecting contradictions and proposed a semi-automated approach, we made further strides in automation in Gärtner et al. (2023). Our contribution involved developing a method to identify conditions and other sentence components within requirements. This advancement builds upon our previous framework, emphasizing our ongoing commitment to enhancing the automated detection of contradictions in requirements engineering.

Our current research introduces ALICE (Automated Logic for Identifying Contradictions in Engineering). This system synergizes the capabilities of formal logic with large language models (LLMs) to automatically detect and resolve contradictions between two requirements.

Formal logic is a branch of mathematics that deals with the precise rules and structures for reasoning and inference. On the other hand, LLMs are artificial intelligence (AI) models that use machine learning algorithms to learn patterns and relationships between tokens in large natural language datasets. We will elaborate further on this in Sect. 2.2 and explain why LLMs are needed to detect contradictions. Combining these two solutions offers a unique approach to identifying and resolving contradictions automatically, leveraging the capabilities of formal logic and the data-driven approach of LLMs to identify inferences in natural language.

Through this research, we aim to provide an understanding of the strengths and limitations of this approach and its potential applications in product development. We evaluate the effectiveness of ALICE using real requirements specifications, demonstrating its potential and paving the way for future optimization.

This paper significantly contributes to automated contradiction detection in requirements expressed in controlled natural language (Schwitter 2010). Such requirements are particularly relevant in the automotive domain and related fields, where sentence templates are standard for formulating requirements (Dick 2017; Sophist GmbH 2016). However, it is essential to note that definitions of formal requirements differ: the requirements we consider are more natural and intuitive than those discussed in the related work (see Sect. 3.2).

To establish the groundwork for this research, we built upon our previous work (Gärtner et al. 2022), which proposed six essential questions for identifying contradictions. In this paper, we implemented these questions and improved their effectiveness by modifying and expanding them, resulting in a comprehensive set of seven categorized inquiries. These refined inquiries serve a dual purpose: firstly, to determine the presence of contradictions in the requirements, and secondly, to identify the specific types of contradictions that may be present.

We provide a detailed, step-by-step explanation of our implementation and discuss the evaluation using real-world requirements.

2 Fundamentals

There are several ways to identify contradictions between sentences without human intervention. In the following section, we clarify the term contradiction, discuss two prominent methods—formal logic and machine learning—and explain why we chose a combination of both.

2.1 Contradictions

Contradictions are statements that conflict with one another. However, this definition is not intuitively clear, so it must be made explicit:

Different definitions of contradictions can be found in the literature on requirements engineering (RE). To get the most generically valid and scientifically accepted definition, we build our theory on the logical philosophy of Aristotle (Aristotle 1999), as shown in Gärtner et al. (2022). The foundation of his logic—also known as term logic, traditional logic, or formal logic—developed in his work Metaphysics is the law of non-contradiction (LNC) (Horn 2018). Derived from this, Fig. 1 shows different types of RE-relevant contradictions.

Fig. 1
figure 1

Contradictions relevant to RE

Following the interpretation of Karlova-Bourbonus (2019), we consider contraries in addition to contradictories. Although both fall under the umbrella term contradiction, they are not synonymous and must be clearly distinguished, as the following list and the sketch after it illustrate:

  • Contradictory opposites, such as 'he is sick' and 'he is not sick,' are mutually exclusive; exactly one of the two must be true, without overlap.

  • Contrary opposites, like 'it is black' and 'it is white,' cannot be true at the same time but can both be false, indicating they are not exhaustive.

  • The subaltern relation is a logical step-down: if 'everybody is sick' is true, then 'some people are sick' must also be true.
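To make these relations concrete, the following minimal Python sketch (our illustration, not part of the formal taxonomy) encodes each relation as a constraint over the truth values of two statements:

```python
# Minimal illustration: the three opposition relations as constraints
# over the truth values of two statements a and b.

def contradictory_holds(a: bool, b: bool) -> bool:
    # Contradictories are mutually exclusive and exhaustive:
    # exactly one of "he is sick" / "he is not sick" is true.
    return a != b

def contrary_satisfiable(a: bool, b: bool) -> bool:
    # Contraries cannot both be true ("it is black" / "it is white"),
    # but both may be false (the object could be green).
    return not (a and b)

def subaltern_holds(universal: bool, particular: bool) -> bool:
    # Logical step-down: "everybody is sick" entails "some people are sick".
    return (not universal) or particular
```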

In practice, identifying contradictions in natural language can be challenging. Based on empirical evidence, it is apparent that only a small number of contradictions observed in real-world situations exhibit explicit characteristics. ‘Rather, contradictions make use of a variety of natural language devices […]. The most sophisticated kind of contradictions, the so-called implicit contradictions, can be found only when applying world knowledge and after conducting a sequence of logical operations’ (Karlova-Bourbonus 2019), such as ‘the car must be as fast as possible’ and ‘the car must be as fuel efficient as possible.’ Those familiar with physical rules know that the faster a car drives, the more fuel it consumes. In this study, however, we focus solely on explicit (prototypical) (De Marneffe et al. 2008) contradictions, such as ‘The car is fast’ and ‘The car is slow,’ without conducting deep meaning processing.

2.2 Theoretical background for NLP

In the following section, we will explore formal logic as a subset of symbolic AI and Large Language Models (LLMs) as a subset of machine learning (ML).

2.2.1 Formal logic

Symbolic artificial intelligence represents intelligence using abstract symbolic representations and logical rules, which are the core elements of formal logic. Unlike earlier approaches aimed at imitating human thought processes (Newell and Simon 1956), the symbolic AI paradigm sought to replicate human knowledge and understanding at a conceptual level.

Formal logic plays a crucial role in symbolic AI, as it encompasses a wide range of applications, including handling simple mathematical tasks, performing string comparisons, and extending to more complex symbolic reasoning and logical operations such as rule-based reasoning, knowledge representation, expert systems, and automated theorem proving. Tasks such as parsing sentences, planning responses, reasoning about meaning, and building expert systems have relied on symbolic problem-solving techniques (Jurafsky and Martin 2019).

It is essential to acknowledge that symbolic AI, which relies on formal logic, has achieved significant milestones. However, it faces limitations when dealing with complex, ambiguous, and uncertain problems (Russell and Norvig 2016). It struggles with tasks that require fluid and intuitive thinking, perceiving subtle distinctions, and generalizing knowledge across domains: it would have difficulties determining whether something could be fast and big simultaneously but not fast and slow.

While rule-based systems alone cannot match human intelligence, they provide key capabilities such as logical reasoning, explanation, and abstraction integral to human cognition (Johnson-Laird 2006; Miller 2019).

2.2.2 Machine learning

Machine learning (ML) is an approach to AI that uses data-driven techniques to train models that can learn and make predictions. Unlike Symbolic AI, ML algorithms learn patterns and relationships from large amounts of data, enabling the systems to adapt to new information and make accurate predictions in real-world scenarios.

Classical ML models excel in specific tasks like image classification, speech recognition, or natural language processing. However, they may struggle to identify contradictions across diverse contexts, as their knowledge is confined to the data used during training (Goodfellow et al. 2016; Kim 2018). Therefore, even these models would have difficulty determining the relationships between something being fast and big or fast and slow simultaneously.

On the other hand, LLMs like GPT (OpenAI 2020) and LLaMA (Touvron et al. 2023) are pre-trained on extensive datasets, allowing them to learn general language patterns and relationships. This broad knowledge enables LLMs to better understand context and semantics, crucial for identifying contradictions (Surana et al. 2022). They can indeed identify that something could be fast and big simultaneously but not fast and slow. This is why, in this work, we use LLMs to detect contradictions.

3 Related work

Previous research in requirements management has proposed several methods for tackling inconsistencies between requirements. The following sections discuss taxonomies, NLP-based conflict detection, and ontology-based conflict detection approaches.

3.1 Classifying conflicts

In our foundational work (Gärtner et al. 2022), we introduced a nuanced taxonomy for detecting contradictions based on Aristotle’s Law of Non-Contradiction, recognizing the need for a systematic approach tailored to the intricacies of requirements engineering. This classification emerged from a rigorous analysis of contradictions commonly observed in requirements documents, ensuring a methodical and scientifically grounded approach rather than an arbitrary categorization. By adopting this classification, we aimed to bridge the gap between theoretical logic and practical application in software development, ensuring our methodology is grounded in logical rigor and applicable to the demands of modern engineering projects.

The proposed method categorizes contradictions into distinct types: contradictories, contraries, and subalterns, which fall into three subcategories: Simplex, Idem, and Alius, as shown in Fig. 2. The decision to focus on these subtypes was driven by their prevalence in real-world requirements and unique challenges in automated detection and resolution.

Fig. 2
figure 2

Nine types of contradictions for RE (Gärtner et al. 2022)

Simplex contradictions (from Latin ‘simple’) are characterized by direct opposition without conditional statements. These contradictions are straightforward but crucial for establishing our classification framework. For example, ‘The car must be red’ versus ‘The car must be blue’ showcases apparent, uncomplicated contradictions.

Idem contradictions (from Latin ‘same’) involve identical conditions leading to contradictory outcomes, presenting challenges due to their conditional nature. An example is ‘If the customer wishes, the car must be red’ and ‘If the customer wishes, the car must be blue,’ where the same condition yields conflicting requirements.

Alius contradictions (from Latin ‘different’), distinguished by differing conditions that result in incompatible conclusions, illustrate the complexity of engineering requirements. An instance of this is ‘If the customer wishes, the car must be red’ versus ‘If the car has four doors, the car must be blue,’ demonstrating how different conditions can lead to contradictory outcomes.

These categories—Simplex, Idem, and Alius—were chosen to address the varied and complex nature of contradictions in real-world requirements. By providing a clear structure for identifying and analyzing these contradictions, our approach enhances the capability of automated tools to manage these challenges.

Different types of contradictions occur in varying numbers, have varying levels of project impact, and require different detection approaches. As a result, our previous study defines questions that can be used to identify these types of contradictions. This standardized solution is the basis for the fully automated contradiction detection presented in Sect. 4 of this paper.

Guo et al. (2021) suggest classifying conflicts into three fundamental types: inconsistencies, inclusions, and interlocks, each of which can be further subdivided into seven subcategories. Although this approach shows potential, the authors do not provide a clear method for identifying these categories.

Wu et al. (2022) categorize contradictions into six types: negation, antonym, replacement, switch, scope, and latent. In this classification, negation, identified through negative words like ‘no,’ ‘not,’ ‘never,’ and ‘neither–nor,’ aligns with what we define as contradictory. Both antonyms and replacements fit within our category of contraries. The concept of scope, which involves either narrowing or expanding the scope expressed in a sentence, corresponds to our subaltern category. However, switch is not a contradiction in the context of RE, exemplified by the sentences ‘Sally sold a boat to John’ and ‘John sold a boat to Sally’: this scenario does not necessarily indicate a contradiction, as it is possible for Sally to sell a boat to John while John simultaneously sells his boat to Sally. Lastly, latent contradictions, which only emerge through inference, fall outside this paper's focus, as we concentrate on explicit contradictions.

3.2 Natural language processing for detecting contradictions

As a general observation, Zhao et al. (2021) report that most studies (67.08%) in the domain of NLP combined with RE are solution proposals evaluated through laboratory experiments or example applications. In contrast, only a tiny percentage (7%) of these studies undergo evaluation in industrial settings, resulting in insufficient practical validation.

De Marneffe et al. (2008) suggest a definition for contradictions in NLP tasks and analyze the task’s performance. However, their focus is on implicit rather than explicit contradictions, aligning closely with the switch-contradictions delineated by Wu et al. in the preceding section.

Li et al. (2017) state that traditional context-based word embedding learning algorithms are ineffective for mapping contrasting words. To address this issue, they developed a neural network to learn contradiction-specific word embeddings that can separate antonyms. However, their emphasis is not on explicit contradictions; consider, e.g., ‘Some people and vehicles are on a crowded street’ and ‘Some people and vehicles are on an empty street’ (Li et al. 2017). While these sentences contrast with each other, they might both be true simultaneously. Other approaches with the same goal as Li et al. utilize WordNet and thesauri to obtain additional antonyms and synonyms as semantic constraints (Chen et al. 2015; Liu et al. 2015). These methods might be helpful for fine-grained optimization, as we will mention in Sect. 6.

Ritter et al. (2008) present an approach for automatic contradiction detection. However, their approach does not address explicit contradictions in the context of RE, so their technique cannot be directly applied to our case. Nevertheless, it may prove helpful for future optimization.

Heitmeyer et al. (1996) introduce a technique called consistency checking for detecting errors in requirements specifications using formal logic. Their paper analyzes formal requirements written in the SCR (Software Cost Reduction) notation and addresses issues like type errors, nondeterminism, missing cases, and circular definitions.

However, the requirements are not written using everyday language. Instead, they are expressed in a highly structured manner, where each requirement component has a predetermined category that it must be placed in during the writing process. Our method aims to meet requirements written more intuitively and naturally while still being formal.

Gervasi and Zowghi (2005) explore using formal logic and natural language parsing techniques to identify inconsistencies in requirements. The authors propose a method that automatically discovers and addresses these logical contradictions using theorem-proving and model-checking techniques. By translating the requirements into a set of logic formulae, they can detect contradictions such as \(\alpha\) and its negation \(\neg \alpha\). Hunter and Nuseibeh (1998) suggest utilizing formal logic that records and tracks information, enabling inconsistent information to be traced by propagating labels and their associated data. While Hunter and Nuseibeh do not focus solely on negated assumptions as Gervasi and Zowghi do, they can only detect conflicts if at least part of an argument is negated, such as \(\alpha\) and \(\neg \alpha\). Therefore, both approaches cover an essential part of the issue addressed in this paper, namely contradictories (see Fig. 2). However, they do not consider contraries and subalterns, such as ‘the car must be red’ and ‘the car must be blue’ or ‘it must be under 20’ and ‘it must be under 10’.

Other research papers that use NLP focus on categorizing requirements, such as distinguishing between functional and non-functional requirements (Kurtanovic and Maalej 2017), addressing security-related requirements (Jindal et al. 2016) or finding contradictions in prose texts (Karlova-Bourbonus 2019; Sepúlveda-Torres et al. 2021).

4 Automation method

The combination of formal logic and LLMs leverages their respective strengths to identify contradictions between statements. Formal logic offers a rigorous framework for reasoning about relationships between statements, while the LLMs' access to world knowledge can be exploited through specific prompts, guiding ALICE to identify contradictions. The prompt engineering process involves designing and fine-tuning prompts to elicit the desired behavior from the model. This combination enables the model to perform logical inference and identify contradictions in a scalable and flexible manner. The resulting system can be used to evaluate the consistency of arguments in natural language text. While we cannot provide a comprehensive account of all the detailed procedures and optimizations used in the code, we will explain the relevant steps in the following sections.

4.1 Method fundamentals

In this section, we will explore the core components of our approach, including a decision tree, formal logic, and GPT prompting.

4.1.1 Decision tree

In our previous work, we implemented a classifier system to identify whether and what type of contradiction is present. In the present paper, we designed this system as a decision tree, as shown in Fig. 3. Here, seven questions must be answered to determine the specific type of contradiction. The first four questions refer to the effects of the requirements, and the following three questions refer to the condition, if any. One notable advantage of this method is its modular nature, allowing easy adaptability to future advances in the field: if more effective methods for answering the questions are developed later, they can be integrated without difficulty. Based on these seven questions, we show how ALICE reaches the desired goal and discuss the advantages of each approach (ALICE or LLM-only) in Sect. 5.

Fig. 3
figure 3

Modular decision tree for contradiction detection in RE
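To make the traversal concrete, the following Python sketch shows one plausible way the seven boolean answers could map onto the taxonomy of Fig. 2. It is our simplified reconstruction; the exact branching in Fig. 3 may differ:

```python
# Simplified sketch of the decision tree (our reconstruction; the exact
# branching in Fig. 3 may differ). q1..q7 are boolean answers to the
# seven questions of Sect. 4.3.

def classify(q1, q2, q3, q4, q5, q6, q7):
    if not q1:                    # variables neither identical nor subsets
        return "no contradiction"
    if q3:                        # mutually exclusive actions
        kind = "contradictory"
    elif q4:                      # inconsistent, but not directly opposite
        kind = "contrary"
    elif q2:                      # one effect encompasses the other
        kind = "subaltern"
    else:
        return "no contradiction"
    if not q5:                    # no condition present
        return f"Simplex {kind}"
    if q6:                        # identical conditions
        return f"Idem {kind}"
    if q7:                        # conditions can co-occur
        return f"Alius {kind} (potential)"
    return "no contradiction"     # conditions mutually exclusive
```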

4.1.2 Formal logic

To detect contradictions, we need the constituents, i.e., condition and effect, as well as variable and action, as seen in Fig. 4. This analysis is pivotal, establishing the core structure of our method to address the seven questions proposed by our model effectively.

Fig. 4
figure 4

Showcase of condition and effect, as well as variable and action in a formal requirement

Take the requirement ‘If the threshold is reached, the controller must limit the speed decay’ as an example. The segments ‘If the threshold is reached’ and ‘the controller must limit the speed decay’ serve as the condition and the resultant effect, respectively. Within the condition, ‘The threshold’ represents the variable, and ‘is reached’ the action. Similarly, within the effect, ‘the controller’ is the variable and ‘must limit the speed decay’ the action. This is shown in Fig. 4.
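Such a decomposition can be captured in a simple data structure; the following sketch (class and field names are ours) encodes the example above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clause:
    variable: str  # e.g. "the threshold" / "the controller"
    action: str    # e.g. "is reached" / "must limit the speed decay"

@dataclass
class Requirement:
    effect: Clause
    condition: Optional[Clause] = None  # None for unconditional requirements

req = Requirement(
    condition=Clause(variable="the threshold", action="is reached"),
    effect=Clause(variable="the controller",
                  action="must limit the speed decay"),
)
```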

In terms of implementation, this step involves utilizing symbolic AI using a grammatical model that incorporates grammatical rules, trigger words, and POS tags to detect conditions, as shown in Gärtner et al. (2023).

It is important to note that conditional statements are not statements of causality. The presence of oxygen, for example, is a condition for a match to light, whereas striking the match is the cause; the lighting of the match is called the effect.

For the contradiction detection presented in this paper, we further extended this logic by incorporating logical and mathematical rules, including string similarity comparisons and mathematical operations.
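As an illustration of such rules, the sketch below combines a fuzzy string comparison with a numeric bound check. The threshold, the regular expression, and the helper names are our assumptions, not the exact rules used in ALICE:

```python
import re
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    # Fuzzy string comparison, e.g. to match the same signal name
    # across two requirements despite minor formatting differences.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def differing_upper_bounds(action_a: str, action_b: str) -> bool:
    # Flags pairs such as "must be under 20" vs. "must be under 10":
    # the stricter bound entails the looser one (a subaltern relation).
    matches = [re.search(r"under (\d+)", s) for s in (action_a, action_b)]
    return all(matches) and matches[0].group(1) != matches[1].group(1)
```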

4.1.3 GPT

To leverage the power of LLMs while addressing the unpredictability of their responses, we adopted a frozen model approach using GPT-3. Utilizing this version ensured no updates would be applied to the model during our experiments, providing consistent behavior. Later models, such as GPT-3.5 and GPT-4, are subject to constant updates by OpenAI (2023a). For completeness, we nevertheless tested and briefly evaluated the newer models. In our implementation, we employed the following configuration settings: the engine was set to text-davinci-003, the prompt was provided as the input, a temperature of 0 was used to ensure deterministic responses, the maximum number of tokens was set to 5, top-p sampling with a probability of 1 was employed to avoid randomness, and both the frequency penalty and presence penalty were set to 0 to ensure a neutral influence on the generation process. This configuration allowed us to explore the capabilities of the frozen GPT-3 model and examine its outputs in a controlled and predictable manner.
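In code, this corresponds to a call against OpenAI's legacy completions endpoint; the following sketch uses the 0.x `openai` Python package that was current at the time, with API key handling elided:

```python
import openai  # legacy 0.x interface used at the time of the experiments

openai.api_key = "..."  # key handling elided

def ask_gpt3(prompt: str) -> str:
    response = openai.Completion.create(
        engine="text-davinci-003",  # frozen GPT-3 model
        prompt=prompt,
        temperature=0,        # deterministic responses
        max_tokens=5,         # short yes/no style answers
        top_p=1,              # no nucleus-sampling randomness
        frequency_penalty=0,  # neutral influence on generation
        presence_penalty=0,
    )
    return response.choices[0].text.strip()
```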

4.2 Preprocessing

In NLP tasks, preprocessing is critical in preparing unstructured textual data for analysis. Preprocessing aims to convert raw text into a structured and meaningful format suitable for further analysis. Our research builds on our prior work (Gärtner et al. 2023), making it part of ALICE. It utilizes a fully automated preprocessing pipeline consisting of the following stages: (1) data reduction to replace or eliminate special characters such as ‘<’ with their corresponding expressions; (2) text parsing to generate dependency trees and perform part-of-speech tagging; and (3) identification of verbal expressions, including conditions, effects, variables, and actions, collectively referred to as constituents, as explained in Sect. 4.1.2.
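A minimal sketch of these three stages, using spaCy as one possible parsing backend (the actual implementation follows Gärtner et al. 2023 and is more elaborate), might look as follows:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS/dependency-capable model

def preprocess(requirement: str):
    # (1) Data reduction: replace special characters with expressions.
    text = requirement.replace("<", "less than").replace(">", "greater than")
    # (2) Text parsing: dependency tree and part-of-speech tags.
    doc = nlp(text)
    # (3) Constituent identification: a trigger-word heuristic for
    #     sentence-initial conditions (simplified for illustration).
    condition, effect = None, text
    for token in doc:
        if token.lower_ in ("if", "when") and token.i == 0:
            condition = " ".join(t.text for t in token.head.subtree)
            effect = text.replace(condition, "").lstrip(", ")
            break
    return condition, effect
```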

4.3 Questions

The following section summarizes the seven key questions that form the basis of the ALICE methodology for detecting contradictions within requirement specifications. Each question targets a specific aspect of potential contradictions, enabling a structured and thorough analysis. Detailed examples and GPT-based prompts for each question are provided in the Appendix; an illustrative prompt sketch follows the list below.

  1. Variable identity: Identifies whether the two requirements' variables are identical or subsets of one another.

  2. Effect inclusivity: Investigates whether one effect encompasses another, which is essential for identifying subaltern contradictions.

  3. Mutual action exclusivity: Identifies directly opposing actions, indicating contradictions where the truth of one negates the other.

  4. Mutual action inconsistency: Assesses whether actions are inherently contradicting without being direct opposites, which is essential for identifying contraries.

  5. Condition presence: Identifies conditional clauses that might affect the interpretation of requirements, helping eliminate Simplex contradictions.

  6. Condition equivalence: Compares conditions in two requirements to detect Idem contradictions, which require the same condition.

  7. Condition co-occurrence: Evaluates the possibility of two conditions coinciding, helping eliminate Alius contradictions.
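As an illustration, a GPT prompt for question 4 (mutual action inconsistency) could be phrased along the following lines. The wording is ours and not the exact prompt from the appendix; `ask_gpt3` is the helper sketched in Sect. 4.1.3:

```python
Q4_TEMPLATE = (
    "Could the following statements potentially contradict each other? "
    "Answer with yes or no.\n"
    "Statement 1: {action_1}\n"
    "Statement 2: {action_2}"
)

prompt = Q4_TEMPLATE.format(
    action_1="it must be set to true",
    action_2="it must correspond to the parameterisable value Ldn_C",
)
answer = ask_gpt3(prompt)  # see the API sketch in Sect. 4.1.3
is_inconsistent = "yes" in answer.lower()
```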

5 Validation and results

We adopted a structured empirical validation protocol based on established research methodologies to ensure a rigorous and systematic evaluation of ALICE. Our approach is delineated by specific research questions (RQs) designed to assess the effectiveness and applicability of ALICE in identifying and resolving contradictions in engineering requirements.

  • RQ1: How does ALICE compare to existing LLM approaches regarding accuracy and recall in detecting contradictions within complex specification sheets?

  • RQ2: How does employing ALICE affect the efficiency and scalability of contradiction detection processes in real-world datasets?

5.1 Data analysis and methodology

We used three datasets in our study. Dataset 1 was compiled specifically for this study and served as an initial test suite for developing our method. This dataset (see Supplementary Information) includes all possible types of contradicting requirement pairs as well as non-contradicting pairs. Datasets 2 and 3 were obtained from a recent real-world electric bus project: a comprehensive set of requirements was created to outline a modular system intended to replace traditional bus powertrains with modern electric ones. The significance of requirements in developing electric buses is discussed in Design of urban electric bus systems (Göhlich et al. 2018). Real-world specification sheets usually cannot be shared with the public due to confidentiality concerns. Nevertheless, we have anonymized 210 requirement pairs and provided access to them, as detailed in Sect. 9.

Dataset 2 contains 1071 requirement pairs, which have been manually checked and labeled for contradictions. Hence, it can be used to validate our method. Dataset 3 contains 3916 pairs, which were not manually checked. On the one hand, it was used to show that the method can handle large datasets; on the other hand, it served to compare ALICE and GPT-3.

Our datasets are inherently unbalanced, reflecting the real-world scenario where the vast majority of requirements do not contradict each other. This imbalance is not a flaw but a feature of using authentic datasets, where contradictions are naturally rare when comparing each requirement against others. Our analysis methods are tailored to recognize and effectively handle such skewness, aiming to accurately identify the relatively few, yet critically important, contradicting combinations. This approach underscores the practical applicability of our method in real-world settings, where the objective is not to balance data artificially but to mirror and navigate the complexities of actual requirement specifications.

Additionally, rather than employing machine learning techniques trained on the unbalanced dataset, our methodology leverages formal logic and pre-trained LLMs, reinforcing the method's adaptability and efficacy in handling the intricacies of requirement specifications without the need for dataset balancing. Traditional machine learning techniques often require a significant quantity of labeled data to capture the underlying features and patterns of a specific problem domain. Given the absence of a large dataset tailored to contradicting requirements and the inherent imbalance of real-world data, incorporating such ML techniques is infeasible.

We compared the results of ALICE to those obtained when using only LLMs. We conducted an evaluation solely focused on detecting the presence of contradictions, without classifying the specific type of contradiction: while ALICE demonstrated the capability to identify specific types of contradictions, LLMs are not equipped with this classification awareness. Contradictions in specialized industrial domains hinge on subtle semantic details, and LLMs, trained on broad datasets, are limited in their ability to specialize in such nuanced tasks. Thus, we limited the evaluation to general contradiction detection to ensure comparability.

Finally, our evaluation did not consider methods using only formal logic, see Sect. 3.2. As mentioned, traditional context-based word embedding algorithms like Word2Vec or GloVe are ineffective for mapping contrasting words (Li et al. 2017). For instance, they would both struggle to detect that a car can drive and be red simultaneously, but it cannot drive and be in the air simultaneously. The latter example represents an apparent contradiction that would require real-world knowledge to identify.

5.2 Results for dataset 1

We present the first dataset’s results based on the method presented in Sect. 4. It comprises 87 requirement pairs, of which 61 exhibit contradictions. The selection of the contradictory pairs encompasses all nine types of contradictions and considers as many variations in their formulations as possible.

An extract of the 87 pairs is shown in Table 1, with one example for each type of contradiction and some examples that are not contradictions.

Table 1 Extract from dataset 1

Table 2 displays the confusion matrices as tabular representations of ALICE and GPT3 performance. In these matrices, True Positives and True Negatives indicate correct predictions for contradictions and non-contradictions, respectively. False Positives and False Negatives represent errors where non-contradictions were mistaken for contradictions and vice versa. From these figures, we derive key metrics like accuracy, precision, and recall, which reflect the model’s performance in identifying true contradictions among the dataset (Shultz et al. 2010).
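For reference, the metrics are computed from the confusion-matrix entries in the standard way:

$$\text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN},\qquad \text{precision} = \frac{TP}{TP+FP},\qquad \text{recall} = \frac{TP}{TP+FN}$$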

Table 2 Results for reference dataset in the form of confusion matrices (Color figure online)

The LLM-only prompts were formulated as shown in Fig. 5. The returned value is then checked for the string ‘yes’; if present, the answer is considered positive.

Fig. 5
figure 5

LLM prompt-pseudo code
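A minimal Python rendering of this pseudo code (prompt wording ours, reusing the `ask_gpt3` helper from Sect. 4.1.3) is:

```python
def llm_only_contradicts(req_1: str, req_2: str) -> bool:
    # LLM-only baseline: a single prompt per requirement pair.
    prompt = (
        "Do the following two requirements contradict each other? "
        "Answer with yes or no.\n"
        f"Requirement 1: {req_1}\n"
        f"Requirement 2: {req_2}"
    )
    # The answer counts as positive if it contains the string "yes".
    return "yes" in ask_gpt3(prompt).lower()
```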

ALICE achieved an accuracy of 72%, a recall of 75%, and a precision score of 83%. In contrast, the LLM-only method achieved an accuracy of 47%, a recall of only 32%, and a precision score of 80%. Attempts to guide the model's behavior using n-prompting, also known as ‘chain of thought’ prompting, which mimics human reasoning by guiding a language model through a series of simpler, articulated steps before concluding, proved unsuccessful. Several attempts were made, mainly using examples of contradictions and the seven questions described in Sect. 4.3.

ALICE is more likely to identify contradictions (i.e., 9 + 45 = 54) than the LLM (i.e., 5 + 20 = 25). This is even clearer in the following datasets.

5.3 Results for dataset 2

In this section, we analyzed 1071 combinations to validate our method. ALICE achieved an accuracy of 99%, a recall of 60%, and a precision score of 94%. On the other hand, the LLM-only method achieved an accuracy of 97%, a recall of 0%, and a precision score of 0%. Due to unbalanced data, the accuracy alone may not be a highly relevant metric. The recall, which measures the proportion of correctly identified positives, provides a better understanding of the results. Accordingly, ALICE detected 60% of all contradictions, whereas the LLM method did not detect any. For a more comprehensive analysis of the results, the confusion matrices are shown in Table 3.

Table 3 Confusion matrices for the first dataset (Color figure online)

An example of an accurately identified contradicting pair (true positive) by ALICE is as follows:

  1. If the value of the signal B_LdnTrvOfSin3Load_fKlLdngr exceeds the value of 0, the value of the segment output signal LdnProcTrvtv must be set to True.

    $${\text{Formal}}: if\; condition\_1 \Rightarrow signal \stackrel{!}{=} True$$

  2. If the Parameter Load_DomWrtTgtLdnProcTrvtvToSW is set to True, the signal LdnProcTrvtv corresponds to the configurable value Load_DomWrtLdnProcTrvtv.

    $${\text{Formal}}: if\; condition\_2 \Rightarrow signal \stackrel{!}{=} config.$$

This is a contradiction of the type Alius contrary since it does not exclude that \(condition\_1\) and \(condition\_2\) may occur at the same time, while the effects (\(signal \stackrel{!}{=} True\) and \(signal \stackrel{!}{=} config\)) contradict each other.

It is a potential contradiction because it is theoretically feasible that the conditions have been defined in another place so that parallel occurrence is excluded. Automated detection of such scenarios is not within the scope of this paper but is conceivable for the future. Nonetheless, a manual verification was conducted on a selective basis, where the absence of corresponding exclusions was confirmed for the prior example: The first requirement belongs to the Function section, while the subsequent requirement belongs to the Manual Output section. It could be assumed that the Manual Output supersedes all other requirements due to its later placement; however, this assumption lacks clarity and unambiguity. Sequential processing of requirements must be explicitly defined and is not inherently evident. On the contrary, in cases where sequentially overwriting actions are present, they must be distinctly indicated. Therefore, this is indeed a real contradiction.

An example of a requirement pair falsely identified as Alius contrary (false positive) by ALICE is as follows:

  1. If the value of the signal B_Mt[…]Mngt is above the value of Load_DebVwTgtInpSig == 1 s for longer than Load_ThdOfLdnTrvMtl == 0.8, the value of the segment output signal LdnProcTrvtv must be set to True.

    $${\text{Formal}}: if\; condition\_1 > C \Rightarrow signal \stackrel{!}{=} True$$

  2. If none of the input signals is above the value of 0 and the value of the signal B_Mt[…]Mngt falls below the value of Load_DebVwTgtInpSig == 5 s for longer than Load_ThdOfLdnTrvMtl == 0.8, the value of the segment output signal LdnProcTrvtv must be set to False.

    $${\text{Formal}}: \left( if\; condition\_2 \right) \wedge \left( if\; condition\_3 \right) \Rightarrow signal \stackrel{!}{=} False$$

The issue, in this case, traces back to the response provided for question 7. ALICE fails to recognize ‘above x’ (condition_1) and ‘below x’ (condition_3) as mutually exclusive, leading it to believe that both conditions can co-occur. This would result in contradicting effects, thereby rendering the requirements contradictory. It is important to note that, in the current configuration, question 7 involves GPT-3 as the LLM. More advanced LLM models may yield better results in the future.

An example of a falsely identified contradicting pair (false positive) when relying solely on GPT3 (LLM-only) and not using ALICE is as follows:

  1. If the CAN status bit B_NchPsst12333 has the value False, an error must be written to the segment-internal error bus.

    $${\text{Formal}}: if\; condition\_1 \Rightarrow x \stackrel{!}{=} error$$

  2. If the CAN status bit BVwLoad_sOfNch12333 has the value True, an error must be written on the segment-internal error bus.

    $${\text{Formal}}: if\; condition\_2 \Rightarrow x \stackrel{!}{=} error$$

When asked why it reached this conclusion, GPT explains: 'Since these are different CAN message status bits, both one and the other cannot be fulfilled at the same time. Therefore, the requirements contradict each other.' ALICE correctly identified this pair as non-contradicting.

5.4 Application to large datasets

The third dataset consists of 3916 combinations of requirements, which ALICE took approximately 4 h to process. GPT-3 took 0.9 s on average per requirement pair, resulting in approximately one hour for the whole dataset. ALICE achieved a precision score of 86%, signifying high accuracy in its predictions. In contrast, GPT-3's precision score was 0%. Our focus during this analysis was not on optimizing for speed or code efficiency; thus, not all combinations could be thoroughly analyzed within the scope of this research. We did not employ parallel processing techniques, indicating a potential for future performance enhancements. Nevertheless, the analysis successfully showcased the method's capability to process large datasets and underlined the combined method's advantage over solely using GPT-3, as shown in Table 4.

Table 4 Results for dataset 3 in the form of confusion matrices (Color figure online)
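As a pointer for such enhancements, the pairwise checks are independent of one another and therefore trivially parallelizable; the following sketch using Python's standard library is our suggestion, not part of the evaluated implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def check_all(pairs, check):
    # check = ALICE's pairwise contradiction test; the LLM calls are
    # I/O-bound and therefore parallelize well with threads.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda pair: check(*pair), pairs))
```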

5.5 LLM comparison

We evaluated the datasets using not only GPT-3 but also GPT-3.5 turbo, GPT-4, and LLaMA by Meta. Our findings indicate that GPT-3.5 turbo had a limited performance, detecting only one contradiction. GPT-4 detected five contradictions, marginally underperforming compared to GPT-3. LLaMA detected three contradictions. The difference between GPT-3 and GPT-3.5 is particularly evident in question 4, ‘Are the actions mutually inconsistent?’, which involves two sub-questions. Sub-question 1: Does ‘must be set to true’ mean the same as ‘corresponds to the parameterisable value Ldn_C’? Sub-question 2: Could the following statements potentially contradict each other: ‘it must be set to true’ and ‘it must correspond to the parameterisable value Ldn_C’?

Table 5 depicts the divergent responses obtained from GPT-3, GPT-3.5, GPT-4, and LLaMA, with GPT-3.5 yielding a less satisfactory answer than GPT-3 and GPT-4. While GPT-4's response is akin to that of GPT-3, its computational time is notably higher; thus, we opted to employ GPT-3. LLaMA comprehends problems or inquiries accurately but frequently derives erroneous conclusions, generating increasingly imaginative responses.

Table 5 Different answers generated by GPT-3, GPT-3.5, GPT-4 (22.03.2023), and LLaMA

This outcome is likely not attributable to a lack of quality in the model but instead stems from an alternative objective for which the model was intended. An illustrative example of LLaMA’s propensity to stray further from the target with increasing response length is provided in Table 5. The first answer was accurate; however, subsequent elaboration led to irrelevant content. The initial response to the second question was inaccurate, but the ensuing explanation was correct. Therefore, one could even argue that the second answer is an oxymoron, as it contradicts itself. Notably, no definitive comparison between LLaMA and GPT can be drawn from this study. It is plausible that the model’s performance may vary depending on the nature of the input stimuli. Future investigations utilizing different input conditions may yield more informative insights into the comparative performance of the two models.

5.6 Criticality assessment

ALICE enables conclusions about the criticality of requirements and, to some extent, about development costs. Contradictions falling under the types Idem (e.g., \(A \Rightarrow x \stackrel{!}{=} c\) and \(A \Rightarrow x \stackrel{!}{=} k\)) and Simplex (e.g., \(x \stackrel{!}{=} c\) and \(x \stackrel{!}{=} k\)) are always considered real contradictions because either there is no condition or the conditions are identical. However, contradictions under the Alius type (e.g., \(A \Rightarrow x \stackrel{!}{=} c\) and \(B \Rightarrow x \stackrel{!}{=} k\)) are potential contradictions.

As noted in Sect. 5.3, it is a potential for contradiction as it is theoretically possible for the conditions to have been defined in a manner that precludes their parallel occurrence. While this paper does not address the automated detection of such scenarios, it remains a possibility for future exploration.

Furthermore, question 5, ‘Is there a condition?’, provides additional insight: Frattini et al. (2022) note a correlation between the existence of conditions, lead times, and the volatility of requirements. Fischbach's findings suggest that conditions with a strict semantic structure can lead to a more understandable requirement, which can subsequently facilitate the translation of the requirement into downstream artifacts, such as code or test cases.

5.7 Conclusion

After conducting a thorough evaluation, we now return to our initial research questions to contextualize our findings within the broader objectives of our study.

RQ1: How does ALICE compare to existing LLM approaches regarding accuracy and recall in detecting contradictions within complex specification sheets?

Our analysis reveals that ALICE demonstrates a notable improvement in both accuracy and recall over LLM-only approaches. As evidenced by the results for Dataset 1, ALICE achieved an accuracy of 72%, a recall of 75%, and a precision score of 83%, significantly outperforming the LLM-only method, which registered an accuracy of 47%, a recall of 32%, and a precision score of 80%. For Dataset 2, ALICE achieved an accuracy of 99%, a recall of 60%, and a precision score of 94%, whereas the LLM-only method achieved an accuracy of 97%, a recall of 0%, and a precision score of 0%. These findings underscore ALICE's ability to detect contradictions accurately, highlighting the effectiveness of integrating formal logic with LLMs.

RQ2: How does employing ALICE affect the efficiency and scalability of contradiction detection processes in real-world datasets?

The approximately 4-h-long evaluation of Dataset 3, comprising 3916 requirement pairs, demonstrates ALICE’s proficiency in managing large datasets efficiently with a precision score of 86%. Despite the inherently unbalanced nature of real-world datasets, ALICE’s methodology was adept at identifying the relatively few but critically important contradicting combinations. These results have been critically evaluated and validated as actual contradictions by the second author, who brings over two decades of experience at a major automotive OEM in Germany. His last role involved responsibility for concept design and component integration of various vehicles, underscoring his requirements engineering expertise (Knothe et al. 2006; Göhlich 2008). This showcases ALICE's practical applicability and scalability, marking a significant step in automating contradiction detection in engineering requirements.

6 Limits

Several limits to the method must be considered: on the one hand, limits of the LLM, and on the other, limits of formal logic.

6.1 Limits of LLMs for contradiction detection

LLMs lack adequate data to facilitate the identification of specific types of contradictions. Therefore, we consider the limits of general contradiction detection with LLMs.

Detecting fundamental contradictions is a straightforward task, and the corresponding reasoning can be effectively deduced, as exemplified in Fig. 6:

Fig. 6
figure 6

GPT-3 has enough knowledge to detect and explain the present contradiction

Nevertheless, LLMs do not accurately detect contradictions according to the definition in Sect. 2.1. It is worth noting that the model's response is not necessarily incorrect but rather inappropriate in the RE context. It is plausible that a human, unaware of the context, might respond similarly to GPT-3, as shown in Fig. 7.

Fig. 7
figure 7

GPT-3 is not able to properly detect the present contradictions. Firstly, the sentences are not conditions but mere sentences, and secondly, although the conditions can indeed be true at the same time, this would lead to contradicting effects

Another example of LLM shortcomings was mentioned in Sect. 5.2 as the false positive example concerning the opposite words below and above, which were not correctly classified. One way to address such shortcomings is to combine machine learning models, such as detecting opposing words from Li et al. (2017).

Furthermore, inconsistent usage of the words must and shall can lead to issues with LLM models. In some cases, the model did not recognize these words as synonyms and labeled the effects differently. In general, one should refrain from using both words in the same document when writing requirements (Sophist GmbH 2016).

Also, LLMs have trouble with strings that are too long. For example, a parameter name such as OpfsOpfsCnvrGfgsddhFTiOutWRTEForHGFoUftToGhfS cannot be analyzed, as it is a ‘long and complex string that doesn't provide any meaningful information to perform a character comparison,’ according to GPT. The string is analyzed correctly after inserting an underscore into the signal name, as in OpfsOpfsCnvrGfgs_ddhFTiOutWRTEForHGFoUftToGhfS.

Finally, when specifying requirements, using the phrase

it must be set equal to TRUE

in one instance and

it must not be manipulated

in another would not result in a contradiction. However, substituting the second instance with

it must not be set

would produce a correct contradiction detection. Although this may not be immediately apparent, employing precise and consistent language in technical writing is critical to prevent errors and misinterpretations. Thus, adhering to formal language conventions when dealing with NLP tasks in technical documents is highly recommended.

6.2 Limits of formal logic

Filler words can introduce errors in natural language processing tasks. For instance, consider the sentence ‘Amy and I both have to fight him,’ where the term both is treated as the variable instead of Amy and I. When comparing it to a second sentence, ‘Neither Amy nor I have fought against him,’ Amy and I would appropriately be identified as the variables. No contradiction would be observed in this case because the variables are considered distinct. Eliminating the filler word both enables accurate identification of the contradiction. These findings suggest that filler words should be treated cautiously in natural language processing applications to ensure accurate results.

Passive sentence constructions are prone to misinterpretation and should be avoided. For instance, the statement ‘There must be no manipulation of Com_Batt’ is better expressed as ‘The function must prevent manipulation of Com_Batt.’ Using active sentence structures enhances clarity and enables the algorithm to identify the variable performing the action. This aligns with general rules for writing requirements that aim to reduce ambiguity (Sophist GmbH 2016).

Improper placement of commas in complex sentences can result in erroneous condition detection, leading to inaccurate identification of variables and actions. As such, it is crucial to ensure that commas are placed correctly in technical writing. Failure to do so may introduce flawed interpretations and subsequent errors. Consequently, it is advisable to exercise caution when using complex sentence structures and to verify that commas are placed accurately to enhance clarity and precision.

The creation of neologisms should be minimized in technical writing. Where they cannot be avoided, adding them to a dictionary is crucial to prevent them from being flagged as out-of-vocabulary terms. Therefore, it is advisable to employ established terminology whenever possible.

6.3 Threats to validity

In conducting this research, we have identified and addressed several potential threats to the validity of our findings, guided by established frameworks in empirical research (Wohlin et al. 2012).

External validity: The study does not provide sufficient grounds to make claims about the generalizability of the results for non-formal requirements, as only formal requirements were analyzed. Focusing on formal requirements was driven by their prevalence and criticality in the domains studied. However, it should be noted that the basis of our investigation is generally applicable grammatical rules. Also, the analyzed data sets follow rules automakers generally apply when writing formal requirements (Ahmad et al. 2020; Dick 2017; Sophist GmbH 2016).

Further investigation and comparison with additional data sets are necessary to address contextual factors such as the domain, development process, requirements engineering techniques, and technologies used.

Internal validity: We have minimized researcher and confirmation bias through a standardized methodology based on validated grammatical rules, reducing subjective interpretation. Future enhancements could benefit from practitioner insights and further validation studies.

Construct validity: In our study, we have taken careful steps to ensure that the operational definitions used closely align with the theoretical constructs we aim to investigate, a critical component for maintaining construct validity and ensuring our measurements reflect what we intend to measure.

Conclusion validity: We have not further detailed the statistical methods employed to affirm the reliability of our conclusions. Robust statistical analysis is essential to verify that the relationships observed in our study are statistically significant and not due to chance, thus underpinning the integrity of our study’s conclusions. The potential for overfitting the methodology to the data is a legitimate concern, given the limited availability of a sizable dataset upon which the approach was originally devised.

By acknowledging these limitations and areas for improvement, we underscore our commitment to advancing automated contradiction detection in engineering requirements.

7 Conclusion and outlook

This paper contributes ALICE, a fully automated contradiction detection method for requirements written in controlled natural language that combines formal logic and LLMs. This hybrid method involves prompt engineering, i.e., designing and fine-tuning prompts to elicit the desired behavior from the model, such as identifying contradictions. The seven-step process is scalable and can be adapted to future advancements in the field, providing a modular way to evaluate the consistency of arguments.

The performance of ALICE and a purely LLM-based method for automated contradiction detection was evaluated in this study. On Dataset 2, ALICE achieved a higher accuracy (99%), recall (60%), and precision (94%) than the LLM-only method (97%, 0%, and 0%, respectively). These findings show that ALICE is more effective than LLM-only approaches in identifying contradicting formal requirements.

However, the limitations of both LLMs and formal logic in detecting contradictions in technical documents were also highlighted. LLMs may produce false positives due to the use of opposite words and lack adequate data to accurately detect specific types of contradictions. On the other hand, formal logic may be misled by filler words and improper placement of commas, and neologisms should be avoided to ensure accurate condition detection. To ensure precision and reduce ambiguity, it is recommended to use active sentence structures and established terminology in technical writing.

Combining different machine learning models, such as the detection of opposing words by Li et al. (2017), could address these shortcomings. Investigating tailored models should be a priority for future research, as our method lends itself to customization. On the one hand, various approaches to answering the seven questions should be explored; on the other hand, experimenting with different thresholds for string comparisons might prove beneficial, especially when dealing with datasets that adhere to different formulation rules. Also, we validated the method on requirements at lower, more detailed system levels. Although we believe it would also work at higher levels, because the reference dataset included such requirements, a thorough validation with actual higher-level requirements should still be conducted.

Our study focuses on detecting contradictions between pairs of requirements due to their significant impact. We recognize the potential for contradictions among multiple requirements. Addressing these would require advanced analysis methods. Future work could explore extending our methodology to tackle these more complex scenarios, enhancing contradiction detection in requirements engineering for complex system development.

Furthermore, using the presented method, it is now possible to create labeled datasets—which do not yet exist, as explained in Sect. 5. This could enable the potential use of individual ML models for contradiction detection in the future.

Alius contradictions are only potential contradictions because their conditions may be defined elsewhere such that parallel occurrence is excluded, as explained in Sects. 5.3 and 5.6. An automated analysis of such conditions was not within the scope of this paper; however, it is conceivable for the future.

A subsequent area of exploration involves integrating ALICE into the product development lifecycle. In another work (Gärtner and Göhlich 2024), we provide a comprehensive overview of embedding ALICE into product development processes, offering developers and product owners a practical tool for identifying and resolving contradictions during the development phase. This integration streamlines the requirements engineering process and significantly reduces the risk of costly reworks and project delays associated with unresolved contradictions. Future work will continue to refine these implementation strategies, further elucidating how industry professionals can seamlessly adopt ALICE to bolster the development of error-free, high-quality products.

In conclusion, advanced AI systems can perform sophisticated reasoning tasks by combining formal logic and LLMs. ALICE provides a promising approach to detecting contradictions in natural language text requirements specifications.