Empirical research on requirements quality: a systematic mapping study

Research has repeatedly shown that high-quality requirements are essential for the success of development projects. While the term “quality” is pervasive in the field of requirements engineering and while the body of research on requirements quality is large, there is no meta-study of the field that overviews and compares the concrete quality attributes addressed by the community. To fill this knowledge gap, we conducted a systematic mapping study of the scientific literature. We retrieved 6905 articles from six academic databases, which we filtered down to 105 relevant primary studies. The primary studies use empirical research to explicitly define, improve, or evaluate requirements quality. We found that empirical research on requirements quality focuses on improvement techniques, with very few primary studies addressing evidence-based definitions and evaluations of quality attributes. Among the 12 quality attributes identified, the most prominent in the field are ambiguity, completeness, consistency, and correctness. We identified 111 sub-types of quality attributes such as “template conformance” for consistency or “passive voice” for ambiguity. Ambiguity has the largest share of these sub-types. The artefacts being studied are mostly referred to in the broadest sense as “requirements”, while little research targets quality attributes in specific types of requirements such as use cases or user stories. Our findings highlight the need to conduct more empirically grounded research defining requirements quality, using more varied research methods, and addressing a more diverse set of requirements types.


Introduction
Requirements engineering (RE) is "the process of defining, documenting, and maintaining requirements" [1], where "requirements" designate real-world goals for, functions of, and constraints on systems [2]. The quality of requirements refers to the individual characteristics of requirements that both lead to a successful and cost-effective system, and solves the user's needs [3]. Requirements quality depends on multiple factors such as the requirements author, the stakeholders consulted [4], the templates and processes used [5], and the verification processes followed [6]. Requirements quality research focuses on specific attributes of requirements, including ambiguity, completeness, consistency, complexity, and verifiability [3,[7][8][9].
The 1994 Standish Chaos Report described the perceived importance of requirements on software project success and failure [10]. The report details project success factors such as a "clear statement of requirements" and "clear vision and objectives", as well as project failure factors such as "incomplete requirements" and "changing requirements and specifications" [10]. In 2007, Kamata and Tamai empirically validated the claims of the Standish Chaos Report through an investigation of 32 industrial software projects focusing on the impact of RE on quality, time, and cost calculations [11]. Their research found that a relatively small set of requirements have a strong impact on project success and failure, and projects within time and cost had an acceptable level of quality across all sections of the requirements documents. More recently, a 2017 survey of 136 organisations found that inconsistent, under-specified, and incomplete requirements are consistently rated among the top five reasons for project failures [12].
Requirements quality research addresses these aforementioned aspects in an attempt to increase project successes and limit project failures. The amount of research on this topic leads to key meta-questions such as which requirements quality attributes have primarily been investigated, which empirical methods have been used, and what kind of contexts (such as projects, participants, and documents) have been studied? Despite the broad interest in the field and the existence of multiple industry standards [8,9,13], to the best of our knowledge, no secondary study addressing the full breadth of requirements quality has been published. Pekar et al. provided a brief 4-page systematic mapping study in 2014 of requirements quality [14]. Heck and Zaidman conducted a systematic literature review in 2018 on quality criteria for requirements specifications [15], focusing only on agile software development. Zhao et al. conducted an SMS in 2020 on the use of natural language processing (NLP) for RE [16], limiting their search to articles utilising NLP. Although these secondary studies have approached the area of requirements quality, there is a lack of a general overview. Researchers looking to work on requirements quality and practitioners interested in understanding the state requirements quality have no single starting place in the literature.
To provide a comprehensive overview of the research on requirements quality, we performed a systematic mapping study (SMS) covering several contexts, quality attributes, quality improvement techniques, and RE activities. In particular, we collect empirical research on requirements quality that aims to understand or improve the quality of requirements artefacts. We retrieved 6905 articles from six academic databases. Applying multiple inclusion and exclusion criteria reduced the articles down to 105 primary studies for our SMS. Two of the authors read each of these articles and independently answered 18 questions, including who are the authors, what quality attributes are they addressing, what methods are they using, what RE activity are they targeting, and what tools have been created to assess and improve the quality of requirements. Our findings and discussions serve as a starting place for future empirical requirements quality research. Our full data set (articles, extracted data, and mapping scripts) is available for download in our replication package. 1 The remainder of the paper is structured as follows. Section 2 discusses existing meta-studies in the area of requirements quality. Section 3 introduces the research questions and methodological details of our SMS. Section 4 presents our findings, which we then discuss in Sect. 5. Finally, Sect. 6 outlines the study validity and Sect. 7 concludes the work.
Pekar at al. conducted an SMS that provides a generic overview of 67 studies, published between 1998 and 2013 [14]. They showed that requirements ambiguity, completeness, and correctness are the most investigated problems in RE research. They also show that the literature focuses on linguistic-based techniques for solving ambiguity, consistency checking, alternative perspectives from stakeholders, as well as general frameworks for requirements specification quality assessment.
Heck and Zaidman conducted an SMS that covered eleven primary studies, published between 2001 and 2014, investigating state-of-the-art research into quality of specifications in the context of agile software development [15]. They showed that the main qualities of specifications to take into account are completeness, uniformity, consistency, and correctness. For correctness, the practitioners' literature (e.g. [30,31]) proposed a set of quality criteria, the INVEST model [32], which is limited to user stories. Zhao et al. conducted an SMS in 2020 regarding the application of NLP for RE covering 404 primary studies published between 1983 and 2019, investigating several aspects such as technologies, practices, and activities, [16]. They found that most primary studies target the RE analysis activity, whereas validation and verification-for which quality plays an important role-are the least targeted. The authors show that specification documents are the most studied in the RE research dealing with NLP, while user feedback and other user-generated content are lately getting more attention. The main NLP tasks performed in the context of RE activities are detection, extraction, and classification of information within a requirement document, as well as traceability.
The NaPiRE survey conducted with 136 diverse organisations around the world collected evidence of the industrial relevance for comprehensive requirements quality [12]. The results show that inconsistent, under-specified, and incomplete requirements are regularly rated among the top 5 causes for project failures. Badly written requirements were also reported as one of the problems organisations are struggling with the most.
Our SMS gives a comprehensive overview of empirical research on requirements quality in RE by covering different quality attributes, research methods, and contexts.

Research questions
Our goal is to map the field of empirical requirements quality research to offer interested researchers and practitioners an overview of the state of research, and offer insights towards future work. With this goal in mind, we follow Kitchenham and Charters [33] in creating multiple broad research questions, Napoleão et al. [34] in creating research questions that consider only population and intervention questions, and Petersen et al. [35,36] who provided many examples of research question formats for SMSs. These considerations have led to the following research questions: RQ1. Empirical requirements quality research: Who is publishing in this field, where, when, and on which quality attributes? RQ2. Which empirical research methods are used to study requirements quality? RQ3. Which artefacts, activities, and tools have been studied in empirical requirements quality research?

Research method
To report on our research method, we follow the recommended structure of systematic reviews by Petersen et al [36] and Kitchenham and Charters [33]. This includes four phases: article search, article selection, data extraction, and mapping. We visualise the phases and activities in Fig. 1.

Article search
In the article search phase, we collected the primary studies for this SMS. This included two activities: database search and duplicate removal.

Database search.
The "database search" strategy is the most common search strategy in systematic reviews in software engineering (SE) [35]. We selected the following databases as primary sources: ACM Digital Library 2 , IEEE Xplore 3 , Elsevier ScienceDirect 4 , and SpringerLink 5 Additionally, we selected two indexing systems: Web of Science 6 and Google Scholar 7 We selected these sources using the recommendations of Dyba et al. [37], Kitchenham and Brereton [38], and Petersen et al. [35] in using ACM, IEEE, and two indexing systems. We expanded on their recommendation to include Elsevier and Springer as these four (including ACM and IEEE) "host the major journals and conference proceedings related to SE and RE" [16]. We targeted every study that addresses "requirements" and "quality". However, in a broad sense, all RE research has the aim of improving the quality of something, whether through understanding or intervention. For this reason, we require the primary studies to explicitly address the quality of requirements by mentioning "requirements quality" or "quality of requirements". All aspects of requirements quality and artefact types are of interest to this SMS; therefore, no restrictions were made in the search string.

Mapping
To keep the results in scope, our search string first includes the term "requirements engineering". Without this term, preliminary investigations revealed 3-10x more search results, most of which appeared to be from other research fields such as business and economics 8 . Second, we chose six variations of the term "requirements quality" to account for the power set of 1) synonymous for requirements (i.e. requirements, issues, and specification) and 2) two ways to refer to quality (X quality and quality of X). This resulted in the following search terms: "requirements quality", "quality of requirements", "issue quality", "quality of issues", "specification quality", and "quality of specification". The final search string was " (A and (B or C or D or E or F or G))" where A is "requirements engineering" and B-G are the six terms defined in the previous sentence. Table 1 shows the search strings we used to download the data from each source (depending on the allowed syntax), as well as the results counts. The search string was applied to the metadata in all six searches, and the full text in all but Web of Science. In total, we obtained 6905 articles from the data sources, as shown in Fig. 1.

Duplicate removal.
To handle duplicates, we used the Mendeley 9 "Check for Duplicates" feature which groups the articles into possible duplicate groups. The ultimate decision regarding the duplicate entries was manually resolved by the first author. In most cases, the title, authors, year, and abstract were all identical, leading to an obvious acceptance of the duplication. In cases of conflict between one of the aforementioned details, we retrieved the article and manually checked the details. We performed the duplicate removal activity on the 6905 articles from the database search activity, removing 2295 articles and leaving 4610 articles remaining.

Article selection
In the article selection phase, the first and second author (R1 and R2 10 ) applied the inclusion/exclusion criteria to refine the list of primary studies. The article selection phase has four activities: title selection, abstract selection, candidate selection, and PDF selection. These four activities reduced the number of articles to 3855, 3155, 256, and 216, respectively. Following our research objectives, our inclusion and exclusion criteria were as follows.

Inclusion criteria
I1 Peer-reviewed I2 Empirical Research • Surveys, experimental (including protocol studies), case studies, etc. ACM Digital ("requirements engineering" AND ("requirements quality" OR "quality of requirements" OR "issue quality" 106 Library OR "quality of issues" OR "specification quality" OR "quality of specification")) IEEE Xplore (("quality of issues" OR "requirements quality" OR "quality of requirements" OR "issue quality" OR "quality of specification" OR "specification quality") AND "requirements engineering") 682 ScienceDirect ("requirements engineering" AND ("requirements quality" OR "quality of requirements" OR "issue quality" OR "quality of issues" OR "specification quality" OR "quality of specification")) 251 SpringerLink ("requirements engineering" AND ("requirements quality" OR "quality of requirements" OR "issue quality" OR "quality of issues" OR "specification quality" OR "quality of specification"))

470
Web of Science TS=("requirements engineering" AND ("requirements quality" OR "quality of requirements" OR "issue quality" OR "quality of issues" OR "specification quality" OR "quality of specification")) 56 Google Scholar "requirements engineering" "quality of issues" OR "requirements quality" OR "quality of requirements" OR "issue quality" OR "quality of specification" OR "specification quality" 5340 Total 6905 9 https:// www. mende ley. com/. 10 There are no issues of authority between R1 and R2. 8 Discussion of requirements is popular, but discussion of requirements engineering is not We did this to filter search results which should not have been returned in the database search, such as non-English titles. This activity was only performed by R1 because it was a preliminary exclusion process, and the only articles rejected were obvious rejections where the title of the article described itself as a "secondary study", "thesis", or "vision paper". We performed this activity on the 4610 articles from the article search phase, removing 755 articles and leaving 3855 articles remaining. Abstract selection R1 applied the exclusion criteria to the article abstracts, removing obvious rejections based on information seen in the abstracts. This includes non-English articles, position papers, secondary studies, books, book chapters, and theses. Similar to the title selection activity above, this process was only performed by R1 due to the simplicity of the task (articles describing themselves as a "thesis" in the abstract are clear exclusions that do not require consensus). We performed this activity on the 3855 articles from the title selection activity, removing 700 articles and leaving 3155 articles remaining.
Candidate selection. R1 and R2 read the title and abstract of each article and then independently labelled each as "include", "exclude", or "unsure", following the guidelines of Petersen et al. [35]. R1 and R2 then reviewed the labels, settling disagreements through open discussions.
We evaluated the reliability of our labelling process using interrater reliability (IRR). We report per cent agreement [39,40], Cohen's Kappa [41], and S-Score [42]. Per cent agreement is reported for its simplicity, but the literature consistently warns against using it because it does not take into account the change agreement of the labelling task [41,43,44]. Cohen's Kappa (1960) [41] is reported for its common usage as an IRR metric; however, our data set suffers from the prevalence problem due to the nature of secondary studies having broad search strings, which can cause Cohen's Kappa to substantially misrepresent the IRR of a measure [43]. While some say "Cohen's Kappa should not be calculated in such a situation" [45], we show it for completeness as other researchers argue that bias and prevalence should be included when discussing IRR [46]. Finally, we report Bennett et al.'s S-Score [42] which presents a much more realistic IRR measure when dealing with prevalence at the extremes. To interpret our S-Scores, we refer to Regier et al. [47], as they have a granular and up-to-date interpretation. Table 2 details these three IRR measures for each stage of the candidate selection.
R1 and R2 performed two trial runs prior to the full candidate selection activity to align their understanding of the candidate selection activity. In the first trial run, they labelled 50 articles at random from the 3155 remaining articles, followed by an alignment discussion to resolve disagreements. The first trial run had an S-Score [42] of 67%, which is considered "Very Good" [47]. In the second trial run, the raters again labelled 50 random articles, followed by an alignment discussion. The second trial run had an S-Score [42] of 73%, which is considered "Very Good" [47]. Both raters agreed that the process was refined enough to proceed with a full run. R1 and R2 conducted the full candidate selection run. They labelled 3155 articles, with an S-Score [42] of 72.47%, which is "Very Good" [47]. They made a clear "include" or "exclude" decision regarding each article prior to the data extraction phase. The raters discussed all disagreements and came to a consensus for each article. There were three agreement strategies to assign the final labels. (1) For articles on which both researchers agreed and labelled "include" or "exclude", the agreed-upon label was accepted. (2) For articles on which both researchers agreed and labelled "unsure", they jointly agreed on "include" or "exclude" following an investigation of the article itself. (3) For articles with labelling disagreements, a process similar to (2) was conducted, taking into account the incoming bias held by each researcher. Regardless of the bias, they had an open discussion and agreed on final labels. In situations where they could not reach a confident and comfortable agreement, they chose the "include" label to allow for a more detailed 1 3 investigation in the following phases. We applied the candidate selection activity on the 3155 articles from the abstract selection activity, removing 2899 articles and leaving 256 articles remaining (see Fig. 1).
PDF selection. R1 removed obvious rejections based on information found while gathering the article PDFs, for example, downloading a 400-page PhD thesis. We performed this activity on the 256 articles from the candidate selection activity, removing 40 articles and leaving 216 articles remaining (see Fig. 1).

Data extraction
The data extraction phase was conducted by the first four authors. R1 extracted data from all of the articles, while each of the other three authors extracted from a subset of the articles (88, 60, and 68, respectively). This resulted in each of the papers being peer-labelled by two authors. Through this in depth process of extracting data, more exclusions were discovered. Many of these exclusions were due to a lack of empiricism when reviewing the full article. We performed the data extraction activity on the 216 articles from the Article Selection phase, removing 111 articles and leaving 105 articles remaining (see Fig. 1). This was the final reduction activity, thus arriving at our final number of 105 primary studies.
The process of reading the articles and extracting the data was done independently by each researcher. Once extracted, the researchers met to discuss the extracted questions, and resolve disagreements. The dispute resolution process involved opening up the original article, discussing the disputed question(s) in context, and coming to an agreement. To mitigate potential issues with fatigue, each meeting to  discuss extracted data and resolve disputes was no longer than 60 minutes, with a short break after 30 minutes. There are three data extraction categories that map directly on to the three research questions. Each data extraction category has a number of extraction questions, as listed in Table 3. For each extraction question, a coding style and extraction style were pre-determined based on the desired outcome of the extraction. The two possible coding styles are "open" (10/18 questions) and "closed" (8/18). For each closed coding style question, we created the set of possible labels in advance of the data extraction phase. The list of possible labels evolved during the extraction as we learned about the data. Any change in the labels resulted in re-labelling all previously labelled articles.
The three possible extraction styles are "stated" (8/18 questions), "interpreted" (5/18), and "stated & interpreted" (5/18). We extracted "stated" information verbatim from the article; such information must be explicitly written. A primary benefit from this extraction style is that we can clearly map when authors are explicitly not stating certain things. We extracted "interpreted" information through analysis of the meaning in the article; this information can be explicitly or implicitly written. Research articles are not written using strict templates, and therefore interpretation is necessary to extract most high-level or novel concepts. To extract "stated & interpreted" information, we first extract them as explicitly written, then we utilised thematic analysis to form high-level interpretations of the stated information. The reason for this combination of techniques is 1) to store and reference the original stated values in the replication package, and 2) to provide value to the reader in presenting our interpretation of the stated information.
We interpreted the data through thematic analysis, thereby producing a set of themes through which the reader can understand the data. The thematic analysis was performed by the first author, with the support of the other authors. Our thematic analysis approach followed the "integrated approach" [49] as we have a mix of both closed and open extraction questions. Our overall approach was more deductive than inductive [49] as even our open extraction questions were grounded regarding the labels, given the specific scope and target of the questions.
Who, What, Where, and When. To address RQ1, we created the first data extraction category, "who, what, where, and when". The extracted questions include the "authors", "publishing year", "venue", and "venue type". All of this information is stated either in the primary study or online at the publisher's website. Our SMS had no restriction on the type of venue (conference, journal, workshop, etc.) except that it must be peer-reviewed. The final two extraction questions, "research purpose" and "quality questions" were extracted directly from the primary studies, and required interpretation of the text to be extracted. "Research purpose" will either address the definition, improvement, or evaluation of the quality question(s). "Quality questions" are the specific types of quality being addressed in each article. We list the full details of these extraction questions in Table 4.
Methods. To address RQ2, we created the second data extraction category, "methods". The extracted questions begin with the "methodology", "ground truth", and "results metrics". "Methodology" designates the research strategy chosen by the authors in the investigated article to guide their research. "Ground truth" is the technique used by the authors to align their work with some observable truth. "Results metrics" refers to the metric that the authors used to describe their results. The next four questions are related to the participants of the primary studies. These four questions are "type of study subjects", "# of study subjects", "type of truth-set creators", and "# of truth-set creators". In principle, study subjects have been the subject under study, while truth-set creators are only used for their ability to identify the "truth" in some data set. For both the study subjects and the truth-set creators, we extract the type and number of these subjects used in each article. The final extraction question, "scientific rigour and industrial relevance", is a scoring system for the scientific rigour and industrial relevance of primary studies across a pre-determined set of criteria [51]. We list the full details of these extraction questions in Table 5.
Artefacts, activities, and tools. To address RQ3, we created the third data extraction, "artefacts, activities, and tools". The four extracted questions are "granularity of artefact studied", "type of requirements", "RE activity", and In the SE literature, there are three main research activities: problem investigation, solution validation, and implementation validation [48]. We map these three phases onto the empirical requirements quality research as articles that 1) seek to define requirements quality, 2) propose solutions to improve requirements quality, and 3) evaluate the research from other articles. As all articles in this SMS must contain empirical evidence (by our inclusion and exclusion criteria), evaluating definitions and improvements proposed in the same article are not labelled as "evaluation". Quality attributes The primary purpose of this study is to understand what quality attributes are being addressed within RE research. This open coding category was first extracted as stated.
Once the data had been extracted from all papers, thematic analysis was conducted to form codes and themes.
"requirements quality tools". "RE activity" is the stage of the RE process that the research is addressing. "Granularity of artefact studied" is the smallest unit of artefact that the research needs to be performed. "Type of requirements" describes whether the research addresses functional, nonfunctional, or both types of requirements. "Requirements quality tools" represents which requirements quality tool(s) were created or used in the primary studies. We list the full details of these extraction questions in Table 6.
Additional extraction questions. Additional questions were extracted during the data extraction activity, but they are not presented in this SMS. These questions are the title, DOI, method to address, data set used, algorithms applied, results, and contributions. These results can be found in our replication package.

Mapping
For the mapping phase, each of the research questions was answered using the extracted data. A process of cleaning and The methodology is "the general research strategy that outlines the way in which research is to be undertaken" [50].

Ground truth (Open Coding) (Interpreted)
In what way did the researchers align their work with some observable truth? This could be a labelled data set, manual qualitative investigation, etc.

Results metrics (Open Coding) (Interpreted)
What metrics did the authors use to describe their results? Precision, recall, f-measure, accuracy, time, etc.

Type and # of study subjects (Open Coding) (Stated & Interpreted)
What type of study participants did the researchers study, and how many were there? These participants must have been the subject under study, or tightly coupled with the context of the study such that the results depend on the participants themselves. We extracted this category as stated, and then interpreted it to allow alignment of synonyms used.

Type and # of truth-set creators (Open Coding) (Stated & Interpreted)
What type of truth-set creators did the study involve and how many were there? Truth-set creators are used in the creation of truth sets to align the research, empirically, with some observable truth (e.g. the creation of labelled data sets). Truth-set creators are not study subjects because they are not under study; rather, the algorithms are under study. Research where the truth-set creators are also studied (e.g. the labels are studied with respect to the creators) are accounted for as both truth-set creators and study subjects.

Scientific rigour and industrial relevance (Open Coding) (Stated & Interpreted)
Ivarsson and Gorschek created a model for evaluating scientific rigour and industrial relevance in technology evaluations [51]. This scheme starts with scoring scientific rigour as 0, 0.5, or 1 across three sub-types: context, study design, and validity. There is then a score for industrial relevance of 0 or 1 across four sub-types: subject, context, scale, and research method.
The RE phases extracted for this SMS are elicitation, specification, analysis, validation, verification, management, and maintenance [52]. It is common to see "analysis" and "validation" grouped together, as well as "management" and "maintenance" [52]. In this SMS, we separated them so they can be individually labelled depending on how the authors of the primary studies chose to describe their work. Granularity of artefact studied (Open Coding) The process of improving the quality of requirements usually involves the input of some artefact to be improved. For this SMS, we are interested in the granularity of these input artefacts: words, sentences, requirements, documents, etc. Type of requirements (Closed Coding) The types of requirements extracted for this SMS are functional and non-functional.

Requirements Quality Tools
Which tools were created or used in the primary study? Which quality attributes were addressed by the tool, is there a live link to this tool, what license exists, and who can use it?
organising was applied to the data, followed by producing a number of figures and tables to visualise the results. The data were analysed and summarised using descriptive statistics and frequency analysis. As a result, there are distinct types of figures displayed: bar charts, stacked bar charts, box plots, and maps (also known as categorical bubble plots). The original data (and the associated codes and themes from the thematic analysis) are available in our replication package. The activity of mapping the data was performed primarily by R1, with close consultation with the other authors.

Who, Where, and When
The primary studies were published across 25 years, from 1996 (January 1) until 2020 (March 27), which is the last day articles were downloaded. With the exception of three years, there was a primary study published every year. There is a general increase over time, with a noticeable increase from 2005 to 2014, followed by a period of highs and lows until 2020. Figure

Research purpose and quality attributes
Empirical requirements quality research seeks to define or improve the quality of requirements, or evaluate others' work on requirements quality. In our sample, 14 primary studies presented a definition, 79 studies proposed a solution for improving requirements quality, and 14 studies performed an evaluation of an existing solution. Figure 5 shows the frequency of each research purpose in the studies (studies can address multiple purposes).
Research into requirements quality is characterised by the research goal of addressing one or more quality attributes of requirements. We first extracted the quality attributes as stated in the primary studies, which lead to a list of over 250 unique quality attributes. R1 then performed thematic analysis that lead to a list of codes and themes, which were reviewed and modified through discussion with R2 and R3. Combining similar names and concepts lead to 111 unique quality attribute codes, and grouping those codes lead to 12 quality attribute themes. Figure 6 shows the full breakdown of quality attribute themes and codes. Some quality attribute themes break down into many codes, such as the "ambiguity" theme with 32 codes, where as other themes had no breakdown into codes at all, such as "traceability". From this point on, we will refer to "quality attribute themes" as "quality attributes". Figure 7 shows the number of primary studies addressing each of the 12 quality attributes. The quality attributes are not mutually exclusive, and we found that 55% (58/105) of the primary studies address more than one quality attribute. Quality attributes were addressed across the 105 primary studies a total of 243 times. Not captured in Fig. 7 is that some primary studies addressed the same quality attribute more than once-e.g. by working on multiple types of ambiguity. This duplication of themes is not included in our analysis as it unnecessarily complicates the comparison between quality attributes. Full details are available in our replication package. Figure 7 shows that ambiguity is the most researched quality attribute, representing 20% (49/243) of the researched themes. The first three quality attributes account for 54% (131/243) of all quality attributes addressed. The tail of the graph is fairly consistent at around 12 primary studies addressing each of the remaining themes. "Undefined", on the far right at 3% (8/243) of themes, represents primary studies that did not specify what kind of quality they were researching. These findings are in alignment with two secondary studies that also found that ambiguity [14], completeness [14,15], consistency [15], and correctness [14,15] are the main quality criteria the scientific literature is addressing.
The bubble plot in Fig. 8 shows the interplay of quality attributes and the research purpose of the investigated primary studies. The X and Y axes have categorical values on them, producing a grid-like visualisation. Another dimension, visualised as the size of each bubble, is the number of primary studies where the intersection of the X and Y axes applies. For example, there are 37 studies that have the research purpose of "improvement" and address the quality attribute "ambiguity".
Comparing Figs. 7 and 5 with the bubble plot in Fig. 8, there is a uniform distribution across both dimensions. For example, the research purpose "improvement" is addressed across all primary studies, as shown in Fig. 5. Additionally, "improvement" is addressed most in each individual quality attribute, without exception, as shown in Fig. 8. Looking at individual data points, there are two notable results. First, the quality attribute "understandability" has zero primary studies addressing the research purpose "definition". Second, the top three quality attributes (ambiguity, completeness, and consistency) have substantially more primary studies under "improvement" than "definition" or "evaluation".

Methodologies, ground truth, and results metrics
Empirical requirements quality research is conducted primarily using experiments and case studies, as shown in Fig. 9, which accounts for 93% (102/110) of methodologies used (studies can use more than one methodology). 11 In SE research, empiricism is measured in many ways, including changes in data, methods, and algorithms. At the heart of this empiricism, some observable truth must be established. Our primary studies established their ground truths in five different ways (Fig. 10): • Pre-intervention annotation: The data already exist, and the researchers annotate those data before the intervention. • Post-intervention annotation: The data already exist, and the researchers annotate those data after the intervention. • Existing truth set: Utilising an existing truth set from previous work. • Data set creation: Creating the data, explicitly embedding the truth in how it is created. For example, creating use case descriptions with missing steps, where the missing steps are identified by an intervention. • Manual investigation: The data already exist. The researchers investigate the resulting output or consequences following an intervention on that data.
The number of ground truth methods per primary study can be two or more as the studies presented in some articles use multiple methods. Of all ground truth methods used in the primary studies, pre-intervention annotation and postintervention annotation were used 75% (88/117) of the time.
Existing truth set and data set creation appear in 9% (12/117) and 8% (9/117) of the studies, with the final method being manual investigation with 7% (8/117). Figure 11 shows how each primary study reported their results. Across all primary studies, 34% (36/105) used precision, 32% (34/105) recall, and 13% (14/105) used F-measure. Accuracy was used in 8% (8/105) of primary studies; however, it was often used as a general term, without an accompanying definition. 46% (48/105) of the primary studies reported on "frequency", whether that be of an event, an outcome, a result, etc. Lastly, time was measured and reported in 13 primary studies.

Participants
We found that participants in empirical requirements quality research were being used for two distinct reasons: as "study subjects", but also as "truth-set creators". Study subjects are participants that were studied by the research, while the truth-set creators are participants tasked with labelling data in order to empirically ground the research results. This distinction is important, because participants can be involved in studies in different ways, therefore influencing threats to validity in different ways. For example, authors of primary studies commonly act as their own truth-set creators, but they should very rarely study themselves as study subjects (except in such cases as action research).
We identified four types of participants, namely industry, student, academic researcher, and not specific. "Not specific" means the authors did not describe their participants clearly, or at all. We also include a special category, "no participants", when no participants were used in the article. Figure 12 visualises the number of primary studies that utilised study subjects and truth-set creators according to the four types of participants we identified in the primary studies of this SMS. As primary studies can use more than one type of participant, only the "no participants" label is mutually exclusive. Using the "no participants" label, we know that 44 primary studies (105-61) used study subjects (42%) and 91 primary studies (105-14) used truth-set creators (87%). We found that 94% (99/105) of primary studies used at least one type of participant. Figure 13 maps the quality attributes against the types of participants. On the left side of the figure, the quality attributes are mapped against study subject types, whereas on the right they are mapped against truth-set creator types. The "no participants" column in both shows the number of primary studies that did not use participants. For example, 29 primary studies investigated ambiguity and did not use study subjects.
Each type of participants (industry, academic, student, and not specific) has a number of sub-types (Fig. 14) that we identified during the extraction phase. Primary studies may utilise more than one participant type or sub-type; we only report on the utilisation per article, not the multiplicity of utilisation per article. For example, if a single primary study utilises "industry -manager" and "industry -developer", then it is only counted once under Fig. 12 for "industry", but both are considered in Fig. 14.
Visualising the sub-types of participants highlights additional details in the data. Most importantly, it highlights the " -Unknown" sub-type for each of the types, which denotes that the authors did not fully describe their participants. For example, for primary studies that reported using industry as participants, 47% (25/53) of them only stated "industry", with no further details. Details are lacking in 40% (17/42) of the primary studies using students as participants, and 10% (4/41) of primary studies using academic researchers. Included in this idea of "unknown" is reporting participants as "experts", which has no intrinsic meaning and was used in 7% (6/91) of truth-set creator primary studies. In total, 43% (43/99) of the primary studies describe utilising unknown or expert participants.
Finally, the sub-type "not specific -Not Mentioned" is a label only applied if there was no mention at all of who the study participants were. This accounts for 5% (2/44) of primary studies utilising study subjects, and 22% (20/91) of primary studies utilising truth-set creators. In total, 21% (21/99) of primary studies described utilising participants with no further details. Figure 15 visualise the number of participants utilised per study subject and truth-set distinction, as well as their respective types. Displayed for each set of data is a box plot with the max, min, first quartile, and third quartile. Note that the y-axis is log-scale to highlight the extremes of both the high and low values. We know that there are 44 primary studies that utilise study subjects and 91 primary studies that utilise truth-set creators; however, not all primary studies that utilise participants report the number of participants. Five primary studies that utilise study subjects and 54 primary studies that utilise truth-set creators did not report how many participants were involved. As a result of this missing information, we report the number of participants for 39 study subject primary studies and 37 truth-set creator primary studies.
The first box plot in each figure is the summary of all data across the types, for each of study subjects and truthset creators. We know from Fig. 12 that double the number of primary studies utilised truth-set creators (91) over study subjects (44). In contrast, we can see from Fig. 15 that the study subject sample sizes are four times larger than truth-set creator sample sizes. 12 The last four box plots in each of Fig. 15 represent types of participants. Not all data about participants per type can be displayed here due to some primary studies utilising multiple types of participants, and not distinguishing their respective numbers. Consequently, during data extraction, we summarised the numbers for each primary study that uses more than one sub-type of participant. As a result, 33% (13/39) of the primary studies with study subjects were unusable, leaving us with 26 primary studies from which we could extract the participants sub-types. For truth-set creators, 19% (7/37) of the primary studies were unusable, leaving 30 primary studies to be analysed for their participants' sub-types.
The primary result from the sub-types analysis is that students are used in larger groups, much more than industry and academic researchers. This is true for both study subjects and truth-set creators, and more pronounced in the former. For [TSC] student, the first and second quartiles are over 2.5 times higher than industry and academics, and the third quartile is over four times higher.

Scientific rigour and industrial relevance
Ivarsson and Gorschek [51] outlined a method for evaluating scientific rigour and industrial relevance using a number of criteria (see Table 5). Figure 16 visualises the rigour and relevance values extracted from the 105 primary studies.
The left three bars of Fig. 16 show scores for the scientific "rigour" of the primary studies (either 0, 0.5, or 1). More than half of the primary studies (55) received a score of Fig. 16 Ratings of scientific rigour and industrial relevance for every article [51] (N=105 for each bar) 1 for [rigour] context, and most of the remaining primary studies received a 0.5, with only 13 primary studies receiving a score of 0. There are similar results for [rigour] study design, which shows an overall good quality when it comes to reporting the scientific rigour of the studies. However, [rigour] validity has almost 50% 0's, which means 50% of studies did not report any threats to validity.
The right four bars of Fig. 16 show scores for the industrial "relevance" of the primary studies (either 0 or 1). Overall, there is a low industrial involvement in the analysed primary studies. This could be due to a low interest of industrial practice in some of the studied quality aspects, or it could also be due to our exclusion of non-empirical research such as experience reports which can contain industrial relevance.
[Relevance] subject and [relevance] scale show the highest values, as each have 41 primary studies that were evaluated as a 1 (39%).
[Relevance] research method had 32% (34/105) 1's over 0's, and [relevance] context had the lowest overall score with 17% (18/105). Figure 17 maps the scientific rigour and industrial relevance values against the quality attributes. The purpose of this map is to provide an in-depth view of how the different quality attributes scored across the different scientific rigour and industrial relevance dimensions. As detailed above, the scientific rigour and industrial relevance extraction category applies ratings of 0, 0.5, or 1 to each of the seven sub-types of quality as per the work of Ivarsson and Gorschek [51]. To translate the distribution of ratings shown in Fig. 6 into single values, we have averaged the ratings for each quality attribute. For example, the quality attribute "understandability" is addressed in 12 primary studies and the ratings given for [rigour] validity are six 1's and six 0's, which averages to 0.5 (6/12) as shown in Figure 17.
Looking down the columns of the map (each of the rigour and relevance attributes), there are a few notable findings. All three scientific rigour attributes are fairly consistent across the quality attributes, with [rigour] validity showing some variance in lower values such as consistency, correctness, reusability, traceability, and redundancy. This is similarly true for [relevance] research method which has fairly consistent scores. The remaining industrial relevance subtypes have anomalies, such as [relevance] subject and [relevance] scale that have a very low or zero score for "undefined".
[Relevance] context has the most variance, showing low values for many of the initial and final quality attributes, and mid-tier values for attributes such as reusability, relevancy, verifiability, and traceability.

RE artefacts
We identified the smallest granularity of RE artefact that is required to perform the research. For example, if an algorithm performs analyses on one sentence at a time, then the type of RE artefact is "sentences". In another article, if the analysis requires a set of requirements (such as for duplicate requirement detection), then the type of RE artefact is  "requirements". Additionally, we extracted "functional" vs "non-functional" labelling of requirements in the primary studies. Figure 18 shows the identified types of RE artefacts, ordered by their evolution through the RE process from elicitation to maintenance 13 . The majority of primary studies, 67% (69/105), utilise the term "requirements", without being more specific. We found one study that reported on an elicitation scenario, 14 three that use word-level analyses, and eleven that utilise sentence-level analyses. Looking at the types of requirements artefacts, 9 primary studies focused on use cases, 15 on documents, and 2 on design artefacts. The design artefacts were a system model and mock-ups, both used to communicate perceived requirements captured by system design.
For the types of requirements addressed by each primary study, 32 addressed functional, 3 non-functional, 17 both, and 53 studies did not state the type. Half of the studies did not explicitly state whether the studies they reported dealt with functional or non-functional requirements.

RE activities
When conducting RE research, a specific RE activity is often targeted. Figure 19 shows the extracted RE activities from primary studies, as stated, from a closed set of labels. Figure 20 maps the quality attributes against the different RE activities. As the RE activities are ordered chronologically, the bubble plot provides an overview of how the focus on quality attributes is distributed over time (activity). Primary studies about the first four attributes (ambiguity, completeness, consistency, and correctness) are targeting the earlier RE activities, with peak attention on the analysis activity, and very little attention on the maintenance activity. Primary studies about complexity and traceability, in contrast, are more focused on later RE activities. Primary studies about some attributes such as reusability and verifiability are much more evenly spread across the RE activities.

Tools
Across the 105 primary studies, 43 created or used a previously created tool for the purpose of requirements quality. In total, we found 41 unique tools as listed in Table 7. For each tool, we list the quality attributes addressed, the primary studies and corresponding years, as well as the link to the tool, the licence, and who can access the tool. Only 15 tools could be found online. Figure 21 visualises the 41 unique tools according to the quality attribute that they address. Tools can address more than one quality attribute. The attributes in Figure 21 are ranked by their frequency in the primary studies. The top three quality attributes are addressed by a large percentage of the tools: ambiguity at 44% (18/41), completeness at 51% (21/41), and consistency at 44% (18/41). The remaining quality attributes are all addressed about the same number of times, at a median of four times each.

Empirical requirements quality research (RQ1)
There is a clear preference in empirical requirements quality research to offer improvements over definitions or evaluations (Fig. 5). While the logical research progression for any topic would be to define, then improve, and then evaluate, empirical requirements quality research has largely focused on improvements. It could be argued that this research field has reached saturation of definitions for each quality attribute and therefore has no need for further definitions. However, Fig. 8 only shows the 12 quality attribute themes, of which there are 111 unique quality attribute codes (see Fig. 6). For example, the quality attribute "ambiguity" has 32 unique types of ambiguity, and only six empirical definitions. Each of those quality attribute codes is likely a complex research topic that should be understood empirically. One reason for such a lack of definitions could be our exclusion of standards, books, and non-empirical research papers; however, our goal was to map requirements quality as understood empirically to present an evidence-based perspective. We believe future requirements quality research should focus on empirically understanding and defining the quality concepts being addressed. This could be done, for example, through interview studies with practitioners or analysing legacy requirements documents to characterise various quality attributes.
There is a preference in empirical requirements quality research to investigate ambiguity, completeness, and consistency, with correctness not far behind (Fig. 7). This could be due to the IEEE standards [8,9,13] in which these attributes are listed first 15 . This could also be due to the concept of "The Three Cs of Requirements" in which Zowghi and Gervasi argue that "there is an important causal relationship between consistency, completeness and correctness" [53]. The IEEE 29148-2011 and -2018 standards each have a separate section dedicated to "requirement language criteria" which lists "types of unbounded or ambiguous terms", potentially explaining why ambiguity has received such attention. From a requirements perspective, ambiguity could be interpreted as in direct conflict with the concept of documenting the formal needs for a system. This has largely been the motivating argument for primary studies that investigate ambiguity (e.g. [54][55][56][57][58]). However, there is also a discussion around the potential importance of ambiguity in requirements, particularly in the early activities of RE, where ambiguity could be "a powerful tool to discover tacit knowledge during requirements elicitation" [59]. Overall, there is a preference for researching these four primary quality attributes, and future research is needed on the others.

Research methods used (RQ2)
We observed a clear preference in empirical requirements quality research for case studies and experiments (including controlled experiments), as shown in Fig. 9 where these methodologies account for 93% (102/110) of all methodologies applied. A secondary study by Höfer and Tichy in 2007 also found that experiments and case studies were the top used research methods in empirical software engineering (SE) research: 37.6% of the primary studies report experiments, 28.6% case studies, and 33.8% other methods [60]. Our findings show a less diverse use of research methods in empirical requirements quality research, with 54.5% experiments, 38.2% case studies, and 7.3% other methods-i.e. a drop of 26 percentage points in "other" methods compared to Höfer and Tichy [60]. Empirical requirements quality research would benefit from more diversity given that each method has unique benefits and trade-offs [61].
We also found a clear preference of pre-intervention annotation and post-intervention annotation as sources of truth (73% in total as shown in Fig. 10). This manual work is often necessary, but over time should lead to shared truth sets in the community. Future empirical requirements quality research should aim to reuse truth sets to reduce the research workload, and align the research being conducted with comparable results.
There is also a clear distinction between study subjects and truth-set creators. Such distinction is from our interpretation of the reported research, and not explicitly described 15 "Correctness" is not listed in 29148-2011, but was again added in 29148-2018.

3
in the primary studies. We see a benefit in understanding the differences, selecting the appropriate type, utilising those differences in the research, and reporting them explicitly. Another finding from our investigation of participants in empirical requirements quality research is that many primary studies do not describe who their participants are in enough detail 43% (43/99), or at all 21% (21/99). The use of participants is necessary to validate research, but importantly, who the participants are and why they are being utilised is key to that validation. Our results show a deficiency in the reporting of participants, and this should be addressed in future empirical requirements quality research.
Our results suggest that empirical requirements quality research has fairly good scientific rigour, but low industrial relevance (Fig. 16). [Rigour] context and [rigour] study design have the strongest ratings at 70% for context and 72% for study design. Notably, half of the primary studies have a 1, and very few primary studies have a 0, which shows a high commitment to detailing the context and study design. Scientific rigour suffers the most from a lack of reporting the threats to validity. Figure 16 shows that for validity, primary studies either describe it in full, or do not mention it at all. Industrial relevance has the most room for improvement. Most notably, only 18 out of 105 primary studies report on research validated in an industrial context. Additionally, cross-checking the extracted data shows that half of those 18 primary studies did not use study subjects, meaning that they only utilised the industrial context for the labelling of data. Accordingly, only 9% (9/105) of primary studies applied their work in context. As described by Nuseibeh and Easterbrook: "modelling and analysis cannot be performed adequately in isolation from the organisational and social context in which any new system will have to operate" [62]. There is a need for empirically grounded requirements quality research in context, particularly in the industrial context.

Artefacts, activities, and tools (RQ3)
Empirical requirements quality research is primarily targeting the analysis activity, with a gradual drop off in either direction as visualised in Fig. 19. The mapping of RE activities against quality attributes in Fig. 20 reveals granular insights into these activities. Analysis, for example, shows an increased number of primary studies for the last four of five quality attributes. There is a deficiency of research in the management and maintenance activities. Overall, there is room for researchers to investigate RE activities more evenly, focusing their efforts on the initial and final activities of the RE process.
A majority of empirical research into requirements quality (67%) focused on the generic idea of "requirements", without considering other more specific forms. This gap shows much room for novel research on elicitation scenarios, use cases, and user stories, among many other types of requirements. The fact that we do not see more nontextual requirements types in this study might be because there has not been an explicit effort to improve the quality of those types of requirements. Alternatively, it could be because researchers working on, for example, requirements models did not explicitly use the term "quality" in their studies. 16 Future empirical requirements quality research should address a more diverse granularity of requirements artefacts, including alternative forms of requirements such as use cases, user stories, or feature requests in issue trackers.
As for functional vs non-functional, it appears that the preference is to research functional aspects of requirements over non-functional. This is likely due to the specificity of functional requirements, which opens them up to greater scrutiny when considering their quality. We recommend future research takes a closer look at non-functional requirements and how quality can be defined, improved, and evaluated.
The RE research community developed 41 requirements quality tools (15 of which are available online), which is a good sign of solution-based research. However, the majority of these tools are created new, and never used again. There is a strong need for empirical requirements quality research to publish archived versions of tools for future researchers to build on. The majority of requirements quality techniques being developed and researched do not need to develop their own tools from the ground up. An example of this is how researchers commonly visualise their quality recommendations as annotations: this could exist as a standard tool that requirements quality researchers use to visualise their output.
With respect to quality attributes, a majority of the tools cover ambiguity, completeness, and consistency. For researchers looking to conduct empirical requirements quality research into these quality attributes, it is highly recommended to look into existing tools first.

Research reliability
Reliability is defined as the degree to which different researchers, following our methodology, would produce similar results [63]. We take inspiration from Zhao et al. [16] in structuring our reliability section by primary phases of our SMS.
Ensuring the reliability of the article search is important for gathering all relevant articles. We followed the recommendations provided by Petersen [35] and Kitchenham [38] in designing the search phase, as well as in selecting the sources [16,37,64]. The primary study search phase consists of applying the search on databases, duplicate removal, and title skimming. For our database search, we used the indexing services Google Scholar and Web of Science. To increase reliability in our results, we also collected articles from the primary publisher databases of RE-related research (IEEE, ACM, Elsevier, and Springer) [16]. By employing the most common search strategy in systematic reviews in SE [35], we believe we mitigated any significant selection biases in the search strategy.
The reliability of the article selection phase is important to prevent false-negatives and avoid removing important articles. For the candidate selection activity, the articles were independently peer-labelled, and disagreements were carefully discussed according to the guidelines of Petersen at al. [35]. If there was uncertainty towards an article after discussion, the default decision was to include it, thereby mitigating the threat of false-negatives. The title selection, abstract selection, and PDF selection activities were performed by R1 alone due to the simplicity of the task. While the above three activities are grouped together with the candidate selection activity in Fig. 1, they are completely different tasks. The aforementioned three activities were a matter of visually identifying clear issues, whereas the candidate selection activity was a matter of academic inquiry and discussion. To increase reliability of these solo tasks, any uncertainty towards the inclusion of an article leads to it being included and therefore being kept for an upcoming peer-involved task.
The reliability of the data extraction phase is important to ensure the correctness of the extracted questions. The first four authors-without any hierarchy among them-carefully conducted the data extraction phase, with two authors independently doing a full-read of each article. Following the extraction, the researchers met to discuss and agree on the extracted questions. The fifth author oversaw the process and was included in the discussion in case of an ambiguous issue. We performed open-coding and thematic analysis on a few extraction questions, as described in Sect 3.2.3. The original data and labels are available in our replication package.
The reliability of the mapping phase is important to ensure that the interpretation of the data is objective and aligned with the original intent behind the extracted data. All five authors were involved with the interpretation of the data, and transformation of the raw data to be reported in the form of descriptive statistics, tables, and figures. This checking between authors leads to several changes in how the data were transformed and interpreted based on internal feedback rounds. Additionally, the original data, tables, and figures are available in our replication package.

Threats to validity
Construct validity The primary threat to construct validity is in the data extraction phase. Agreement among the authors was reached before beginning the full extraction run, and a trial extraction run was conducted for each researcher involved in this activity to align their understandings of the questions to extract. This effort resulted in changing extraction question names, descriptions, coding styles, and extraction styles to meet the understanding of the researchers. Thereafter, the process of extracting the questions involved very few changes during the full run.
Internal validity The design of this SMS follows the recommendations of standard articles outlining how to conduct an SMS in SE research, and we carefully followed those recommendations. The data presented are available in our replication package to be checked by anyone interested, increasing the likelihood that critical errors in reporting (if any) can and will be caught.
External validity Systematic studies are designed to be representative across the dimensions of the research area being investigated. While we are unable to fully exclude the potential threat of possibly missing primary studies, we have done our best to ensure the rigour of our work and thus feel confident about the generalisability of our claims in the specific research area of empirical research on requirements quality. When applying our results beyond this specific scope, caution must be used. While sub-communities can be representative of their broader community, this claim cannot be made without evidence sampled specifically from that broader community. We leave it to tertiary research to take our claims about requirements quality and look for similar patterns in the broader RE-and potentially SE-communities.

3 7 Conclusions
This SMS offers an overview of the field of empirical research on requirements quality. We analysed the quality attributes researched in previous published work, the research methods used, and the studied RE activities, artefacts, and tools. Our detailed reporting and replication package can be used to understand cross sections of the data.
We found a broad set of researchers publishing at specific venues such as RE, REFSQ, and REJ. The majority of primary studies focus on researching improvements to requirements quality, with very few focusing on definitions or evaluations. Most frequently studied requirements quality attributes are ambiguity, completeness, consistency, and correctness.
We found that empirical requirements quality researchers primarily use experiments and case studies. For participants, we found a distinction between study subjects and truthset creators. More studies utilised industry participants than students; however, the number of participants in student studies was three times higher than industry studies. We also observed an issue with the number of primary studies reporting incomplete information about their participants. Finally, the scientific rigour and industrial relevance analysis revealed that rigour is quite good (with the exception of validity), but relevance is low.
As for the particular RE focus, we found that empirical requirements quality research has primarily focused on "requirements", with little attention given to other artefacts such as use cases, user stories or issues in issue trackers. The focus was also on the specification and analysis activities, with less attention on elicitation, validation, and verification, and very little attention to management and maintenance. Finally, we found 41 unique requirements quality tools, many of which contain no reference on how to find them and only three of them contained a visible license.
Our thorough results and detailed discussion serve as a starting point for future empirical requirements quality researchers. We hope that future research will take our insights into consideration as this field continues to grow.
Supplementary information All supplementary information can be found in our replication package 17