1 Introduction

Software tests encode important knowledge about typical usage scenarios and inputs, corner cases, exceptional situations, as well as the intended output and behavior of the software system. Consequently, test cases play an important role in assuring software quality and in the evolution of software systems, serving as a knowledge source that supports communication within the team and with customers (Latorre 2014) as well as specification and documentation (Ricca et al. 2009). At the same time, the evolution of software systems requires that test cases are frequently updated and extended, resulting in effort for corresponding test case evolution and maintenance activities (Moonen et al. 2008; Zaidman et al. 2011).

Test code (Garousi and Felderer 2016), i.e., the form in which executable automated tests are commonly available, is the basis for many downstream activities such as maintaining and refactoring tests, locating faults, debugging, analyzing and comprehending test results, repairing broken tests, or dealing with flakiness. In all these scenarios, developers and testers have to repeatedly read and understand test code – a usually time-consuming manual task, which makes readability and understandability critical factors when it comes to the quality of a project’s test cases (Kochhar et al. 2019; Setiani et al. 2021a).

Readability as well as legibility and understandability of source code have already been subject to a series of empirical studies, which were recently examined in a systematic literature review by Oliveira et al. (2020). Test code has many properties in common with source code of software programs, and tests are often written using the same programming languages as the system under test. Nevertheless, the development of test code also shows significant differences when compared to other code. There exist dedicated frameworks and patterns for implementing test code (Meszaros 2007) and, furthermore, tools for generating tests (e.g., Evosuite, Randoop, IntelliTest) are becoming increasingly popular (Ramler et al. 2018).

In this paper, we focus on investigating the readability of software test code by combining scientific and practical views.

In a first step, we built on the Systematic Mapping Study (SMS) approach (Petersen et al. 2015) to identify characteristics, influence factors, and assessment criteria that have an impact on the readability of test code. In a second step, we complemented the mapping study with grey literature to include practical views. Based on the identified influence factors, we conducted a controlled experiment (Wohlin et al. 2012) to investigate the perception of readability and understandability in an academic environment.

In an initial mapping study (Winkler et al. 2021), we reviewed the scientific literature dedicated to the readability of test code, exploring (a) the demographics of the body of knowledge, (b) the characteristics of the studied test code, and (c) the factors that have been shown to impact readability. We analyzed 19 scientific studies filtered from several hundred search results and identified a set of 9 influence factors that have been investigated in academic work either individually or as part of comprehensive readability models. Our mapping study covers the topic of test code readability specifically from the viewpoint of work published in the scientific literature. However, test code readability is of high practical relevance, and the topic is therefore also frequently covered in magazine articles, books on testing, and online blogs. These sources are typically referred to as grey literature (Garousi et al. 2019). Based on our previous work (Winkler et al. 2021), the goal of this work is to explore test code readability in its entire scope by combining the scientific and the practical viewpoint. We therefore conducted an additional grey literature survey to identify influence factors commonly discussed in practice and mapped the results to the findings from the previous scientific literature study. Finally, we investigated the newly identified factors in a controlled experiment (Wohlin et al. 2012) comparing the readability of different versions of a selected set of test cases. The main contributions of this paper are:

  1. Identified influence factors for the readability of software test code based on a systematic mapping study (SMS), combining academic and practical views derived from academic and grey literature. Detailed analysis results are available online (Winkler et al. 2023).

  2. Setup and results of a controlled experiment to investigate influence factors on a selected set of test cases in an academic environment (Winkler et al. 2023).

Consequently, the remainder of this paper is structured as follows: Section 2 describes background and related work on test code quality and code readability. Section 3 defines our research questions, followed by three sections explaining the setup, process, and results of the systematic mapping study (Section 4), the grey literature survey (Section 5), and the concluding experiment (Section 6) in the context of the respective research questions. Finally, Section 7 summarizes the findings, discusses implications for academia and practitioners as well as limitations, and identifies future work.

2 Background

The readability of test code is associated with two areas of related research: first, the area of test quality or, more specifically, the quality of the code of automated tests (cf. Section 2.1); second, the related research on source code readability (cf. Section 2.2).

2.1 Software Test Code Quality

In the context of software evolution and maintenance, changes made to the software due to bug fixes, extensions, enhancements, and improvements also require subsequent adaptations of the corresponding tests (Yusifoğlu et al. 2015). Thereby, similar to code quality being an important factor for supporting evolution and maintenance, test code quality is critical for evolving and maintaining tests. Consequently, in test code engineering (Yusifoğlu et al. 2015), the two leading activities are quality assessment and the co-maintenance of test code with production code.

Engineering test code, much like engineering production code, is a challenging process and prone to all kinds of design and coding errors. Hence, test code also contains bugs, which may either cause false alarms (i.e., a test fails although the production code is correct) or “silent horrors” (i.e., a test passes although the production code is incorrect). Both kinds of bugs have been found to be prevalent in practice (Vahabzadeh et al. 2015). The latter kind of bug is also considered a result of “rotten green tests” (Delplanque et al. 2019), which are tests that pass green but do so by inadequately validating the required properties of the system under test.
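To illustrate, the following is a minimal JUnit 5 sketch of a rotten green test; the ShoppingCart and Item classes are hypothetical and serve only to show the pattern.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class ShoppingCartTest {

    // A "rotten green" test: it always passes, but the assertion inside the
    // loop is never executed because the cart is empty, so nothing about the
    // system under test is actually validated.
    @Test
    void allItemPricesArePositive() {
        ShoppingCart cart = new ShoppingCart(); // hypothetical class, no items added
        for (Item item : cart.getItems()) {
            assertTrue(item.getPrice() > 0);    // dead assertion
        }
    }
}
```

Adding a guard, e.g., asserting that the cart actually contains items, would turn such a test into a meaningful check.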

Apart from bugs in tests, a widely reported problem related to test code quality is the presence of test smells (Garousi and Küçük 2018; Spadini et al. 2018; Tufano et al. 2016). Test smells are the equivalent of code smells (Lacerda et al. 2020) (or anti-patterns) in production code, which are symptoms of an underlying problem in the code (e.g., a design problem) that may not cause the software to fail now but bear the risk of causing additional problems and actual bugs in the future. Hence, test smells can be considered as poorly designed tests (similar to rotten green tests), and their presence may negatively affect test suites with respect to the maintainability and even the correctness of the tests (Bavota et al. 2015; Spadini et al. 2018). Although test smells are a popular concept that is frequently investigated in the scientific literature, a recent study by Panichella et al. (2022) suggests a mismatch between the definition of test smells and real problems in the tests. To tackle this mismatch, they update the definitions of test smells and investigate issues that are currently not well covered by the existing smells.

Since the advent of test code generators such as Evosuite, which aim to generate test suites covering the whole system under test, there has been a continuous discussion about improvements and the practical usefulness of these tools. For example, McMinn et al. (2012) leverage web search engines to generate test data of type String. This approach can improve the coverage of the resulting test suites, and the use of more realistic input strings could improve the readability of the test code. Various studies (Fraser et al. 2013; Ceccato et al. 2015; Shamshiri et al. 2018) investigate the usefulness of test code generators in debugging activities and also highlight shortcomings of generated test code, which relate to the high number of assertions, the absence of explanatory comments or documentation, the quality of identifiers, and generally unrealistic test scenarios. Hence, similar to the quality of automatically generated code (Yetistiren et al. 2022; Al Madi 2022), the quality of generated test code is a critical aspect that requires additional consideration and investigation.
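As a concrete illustration of these shortcomings, the following sketch imitates the style of a generated unit test (it is not actual Evosuite or Randoop output): an opaque name, meaningless identifiers, and a pile of assertions without any explanation of the scenario.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

import org.junit.jupiter.api.Test;

class StringSplitGeneratedStyleTest {

    // Generated-style test: the reader has to reverse-engineer the intent
    // from the literal values and the sequence of assertions.
    @Test
    void test042() {
        String string0 = "a/b//c";
        String[] stringArray0 = string0.split("/");
        assertNotNull(stringArray0);
        assertEquals(4, stringArray0.length);
        assertEquals("a", stringArray0[0]);
        assertEquals("b", stringArray0[1]);
        assertEquals("", stringArray0[2]);
        assertEquals("c", stringArray0[3]);
    }
}
```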

2.2 Readability of Source Code

Reading and understanding source code is an elementary activity in software maintenance and development (Minelli et al. 2015). Hence, code readability has been subject to a wealth of empirical studies; e.g., Oliveira et al. (2020) examined 54 papers on code readability in their literature review. In these studies, a wide range of different factors influencing readability have been investigated, including code formatting and indentation, identifier naming, line length, complexity of expressions, complexity of the control flow, use of code comments, presence of code smells, and many others.

Buse and Weimer (2008) developed a model combining aforementioned factors to automatically estimate code readability. They trained their model on small source code snippets extracted from open-source projects and tagged as readable or non-readable by human annotators. Following this approach, generalized and extended code readability models have been developed in subsequent works, e.g., by Posnett et al. (2011) and Scalabrino et al. (2017).

Buse et al. define readability as “a human judgment of how easy a text is to understand” (Buse and Weimer 2008). However, no generally accepted definition of the term readability has been established in the literature and, thus, readability is often used in combination with or as a synonym for the related terms legibility and understandability. The term legibility rather relates to the visual appearance of the source code, affecting the ability to identify individual elements (Oliveira et al. 2020), while the term understandability mainly relates to semantic aspects of the code. Scalabrino et al. (2017) even developed a model specifically dedicated to the understandability of source code, arguing that program understanding is a non-trivial mental process that requires building high-level abstractions from code to understand its purpose, the relationships between code entities, and the latent semantics, which cannot be sufficiently captured by readability metrics alone.

Fig. 1: Concept of readability as used in context of our study

Based on the terms and definitions commonly used in related work, we adopt a broader view on the concept of readability, embracing all three terms – readability, legibility, and understandability – in our work on test code readability. Hence, in the remainder of this paper, we use the term readability to subsume all related notions, since it is the most commonly used term in the software engineering literature.

Figure 1 depicts this view and the distinction between factors (e.g., test case length) influencing readability and the activities (e.g., test case maintenance) affected by readability. The underlying concept is related to activity-based quality modeling as proposed in the Quamoco approach (Wagner et al. 2012, 2015) and used, e.g., for modeling maintainability (Deissenboeck et al. 2007) or requirements quality (Femmer et al. 2015). Readability is described by more fine-grained factors that can be assessed (manually or automatically), and it has an observable impact on the activities performed by stakeholders related to a specific entity. In our context, the entity of interest is the test code, shown as a subset of source code.

3 Research Questions

Based on the goal of this article to investigate the readability of software test code by combining scientific and practical views, we identified three groups of research questions, with focus on (a) influence factors in academia, (b) influence factors in practice, and (c) an investigation of combined influence factors on a selected set of test cases in a controlled experiment.

3.1 Influence Factors in Academia

Based on our previous work, an initial Systematic Mapping Study (SMS) (Winkler et al. 2021), we extended the mapping study by introducing additional analysis criteria. We therefore defined the first research question (RQ1) with two sub-questions to (a) identify influence factors (RQ1.1) and (b) explore the methods used in scientific studies (RQ1.2). We applied the SMS approach proposed by Petersen et al. (2015). Section 4 describes the research protocol and the results of the mapping study.


3.2 Influence Factors in Practice

To systematically capture influence factors discussed in industry and practice, we extended and complemented the mapping study with grey literature. The results show similarities and differences between academia and industry in the context of the readability of software test code. For investigating grey literature, we followed the guidelines proposed by Garousi et al. (2019). Section 5 presents the research protocol and the results.


3.3 Investigating Influence Factors in a Controlled Experiment

Based on the identified factors, we conducted a controlled experiment in an academic environment to investigate the perception of readability and understandability of a selected set of test cases (derived from open source projects). We built on the guidelines proposed by Wohlin et al. (2012) for planning, executing, and reporting the controlled experiment. Section 6 presents the experiment setup and reports on the results.


4 Systematic Mapping Study

To investigate the influence factors on readability in the scientific literature, we conducted a Systematic Mapping Study (SMS) based on Petersen et al. (2015).

In this section, we summarize the study protocol and the results of the systematic mapping study (Winkler et al. 2023). We present the influence factors and the study types with a focus on the research methods used.

4.1 Study Protocol and Process

This section summarizes the study protocol with focus on the systematic mapping study of scientific publications.

Fig. 2: Systematic mapping study process and amount of received publications

An integral part of systematic mapping studies as proposed by Petersen et al. (2015) is the thorough documentation of the process to make the results traceable. Figure 2 provides an overview of the whole process. After the initial search (Step 1) and filtering (Step 2), we apply backward and forward snowballing (Step 3) and filter again (Step 4) to obtain our final set of studies. We repeat Steps 3 and 4, the backward and forward snowballing, until exhaustion, i.e., until they add no new relevant studies to the result set. In our case, no additional publications were identified in the second iteration. The following subsections provide details on each of these steps.

Table 1 Search strings in different databases

Step 1: Apply Search

Based on the research questions (cf. Section 3, RQ1), we defined the following keywords: test, code, model, readability, understandability, legibility, and smell. We used them to build the search strings shown in Table 1. The queries were performed on established sources for scientific literature, i.e., Scopus, IEEE, and ACM, and we filtered the studies based on title, abstract, and keywords. In the ACM search, we enclosed the term “understandability” in double quotes in the abstract filter to enforce exact matching, because ACM’s fuzzy matching leads to a high number of irrelevant results. For ACM, we searched the ACM Guide to Computing Literature, which offers a larger search space than the ACM Full-Text Collection. The search was conducted at the end of November 2021 without limiting the publication year and returned a total of 1232 raw results (Scopus: 460, IEEE: 231, ACM: 541). Based on the merged results, we proceeded to the next step.

Step 2: Deduplicate & Filter Results

We first deduplicated the raw results based on the digital object identifier (DOI) and title, which removed 281 studies. Next, we imported the result set into a spreadsheet solution for applying inclusion and exclusion criteria.

Inclusion Criteria. We included a study if both of the following criteria were fulfilled:

  • Conference papers, journal/magazine articles, or PhD theses (returned by ACM)

  • Readability, understandability or legibility of test code is an object of the study

Exclusion Criteria. We excluded a study if one of the following criteria applied:

  • Not written in English

  • Conference summaries, talks, books, master theses

  • Duplicate or superseded studies

  • Studies not identifying factors that influence test code readability

The criteria were evaluated based on title and abstract of the results by at least one of the authors. When in doubt about including or excluding, the evaluated study was discussed with a second evaluator. This step left us with 11 scientific publications.

Backward & Forward Snowballing - First Iteration

Based on the initial iteration, we executed backward & forward snowballing to identify relevant studies that have not been identified in the initial search.

  • Step 3: Backward & Forward Snowballing. Since relevant literature might refer to further important studies, we used the references included in the 11 studies for backward snowballing via Scopus. The 11 studies might also be cited by other relevant studies; hence, we also performed forward snowballing by using Scopus to find studies that cite one of the initial 11 studies. This increased the result set by 330 studies from backward snowballing and 174 from forward snowballing, to a total of 515 studies.

  • Step 4: Deduplicate & Filter Results. By comparing these 515 studies with the initial result set we found and removed 83 duplicates. Similar to step 2, one of the authors of this paper applied the inclusion and exclusion criteria. Additionally, after a full text reading, all studies were discussed and reevaluated by the author team. With this, we reduced the result set by 496 and obtained a final result of 19 studies.

Backward & Forward Snowballing - Second Iteration

We performed a second iteration of backward and forward snowballing via Scopus with these 19 studies as input. This returned a raw result of 825 studies, which we reduced to 357 by removing duplicates. We applied the inclusion and exclusion criteria to the remaining studies, which removed all 357 of them. Therefore, this second iteration did not add any new relevant studies to the result set of 19 studies.

Table 2 Final set of publications based on the search process

Studies Not Included

In the following, we provide four exemplary cases filtered out in Step 2 and the rationale why these studies did not meet the criteria for inclusion in the final publication set after discussion by all authors: Grano et al. (2020) focus on semi-structured interviews with five developers from industry and a confirmatory online survey to synthesize which factors matter for test code quality. Although readability is seen as a critical factor by all participants, the analysis of readability and influencing factors was not in the scope of this work. Tran et al. (2019) investigated general factors for test quality by interviewing 6 developers from a company. Quality factors are discussed based on natural language tests brought in by the participants. Since our work has its specific focus on test code, the readability of natural language tests was not considered further. Bavota et al. (2015) report on four lab experiments with a total of 49 students and 12 practitioners, investigating the effects of eight test smells on maintenance tasks. These test smells occur frequently in software systems. While this work clearly shows that test smells negatively affect correctness and effort for specific maintenance tasks, a connection between test smells and readability is not shown. Deiß (2008) reports on a case study about the semi-automatic conversion and refactoring of a TTCN-2 test suite to TTCN-3. The discussed improvements included reducing complicated or unnecessary code artifacts generated by the automatic conversion, which are also supposed to increase readability. We excluded this study since its focus was the migration from TTCN-2 to TTCN-3 and it did not investigate the effect of the improvements on the readability of test code.

4.2 Systematic Mapping Study Results

This section summarizes the findings in the context of influence factors on readability found in the scientific literature (Table 2).

4.2.1 Which influence factors are analyzed in scientific literature (RQ1.1)?

In RQ1.1, we explore the factors that have been found to impact the readability of test code.

The influence factors were derived by one of the authors based on the content of each paper. The other authors and testing experts reviewed the initially identified factors, and deviations were discussed by all authors to reach a consensus. Table 3 maps candidate factors to the studies that investigate them. Two approaches to how influence factors are considered in the primary studies can be distinguished. Studies either (a) investigate the impact of one or more individual factors, often related to the attempt to improve readability with a specific approach or tool, or (b) target readability models constructed from a combination of many factors. The majority of the primary studies (see Table 3a) consider individual factors. Readability models were the subject of study in only three instances (Table 3b), although such models are commonly used in the general research on source code readability.

Table 3 Reported factors influencing test code readability

We identified a total of 9 unique influence factors in the scientific literature, as shown in Tables 3a and b. In the following, we briefly explain these factors. The number in brackets indicates the number of primary studies that include the particular factor, combining the counts from both tables.

Test names (6)

The name of the test method or test case. Not only do generated tests have poor names, but names provided by humans also often convey little useful information. Therefore, several studies propose solutions for automatic test renaming, e.g., Zhang et al. [A19], Roy et al. [A16], or Daka et al. [A6]. In studies by, e.g., Setiani et al. [A17], Bowes et al. [A4], or Panichella et al. [A15], participants agree on the importance of test names for readability.

Assertions (5)

This factor relates to the number of assertions in a test case as well as to assertion messages. Although assertions are an integral part of test code, Daka et al. [A5] report only a minor influence of the number of assertions on readability. Nevertheless, the number of assertions is still used in their readability model and also in the model by Setiani et al. [A18]. In the survey by Setiani et al. [A17], assertions are mentioned as having an influence, but other factors like naming are deemed more important. In Almasi et al. [A2], developers raised concerns about generated assertions. Leotta et al. [A10] find no significant effect on readability when AssertJ assertions are used instead of basic JUnit assertions, although other positive effects could be observed.

Identifier names (5)

The naming of variables in test cases. Lin et al. [A13] in particular investigate this factor thoroughly and also provide characteristics of good and bad identifier names based on a survey. Roy et al. [A16] propose an automated approach for renaming identifiers in test cases. In studies by, e.g., Setiani et al. [A17] and Bowes et al. [A4], participants agree on the importance of identifier names for readability. Fisher and Johnson [A7] attribute differences between generated and manually written tests to differences in naming.

Test structure (2+3)

Structural features are found in studies investigating individual factors (2 times) as well as in readability models (3 times). They include structural features of test methods such as maximum line length, number of identifiers, length of identifiers, number of control structures (e.g., branching, as mentioned in Bowes et al. [A4]), test length, etc. Participants in the study by Setiani et al. [A17] agree on the importance of the number of lines of code in the tests. These features are also used in combination by automatic readability raters, e.g., by Daka et al. [A5], who propose a rater especially for test cases, Grano et al. [A9], or Setiani et al. [A18].

Test data (4)

Testers often have to evaluate data used in assertions to decide whether a test has truly failed or whether there is a fault in the test. Afshan et al. [A1] investigate this topic and show that readable string test data helps humans predict the correct outcome. Alsharif et al. [A3] and Almasi et al. [A2] highlight the importance of meaningful test data. Furthermore, in the workshop study by Bowes et al. [A4], developers, amongst others, state that magic numbers are detrimental to readability.

Test summaries (4)

Documentation describing the whole test case supports understanding what the test does, for example as Javadoc, as in Roy et al. [A16] or Li et al. [A11], or interleaved with the test code, as in Panichella et al. [A15]. Li et al. [A12] reduce the amount of generated description by only adding test stereotypes as tags.

Dependencies (3)

The number of classes a single test case depends on, as proposed by Fraser et al. [A8], or whether a test is truly a unit test and therefore independent of other parts of the system, as reported by Setiani et al. [A17]. Test coupling and cohesion, discussed by Palomba et al. [A14], describe dependencies between tests and are included in this factor.

Comments (2)

Single comments in test code providing useful information. According to Fisher and Johnson [A7], one of the differences between their generated tests and human-written tests is the lack of explanatory comments. In Setiani et al. [A17], survey participants also mention that comments are to some degree important for readability.

Textual features (1)

Textual features focus on natural language properties of test cases, such as the consistency of identifiers or whether identifiers appear in a dictionary. These features can easily be computed and are therefore frequently used in readability models and automatic readability raters, as in Scalabrino et al. (2016).


4.2.2 Which research methods are used in scientific studies (RQ1.2)?

In RQ1.2, we give an overview of the study types and the methods used. This analysis is based on the classification of established empirical research methods involving human participants by Wohlin et al. (2012). Although software tools for investigating the readability of source code exist, the readability of software tests is not the main focus of these approaches.

Concerning the study types shown in Table 2, most studies (15) report an experiment, which is combined with a survey in 7 studies. Human involvement is quite common: in 16 of 19 studies, humans take part in experiments or surveys or play another role as participants of the study. Next, we present details on the individual types of studies.

Experiment (15)

12 of the 15 studies evaluate the effect of an approach with humans by either asking participants to answer questions about a given test case or code snippet without knowing its origin, as in Roy et al. [A16] or Daka et al. [A6], or by letting participants choose between two versions (forced choice), as in Setiani et al. [A17] or Daka et al. [A5]. Alsharif et al. [A3] enhance their experiment by letting some participants vocalize their thoughts while filling out a questionnaire in a think-aloud study. Li et al. [A12] do not fit into this categorisation. They use an indirect approach to measure the effect of generated tags by letting one group write summaries of test cases with and without treatment. Another group then rates these summaries according to a scheme. The difference in the ratings shows the effect of the treatment. For analysing the experiment results, eight of the 15 studies use a form of the Wilcoxon test, most commonly the Wilcoxon rank sum test. Furthermore, these studies report the effect size with the Vargha-Delaney (\(\hat{A}_{12}\)) statistic or Cliff’s Delta. Three of these studies also use the Shapiro-Wilk test for normal distribution to decide whether a parametric test can be applied. Alsharif et al. [A3] use a Fisher’s exact test on their results. The remaining studies interpret the results without statistical tests.

Survey (8)

Five out of eight studies use online questionnaires, one uses an off-site questionnaire, and for one study the kind of survey could not be extracted. Six out of the eight studies use Likert scales in their surveys, often for rating readability. Free text answers are also common, either for optionally elaborating on a rating or as a mitigation against random readability ratings, as in Daka et al. [A6].

User study and Prototype (3)

The three studies of these types use surveys with a Likert scale (Lin et al. [A13]), forced-choice questions with the opportunity to elaborate on the rating (Zhang et al. [A19]), or a mixture of multiple-choice, binary-choice, and open questions (Li et al. [A12]). Zhang et al. [A19] use the Wilcoxon test to compare the results of a prototype tool with other tools, after testing for normality with the Shapiro-Wilk test.

Concept Paper (1)

Bowes et al. [A4] brainstorm and discuss the quality evaluation of software tests with industry partners. Afterwards, they merge the results with their own teaching experience as well as relevant scientific literature and books on software testing.


5 Grey Literature Review

In this section, we first describe the study protocol and process for the grey literature analysis, followed by a presentation of the results, including a discussion of the respective research questions. The data set is available online (Winkler et al. 2023).

5.1 Study Protocol and Process

The process for conducting the review of grey literature (Fig. 3) is similar to the scientific literature review, except that there is no backward snowballing. The guidelines and recommendations by Garousi et al. (2019) were used as input for this part of our work. We decided to add grey literature to this work, because testing is frequently performed by practitioners, and we assume that for them the internet is one of the first places used for information gathering and sharing.

Fig. 3: Grey literature review process and amount of received grey sources

Step 1: Apply Search

Based on the research questions and the knowledge obtained from the previous literature search, we used the search strings “test code” readability, “test code” understandability, and “test code” legibility. We performed these queries separately on Google using a script for extracting all results. The script mimics a search without being logged in with a Google account; therefore, personalized search results should be reduced to some degree. In contrast to Google’s estimate of hundreds of thousands of results, the script returned 146 results for “test code” readability, 101 for “test code” understandability, and 167 for “test code” legibility (total: 414) in mid-February 2022.

Step 2: Deduplicate & Filter Results

We first deduplicated the results by comparing the links, which removed 18 sources. The result set was imported into a spreadsheet for applying the inclusion and exclusion criteria.

Inclusion Criteria. We included a study if the following criterion was fulfilled:

  • Readability or understandability of test code is a relevant part of the source. This is the case if the length of the content on readability is sufficient and if the source contains concrete examples of factors influencing readability.

Exclusion Criteria. We excluded a study if one of the following criteria applied:

  • Not written in English

  • Literature indexed by ACM, Scopus, IEEE

  • Duplicates, videos, dead links

The criteria were evaluated based on the contents of the source. This step left us with 62 results ready for further analysis and extraction of influence factors.

Excluded Sources. Similar to the scientific literature search, we provide some examples and the rationale for sources that were excluded when applying the defined criteria: Source Karhik (Footnote 2) is a blog entry, which is relatively short and primarily lists features of AssertJ. Although the entry mentions a readability improvement from using AssertJ in one sentence, it gives no explanation for this claim. Source Karlo Smid (Footnote 3) discusses the DRY principle (don’t repeat yourself) in the context of unit testing. However, the blog entry is very short and primarily references another source already present in the result set [G59]. Although the collaborative source Openstack (Footnote 4) has a reasonable size and also has a section on readability, its statements are too generic and do not contain a concrete influence factor on readability. Finally, there are also many sources which are off-topic, e.g., because they discuss general code readability or quality, describe advantages of unit testing, or are documentation pages of test frameworks.

5.2 Grey Literature Analysis Results

In the following, we present the results from our further analysis of the grey literature sources with regard to factors influencing readability, and we provide answers to the research questions RQ2.1 and RQ2.2.

Influence factors. We identified 12 types of influence factors in the analysis of the grey literature. The factors are related to test structure (Str), test names (TeN), assertions (Asse), helper structures (Help), dependencies (Dep), identifier names (IdN), fixtures (Fix), the DRY principle (DRY), test data (TeD), comments (Com), domain specific languages (DSL), and parameterized tests (Par). A detailed description of each factor is provided in Section 5.2.1.

The influence factors were extracted from the literature sources by one of the authors by tagging each source with keywords mentioned in the context of test code readability. Keywords mentioned in a different context were not included. For example, in [G49] the use of “helper methods” is only mentioned in the context of easier maintenance; therefore, this appearance of the factor helper structures is not counted. The results were cross-checked and discussed by the other authors of the study.

In our analysis, we also investigated which types of grey literature sources we analyzed, when the sources mentioning the influence factors were published, and in the context of which programming languages readability was discussed.

Source types. Figure 4 shows the identified types of grey literature sources. Of the 62 sources, around 75% (46 in total) are identified as blog entries of various sizes. A source is also classified as a blog when there is no clear indication that an editorial team is involved. The remaining 16 sources are spread across 5 books, 5 other types (Stack Overflow, Quora, wiki, cheat sheet, podcast), 3 magazines, 2 presentations (slide shows), and 1 PhD thesis.

Fig. 4: Types of analyzed selected grey sources

Factors across years. Figure 5 shows the factors investigated by the blogs across the years. The bottom line, Sources per year, gives the number of sources in a particular year that investigated the factors above. Apart from parameterized tests, which appeared only seven times in the years 2020 and 2021 and in fewer sources in 2017 and 2018, there are no obvious fluctuations in the distribution of factors. Table 4 shows the selected sources ordered by year in descending order and the investigated factors in detail, where these effects are also visible.

Fig. 5: Factors investigated by grey literature. The bottom row gives the total number of sources per year, which may cover multiple factors

Programming languages. Concerning programming languages, 19 sources mention Java or use Java code snippets, C# appears in ten sources, JavaScript in nine, and Ruby in three. Kotlin and Python each appear in two sources; Scala, TypeScript, C++, and Go are mentioned in one source each. Some sources do not mention a particular programming language or do not use code snippets, because they provide general best practices for testing. This is in accordance with the findings of our previous SMS (Winkler et al. 2021), where Java is the dominant language used in studies on test code readability.

5.2.1 Which influence factors are discussed in grey literature (RQ2.1)?

In the following, each of the 12 influence factors identified in the grey literature analysis is described in detail. The number in brackets shows how many of the 62 reviewed literature sources mention the factor; the counts range from 28 (45%) to 8 (13%). Table 4 lists the analyzed grey literature sources (rows) and shows in which of these sources the identified influence factors (columns) are mentioned. For the sake of completeness, the table shows all 14 influence factors identified in the scientific as well as the grey literature search, which includes two factors not mentioned in the grey literature.

Table 4 Factors influencing readability mapped to sources from the grey literature search

Test structure (28)

(Str): 23 out of 28 sources suggest the use of patterns like Arrange, Act, Assert [G31], Given, When, Then [G18], or Build, Operate, Check [G54]. Two sources ([G35] and [G25]) suggest grouping similar test cases to see differences more quickly. Other sources [G52][G22] suggest watching out for “eye-jumps”, e.g., a variable which is initialized many line breaks away from its usage. The absence of logic, shortness, and coherent formatting of test cases are also mentioned by several authors.
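A minimal JUnit 5 sketch of the Arrange, Act, Assert structure recommended by these sources; the Account class is hypothetical.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class AccountTest {

    @Test
    void depositIncreasesBalance() {
        // Arrange: set up the object under test with a known state
        Account account = new Account(100); // hypothetical class

        // Act: perform exactly one behavior
        account.deposit(50);

        // Assert: verify the observable outcome
        assertEquals(150, account.getBalance());
    }
}
```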

Test names (26)

(TeN): All sources suggest coherent naming of test cases, and most of them suggest a concrete naming pattern like givenFooWhenBarThenBaz [G3] or subject_scenario_outcome [G57]. Three sources ([G31], [G13], and [G28]) explicitly suggest using spaces in test names, a practice also shown by others in code examples, e.g., [G57] and [G5]. Long names are explicitly acceptable for two sources, since these methods are not called in other parts of the code. Different opinions exist on including the name of the concrete tested method in the test name. [G20], [G34], and [G52] suggest including the method name in the test name. Other sources like [G41], [G11], and [G9] do not recommend including the method name because, if the method name changes, the test name has to change too; instead, the tested behavior should be described.
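The naming patterns above can be sketched as follows in JUnit 5; Account and InsufficientFundsException are hypothetical, and @DisplayName is one way to obtain test names with spaces in Java.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

class WithdrawalTest {

    // given/when/then style: the name describes the tested behavior,
    // not the name of the method under test.
    @Test
    void givenEmptyAccount_whenWithdrawing_thenInsufficientFundsIsReported() {
        Account account = new Account(0); // hypothetical class
        assertThrows(InsufficientFundsException.class, () -> account.withdraw(10));
    }

    // @DisplayName allows a readable name with spaces in the test report.
    @Test
    @DisplayName("withdrawing from an empty account is rejected")
    void withdrawingFromAnEmptyAccountIsRejected() {
        Account account = new Account(0);
        assertThrows(InsufficientFundsException.class, () -> account.withdraw(10));
    }
}
```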

Assertions (24)

(Asse): The use of appropriate or custom assertions is suggested in eleven sources, e.g., [G9] and [G3]. Nine sources mention assertion libraries like AssertJ (Java) or FluentAssertions (C#), since they enable a more natural language style for asserting properties and contain additional assertions for collection types [G50][G18]. Four sources stress the importance of assertion messages for debugging. Concerning the number of assertions, the rule “one assertion per test” is mentioned by, e.g., [G31] and [G25].
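The contrast between a basic JUnit assertion and a fluent AssertJ assertion, as discussed in these sources, can be sketched as follows (assuming JUnit 5 and AssertJ on the classpath).

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.List;

import org.junit.jupiter.api.Test;

class AssertionStyleTest {

    private final List<String> names = List.of("Ada", "Grace", "Alan");

    // Basic JUnit assertion: the intent is hidden in a boolean expression
    // and the failure message has to be written by hand.
    @Test
    void containsGrace_basicAssertion() {
        assertTrue(names.contains("Grace"), "expected list to contain Grace");
    }

    // AssertJ: reads like a sentence and produces a descriptive failure message.
    @Test
    void containsGrace_fluentAssertion() {
        assertThat(names).contains("Grace").doesNotContain("Linus");
    }
}
```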

Helper structures (23)

(Help): 13 sources recommend helper methods in order to hide (irrelevant) details like creating objects or asserting properties [G27][G19]. The builder pattern (or similar patterns) is used by six sources for creating the objects under test, e.g., [G45][G61]. Inheritance of test classes is seen critically by some authors, e.g., [G53][G62].
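A sketch of a test data builder as a helper structure; Order and OrderBuilder are hypothetical and stand in for the patterns described in these sources.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class OrderDiscountTest {

    // Helper hiding irrelevant details: fields that do not matter for the
    // scenario get sensible defaults, so the test only states what matters.
    private static OrderBuilder anOrder() {
        return new OrderBuilder()          // hypothetical builder
                .withCustomer("any customer")
                .withItemCount(1);
    }

    @Test
    void largeOrdersQualifyForDiscount() {
        Order order = anOrder().withTotal(1_000).build();
        assertTrue(order.qualifiesForDiscount());
    }
}
```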

Dependencies (19)

(Dep): All 19 sources agree that one test should only test one functionality or behavior. This affects readability positively, because the test stays short and the test name can be more descriptive, since only one behavior has to be described. Four sources highlight that a test should only assert properties which are absolutely necessary for the functionality described by the test name, and that one should resist the urge to check additional properties.

Identifier names (17)

(IdN): While nine sources only give generic advice (e.g., identifiers should have meaningful or intention-revealing names), other sources provide detailed recommendations, suggesting, e.g., to prefix variables with expected and actual [G28] or to use names like testee, expected, and actual [G11].

Fixtures (17)

(Fix): Although 15 of the 17 sources use fixtures, sometimes in combination with setup methods, two sources [G62][G52] argue against the use of fixtures, because they are not visible in the test itself and may contain important information. Similarly, [G28] points out that moving reusable test data into a fixture forces the reader to jump between two locations. Finally, [G21] suggests that fixtures should only be used for infrastructure and not for the system under test, and [G15] recommends using them only for properties which are needed in every test case.
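A short JUnit 5 sketch of a fixture with a setup method; the Inventory class is hypothetical. It illustrates both sides of the argument: the individual tests stay short, but the reader has to jump to setUp() to see how the fixture is initialized.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class InventoryTest {

    private Inventory inventory; // hypothetical class

    // Fixture: shared setup moved out of the individual tests.
    @BeforeEach
    void setUp() {
        inventory = new Inventory();
        inventory.add("screws", 100);
    }

    @Test
    void removingItemsDecreasesStock() {
        inventory.remove("screws", 40);
        assertEquals(60, inventory.stockOf("screws"));
    }

    @Test
    void addingItemsIncreasesStock() {
        inventory.add("screws", 10);
        assertEquals(110, inventory.stockOf("screws"));
    }
}
```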

DRY principle (16)

(DRY): In the sources which mention the Don’t Repeat Yourself (DRY) principle, there is agreement that strict adherence to this principle hides away information important for understanding test cases. Others favour the Descriptive And Meaningful Phrases (DAMP) principle [G11] or finding a balance between the two principles. As a combination of both, two sources [G36][G53] suggest clearly showing what a test does (DAMP) but hiding how it is done in a helper method (DRY).

Test data (15)

(TeD): Five authors suggest avoiding literal test data (a.k.a. magic values); instead, local variables, constants, or helper functions should be used to provide additional information, e.g., [G37][G34]. However, [G28] argues that declaring local variables for this purpose can quickly increase the test size, and the mapping between variable and actual value has to be kept in mind when reading the test. Similarly, [G15] states that using literal values instead of variables sometimes improves readability. Finally, test data should be production-like and simple, and one author also recommends highlighting important data.
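The two positions on literal test data can be contrasted in a small sketch; ShippingCalculator is hypothetical.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class ShippingCostTest {

    // Magic values: the reader has to guess what 31.8 and 4.99 stand for.
    @Test
    void costForHeavyParcel_literalValues() {
        assertEquals(4.99, ShippingCalculator.costFor(31.8), 0.001);
    }

    // Named values: the same data, but the intent is visible in the test.
    // (Some sources argue this adds indirection for very simple tests.)
    @Test
    void heavyParcelsUseFlatRate() {
        double heavyParcelWeightKg = 31.8;
        double flatRate = 4.99;
        assertEquals(flatRate, ShippingCalculator.costFor(heavyParcelWeightKg), 0.001);
    }
}
```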

Comments (14)

(Com): Eleven sources use comments in their snippets or mention them in the text to highlight Arrange, Act, Assert or similar structures. However, this is not a strict rule for every author: source [G57] uses empty lines as an alternative, and [G18] mentions using comments depending on the capabilities of the testing framework. If the framework already provides such structural hints, then comments are unnecessary. Common code comments are mentioned by three sources with the general advice to avoid them, e.g., [G25].

Domain specific language (8)

(DSL): In order to make tests more readable also for non-programmers, some sources, e.g., [G38][G42], suggest using helper functions or Gherkin (applied in Behavior Driven Testing with Cucumber) as domain specific languages. Such languages describe the executed behavior in natural language and, thus, hide the execution details.

Parameterized test (8)

(Par): Eight sources suggest using parameterized tests (a.k.a. data-driven or table-driven tests) to reduce code duplication. This is also suggested by authors who are not in strict favor of the DRY principle, e.g., [G28].
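A minimal JUnit 5 sketch of a parameterized test; RomanNumeral is hypothetical. Each row of the data table documents one input/expected-output pair that would otherwise require its own near-identical test method.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class RomanNumeralTest {

    @ParameterizedTest
    @CsvSource({
            "1,    I",
            "4,    IV",
            "9,    IX",
            "1994, MCMXCIV"
    })
    void convertsArabicToRomanNumerals(int arabic, String expected) {
        assertEquals(expected, RomanNumeral.fromArabic(arabic)); // hypothetical class
    }
}
```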


5.2.2 What is the difference between influence factors in scientific literature and grey literature (RQ2.2)?

Of the total of 12 factors identified in the grey literature review, 7 were already known from the scientific literature, while 5 factors were only found in the grey literature. These factors are new and did not appear in our previous scientific literature study (Winkler et al. 2021). The scientific literature review identified a total of 9 factors, including two factors that were mentioned only in the scientific literature but not in the grey literature. In the descriptions below, the numbers in brackets (A vs B) indicate how often a factor was found in the scientific literature versus the grey literature.

However, even if factors have been found in both sources, the specific view on a factor can sometimes vary between scientific and grey literature. For example, quantifiable structural properties like line length or number of identifiers tend to be in the focus of the scientific literature, whereas grey literature sources focus more on the semantic structure, e.g., the Arrange-Act-Assert pattern. Table 5 provides an overview of the differences identified in our analysis.

Table 5 Differences in influence factors between scientific and grey literature

Test structure (5 vs 28)

(Str): Literature published in an academic context tends to focus more on countable properties like maximum line length, number of control structures, etc. (see, e.g., Grano et al. [A9], Daka et al. [A5], or Setiani et al. [A18]). In contrast, the authors of grey literature sources focus on a semantic form of structure like the AAA pattern, which is also discussed in a study by Setiani et al. [A17]. They report a moderate positive influence of this Arrange, Act, Assert structure on readability.

Test names (6 vs 26)

(TeN): Like the grey literature, the scientific literature also mentions the use of naming patterns, e.g., when test cases are renamed. Zhang et al. [A19] and Daka et al. [A6] use testSubjectOutcomeScenario, where “Subject” is the method under test, although outcome and scenario can be left out. The approach by Roy et al. [A16] generates test names with a machine learning model based on the body of the test. According to the examples given in the paper, this approach does not necessarily include the concrete method under test in the name. In other studies, e.g., Panichella et al. [A15] or Setiani et al. [A18], survey participants highlight the importance of meaningful test names.

Assertions (5 vs 24)

(Asse): Some grey sources suggest applying the “one assertion per test” rule. However, there is little evidence in the scientific literature about the effect of assertions on readability. Setiani et al. [A17] report a low influence of assertion messages on readability. Furthermore, Setiani et al. [A17] and Daka et al. report a negligible influence of the number of assertions. Studies like Bai et al. (2022) or Panichella et al. (2022) from the field of test smells confirm the negligible importance of assertion messages and the number of assertions. Almasi et al. [A2] report concerns from developers about the meaningfulness of generated assertions. Leotta et al. [A10] report no significant influence on test comprehension when AssertJ is used instead of basic JUnit assertions. This contradicts the voices from the grey literature, which suggest improving readability with fluent assertions.

Helper structures (0 vs 23)

(Help): This factor has been identified only in the grey literature. In this context, the builder pattern and similar patterns are discussed, relating to practical recommendations for good design.

Dependencies (3 vs 19)

(Dep): The recommendation that one test should only test one functionality or behavior is mentioned by Palomba et al. [A14] (Footnote 5). The participants in the study by Setiani et al. [A17] agree to some extent that a unit test should only depend on one unit, which reflects the view on this factor in the grey literature.

Identifier names (5 vs 17)

(IdN): The survey by Lin et al. [A13] shows the importance of meaningful, concise, and consistent identifiers. The renaming approach by Roy et al. [A16] also suggests variable names like expected and result. Their deep learning model was trained on software projects of high quality. Therefore, it seems plausible that identifier names such as those mentioned in the grey literature sources are commonly used in tests of high-quality projects.

Fixtures (0 vs 17)

(Fix): This factor has been identified only in the grey literature, which discusses arguments for and against the use of test fixtures from a practical perspective.

DRY principle (0 vs 16)

(DRY): This factor has been identified only in the grey literature, in the context of practical recommendations on how to apply this development principle to test code.

Test data (4 vs 15)

(TeD): Participants of the workshop by Bowes et al. [A4] also recommend avoiding magic values. Almasi et al. [A2] and Afshan et al. [A1] highlight the importance of meaningful or human-like test data.

Comments (2 vs 14)

(Com): The use of comments for highlighting the structure of the test is not investigated in the scientific literature. Fisher and Johnson [A7] attribute the different readability ratings between generated tests and human-written tests partly to the lack of explanatory comments. In Setiani et al. [A17], survey participants also mention that comments are to some degree important for readability. These findings to some extent contradict the recommendation in the grey literature, which is generally to avoid such explanatory comments.

Domain specific language (0 vs 8)

(DSL): This factor has been identified only in the grey literature. It relates to test frameworks used in practice such as Gherkin.

Parameterized test (0 vs 8)

(Par): This factor has been identified only in the grey literature. It relates to practical suggestions to reduce code duplication by using parameterized tests.

Test summaries (4 vs 0)

(TS): This factor has been identified only in the scientific literature. It is related to the application of source code summarization techniques investigated in related research as support for understanding test code.

Textual features (1 vs 0)

(TF): This factor has been identified only in the scientific literature. It is related to the application of natural language processing investigated in related research for test cases.


6 Evaluation of Influence Factors

For the following experiment we take the results from the systematic mapping study (Section 4) and the grey literature review (Section 5) and investigate a selection of identified influence factors with focus on the perception of test case readability.

Fig. 6: Experiment process and amount of received responses

6.1 Experiment Setup and Procedure

The experiment follows an A/B testing approach: the participants rate the readability of original and altered test cases. Experiments based on A/B testing are a good approach for comparing the effect of a treatment on a population. In our scientific literature review, we also found some studies using this approach, e.g., Roy et al. (2020a) or Setiani et al. (2021a). Participants of a master's course on software testing at TU Wien were invited to take part voluntarily in this online experiment, with bonus points for the course as a possible reward. Figure 6 shows an overview of the experiment process. We discuss the individual steps in the following sections.

Table 6 Listing of test cases with their assigned influence factor, originating project and differences made for both versions

6.1.1 Select Tests

We searched open source repositories for test cases that are related to the influence factors we identified in our literature study, specifically test cases adhering to or contradicting these factors. We selected 30 test cases covering different influence factors from 8 sources, which also contain tests generated by Randoop and Evosuite. Table 6 shows the influence factor, test name, and origin project. Most tests, including the automatically generated tests, come from the open source project Apache Commons Lang3. Other sources include the Spring Framework, IntelliJ, and Apache Flink. The last three tests with origin project “Student Solution” are selected tests written by students for a course assignment.

The test cases we found in our search and that are used in the subsequent experiment cover 7 out of the 14 influence factors identified in the literature study (see Table 4 in Section 5.2 for a complete list of influence factors), since we limited our selection to test cases retrieved from real-world open source projects that can be clearly related to individual influence factors. Therefore, we included test cases related to the influence factors Test Structure (Str), Assertions (Asse), Dependencies (Dep), Test Data (TeD), Comments (Com), Fixtures (Fix), and Parameterized Tests (Par), and excluded test cases related to Test Names (TeN), Identifier Names (IdN), Test Summaries (TS), Textual Features (TF), Helper Structures (Help), the DRY Principle (DRY), and Domain Specific Language (DSL).

6.1.2 Apply Best Practices

For each test, we create an alternative version, following the findings from the literature study. For example, long test cases (variant A) were modified by splitting them up into two or more smaller test cases (variant B). This modification corresponds to the best practice suggested in [G28]. Similarly, test cases using standard assertions (variant A) were modified by using dedicated assertion frameworks such as AssertJ (variant B), as suggested in [G18].
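The following sketch illustrates this kind of A/B modification for the splitting case; it is not one of the actual experiment tests, but uses StringUtils.isBlank from Apache Commons Lang3 (the project most of our tests come from) as an illustrative example.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.apache.commons.lang3.StringUtils;
import org.junit.jupiter.api.Test;

class IsBlankTest {

    // Variant A (original style): one test covers several behaviors at once.
    @Test
    void testIsBlank() {
        assertTrue(StringUtils.isBlank(null));
        assertTrue(StringUtils.isBlank(""));
        assertTrue(StringUtils.isBlank("   "));
        assertFalse(StringUtils.isBlank("abc"));
        assertFalse(StringUtils.isBlank("  abc  "));
    }

    // Variant B (modified): the same checks split into focused tests.
    @Test
    void nullEmptyAndWhitespaceOnlyStringsAreBlank() {
        assertTrue(StringUtils.isBlank(null));
        assertTrue(StringUtils.isBlank(""));
        assertTrue(StringUtils.isBlank("   "));
    }

    @Test
    void stringsWithContentAreNotBlank() {
        assertFalse(StringUtils.isBlank("abc"));
        assertFalse(StringUtils.isBlank("  abc  "));
    }
}
```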

Table 6 provides an overview of the covered influence factors from the literature study and a short description of the differences between the A and B versions of the tests in the column “Modification A/B”.

Furthermore, we used three additional test cases (not shown in the table) without modification as a control group. The purpose of these control tests is to verify that the participants show a consistent rating behavior for A and B tests, which allows us to assert the internal validity of the experiment.

6.1.3 Create Survey

We created surveys containing a subset of 12 test cases out of the entire set of the 30 tests listed in Table 6. Each survey contained an equal mix of A and B variants. In total we created 6 different surveys to provide full coverage of all 30 tests in each of the variants. The surveys were randomly distributed to the participants taking part in the experiment, who were unaware of the covered influence factors and whether the included tests were modified or unmodified.

The participants were asked to rate the readability on a 5-point Likert scale from 1 (unreadable) to 5 (easy to read) and to provide up to three free-text reasons for their rating. Before and after this main task of the experiment, there is a pre- and a post-questionnaire for collecting information on the participants’ background and feedback about the experiment run.

We developed the questionnaires using Google Forms, which provides an easy way of creating surveys that can also be reused for future replications. The collected data can be exported in various formats for processing and analysis. Besides the survey forms, we provided the selected tests as a PDF and as plain text files as additional supporting materials for the study participants.

6.1.4 Execute A/B Experiment

The online survey was open for two weeks and the participants were free to start and stop their run at any time in this period. The duration for taking part in the experiment was about one hour per participant. In total, 77 participants completed the survey.

6.1.5 Analysis

We use the software R to calculate the significance of the results with statistical tests at a significance level of \(\alpha = 0.05\). According to an analysis with the Shapiro-Wilk test, the rating data does not follow a normal distribution. Therefore, and since our data is unpaired, we use the Wilcoxon rank sum test. When a significant difference between the distributions of groups A and B is found, we report the effect size with Cliff’s Delta (\(\delta \)). Roy et al. (2020a) used the same approach for their Likert scale data. Cliff’s Delta is interpreted according to Romano et al. (2006) with \(|\delta |<0.147\) “negligible”, \(|\delta |<0.33\) “small”, \(|\delta |<0.474\) “medium”, otherwise “large”.
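We performed this analysis in R; for illustration only, the same computations can be sketched in Java using the Mann-Whitney U test from Apache Commons Math (equivalent to the unpaired Wilcoxon rank sum test) and a direct implementation of Cliff's Delta. The class and method names below are illustrative, not part of our tooling.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

public class ReadabilityStats {

    // Two-sided p-value of the Wilcoxon rank sum (Mann-Whitney U) test
    // for two unpaired samples of readability ratings.
    static double wilcoxonRankSumP(double[] ratingsA, double[] ratingsB) {
        return new MannWhitneyUTest().mannWhitneyUTest(ratingsA, ratingsB);
    }

    // Cliff's Delta: (#pairs with a > b minus #pairs with a < b) / (m * n).
    static double cliffsDelta(double[] ratingsA, double[] ratingsB) {
        int greater = 0;
        int less = 0;
        for (double a : ratingsA) {
            for (double b : ratingsB) {
                if (a > b) greater++;
                else if (a < b) less++;
            }
        }
        return (greater - less) / (double) (ratingsA.length * ratingsB.length);
    }
}
```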

Table 7 Information on participants experience

6.2 Experiment Results

This section presents the results of the experiment on the readability of the selected set of test cases and the related influence factors. Some factors influencing readability appear more than once in Table 6, and the modifications have different goals. Therefore, we analyse the differences between groups A and B separately for these modifications. We discuss each modification after an overview of the participants in the following sections.

Participants’ experience. To gather information about our participants, we asked for their years of experience in software development in general and in professional software development. They could choose between 0, 1-2, 2-5 and >5 years. Table 7 shows the results for both questions. Almost 45% of our participants have more than five years of experience in software development and more than 50% have two to five years of experience. Concerning professional development, around 30% have either one to two or two to five years of experience. In total, around 75% have worked for at least one year.

6.2.1 Do factors discussed in practice show an influence on readability when scientific methods are used (RQ3.1)?

Figure 7 shows the distribution of the aggregated readability ratings, including box plots, for the investigated modifications mapping to influence factors. Table 8 shows the results of the statistical analysis. The first column, “Modification A/B (Influence Factor)”, maps to the corresponding columns in Table 6. We discuss each modification in the following sections. As a reminder, we interpret Cliff’s Delta (\(\delta \)) according to Romano et al. (2006) with \(|\delta |<0.147\) “negligible”, \(|\delta |<0.33\) “small”, \(|\delta |<0.474\) “medium”, and otherwise “large”.

Fig. 7 Distribution and box plots of aggregated readability ratings per A/B modification. Ratings on a five-point Likert scale range from 1 (not readable) to 5 (very readable). The numbers on the right-hand side of the histograms indicate the number of answers for each rating

Table 8 Statistical analysis of experiment results using a two-sided Wilcoxon Rank Sum test (p) and Cliff’s Delta (\(\delta \)) for effect size

Loops vs. Unrolled (Fig. 7a).

In this modification the difference between A and B of the aggregated results is significant with \(p=0.02\). The effect size \(\delta =-0.35\) is at the lower end of “medium”. The analysis of the individual tests reveals that the overall significance is driven by the last test with \(p=0.01\) and \(\delta =-0.67\) (a “large” effect). In this test the code contains two 2D arrays, nested loops to perform the test, and string concatenation for the assertion message. The modified version primarily consists of assertions for all cases the loops generate, without assertion messages.
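The actual experiment test is not reproduced here; the following hypothetical sketch only illustrates the general pattern of this modification, with a loop over a small data table and a concatenated assertion message in variant A, and the unrolled assertions without messages in variant B.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class MaxTest {

    // Variant A: loop over a small table of inputs and expected outputs,
    // with a concatenated assertion message.
    @Test
    void maxWithLoop() {
        int[][] cases = { {1, 2, 2}, {5, 3, 5}, {4, 4, 4} };
        for (int[] c : cases) {
            assertEquals(c[2], Math.max(c[0], c[1]),
                    "max(" + c[0] + ", " + c[1] + ") should be " + c[2]);
        }
    }

    // Variant B: the loop unrolled into one assertion per case,
    // without assertion messages.
    @Test
    void maxUnrolled() {
        assertEquals(2, Math.max(1, 2));
        assertEquals(5, Math.max(5, 3));
        assertEquals(4, Math.max(4, 4));
    }
}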

Try Catch vs. AssertThrows (Fig. 7b).

Overall there is no significant difference in the readability ratings between the original and the modified versions. Only in the second test is the difference between A and B barely significant (\(p=0.04\)), although it has a “large” effect size with \(\delta =-0.54\). One possible explanation for this result could be the relatively short length of this test in comparison to the other tests in this modification. Due to the short length, there may be no other bad practices that mask the positive influence of this modification.
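As a hedged sketch of this kind of modification in JUnit 5 (our own example, not one of the experiment tests): variant A uses the try/catch-and-fail idiom, while variant B expresses the same expectation with assertThrows.

import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.fail;
import org.junit.jupiter.api.Test;

class ParseTest {

    // Variant A: expected exception checked with the try/catch-and-fail idiom.
    @Test
    void parseRejectsNonNumericInputTryCatch() {
        try {
            Integer.parseInt("not a number");
            fail("Expected a NumberFormatException");
        } catch (NumberFormatException expected) {
            // expected: nothing to do
        }
    }

    // Variant B: the same check expressed with assertThrows.
    @Test
    void parseRejectsNonNumericInputAssertThrows() {
        assertThrows(NumberFormatException.class,
                () -> Integer.parseInt("not a number"));
    }
}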

Variable Re-Use (Fig. 7c).

Neither the figure nor the statistical analysis show a significant difference in the ratings.

Structure (Fig. 7d).

Overall there is a clear difference between the groups of this modification with \(p=0.0\) and a “large” effect, \(\delta =-0.59\). Only for one of the three tests is the difference between groups not significant (\(p=0.16\)).

Comments (Fig. 7e).

Although none of the individual tests shows a significant difference between A and B, the aggregated result is significantly different with \(p=0.02\) and an effect size at the lower end of “medium” (\(\delta =0.36\)). Since the modification removed the comments present in the original tests, the A version contains more information than B. A look at Fig. 7e and the median values in Table 8 shows that the participants gave the A version better ratings, which is also reflected by the positive sign of the effect size. The comments do not highlight the structure of the tests; they are explanatory comments. This confirms the positive influence of comments on readability reported in the scientific literature.

Loops vs Parameterized (Fig. 7f).

As in Loops vs. Unrolled, the difference of the complete modification between groups A and B is significant with \(p=0.00\) and \(\delta =-0.34\), an effect size at the lower end of “medium”, again driven by the last test. The original version is the same as in Loops vs. Unrolled, but the modified version extracts the test case data into an inlined CSV as input for a parameterized test. The other forms of parameterized tests did not lead to significant changes in the readability ratings. In pursuit of the hypothesis from Section 7.2, we also compared the ratings of this A group with the A group from Loops vs. Unrolled. Looking at the median values, the hypothesis seems to hold, because the values from this modification are lower in two of the three tests. However, the Wilcoxon test does not detect a significant difference in the ratings (\(p=0.11\)).
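A minimal JUnit 5 sketch of the inlined-CSV pattern described above, using the same hypothetical example data as in the Loops vs. Unrolled sketch (not the actual experiment test):

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class MaxParameterizedTest {

    // Variant B: the loop from variant A replaced by a parameterized test
    // with the case data inlined as CSV.
    @ParameterizedTest
    @CsvSource({
            "1, 2, 2",
            "5, 3, 5",
            "4, 4, 4"
    })
    void maxReturnsLargerValue(int a, int b, int expected) {
        assertEquals(expected, Math.max(a, b));
    }
}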

Split Up (Fig. 7g).

There is a clearly significant difference between A and B with \(p=0.0\) but only a “small” effect size, although with \(\delta =-0.33\) it is on the edge of a “medium” effect. In detail, one test is significant (\(p=0.01\)) with \(\delta =-0.50\), a “large” effect size. Looking at the median values and the figures, we see that both versions are quite readable, but the modified tests have few to no ratings in the lower part of the readability scale.

Assertions (JUnit, Hamcrest, AssertJ) (Fig. 7h).

There is no significant difference in readability when using standard JUnit assertions compared to Hamcrest or AssertJ assertions. This result confirms findings from Leotta et al. (2018a).
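For reference, the three assertion styles express the same check roughly as follows. This is only a sketch with a hypothetical greet method; the experiment tests contain more context.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.hamcrest.Matchers.is;
import org.junit.jupiter.api.Test;

class GreetingTest {

    private String greet(String name) {   // hypothetical method under test
        return "Hello, " + name + "!";
    }

    // Standard JUnit assertion.
    @Test
    void greetWithJUnit() {
        assertEquals("Hello, Ada!", greet("Ada"));
    }

    // Hamcrest matcher-based assertion.
    @Test
    void greetWithHamcrest() {
        org.hamcrest.MatcherAssert.assertThat(greet("Ada"), is("Hello, Ada!"));
    }

    // AssertJ fluent assertion.
    @Test
    void greetWithAssertJ() {
        org.assertj.core.api.Assertions.assertThat(greet("Ada")).isEqualTo("Hello, Ada!");
    }
}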

Unnecessary Try Catch (Fig. 7i).

One test shows a significant difference with \(p=0.01\) and \(\delta =-0.53\), a “large” effect size. With medians of 0.75, the first test is almost very readable in both versions. However, we accidentally introduced an error in the modified version (we declared a variable twice, which is not allowed in Java). The participants noticed this error in their free-text comments, so it might mask the positive effect of the intended modification. The second test, with medians of 0.25, has a very long test name, which the participants criticised. This again might mask the positive effect of the modification.
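The intended modification presumably follows the common pattern of removing a try/catch block that merely wraps a checked exception; the following is a hypothetical sketch of such a change, not one of the experiment tests.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.fail;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import org.junit.jupiter.api.Test;

class ReaderTest {

    // Variant A: a try/catch that only converts an unexpected checked
    // exception into a test failure.
    @Test
    void readsFirstLineWithTryCatch() {
        try {
            BufferedReader reader = new BufferedReader(new StringReader("first\nsecond"));
            assertEquals("first", reader.readLine());
        } catch (IOException e) {
            fail("Unexpected exception: " + e.getMessage());
        }
    }

    // Variant B: the unnecessary try/catch removed; the test simply
    // declares the checked exception.
    @Test
    void readsFirstLine() throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("first\nsecond"));
        assertEquals("first", reader.readLine());
    }
}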

Fixture (Fig. 7j).

We see no significant difference between the two versions, neither in the figure nor in the table. All tests received quite good ratings, which could be caused by the participants’ knowledge of the system under test.
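A hedged illustration of a test fixture (again a constructed example, not one of the experiment tests): common setup is extracted into a @BeforeEach method instead of being repeated inline in each test.

import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class ShoppingListTest {

    private List<String> shoppingList;

    // Fixture variant: common setup extracted into a setup method.
    @BeforeEach
    void setUp() {
        shoppingList = new ArrayList<>();
        shoppingList.add("milk");
        shoppingList.add("bread");
    }

    @Test
    void containsInitialItems() {
        assertEquals(2, shoppingList.size());
    }

    @Test
    void addingAnItemIncreasesSize() {
        shoppingList.add("butter");
        assertEquals(3, shoppingList.size());
    }
    // The inline variant (not shown) would repeat the list creation in each test.
}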


7 Summary, Threats to Validity, and Future Work

This section summarizes the findings, discusses limitations and threats to validity, presents implications for research and practitioners, and outlines future research directions.

7.1 Summary

The main goal of this paper was to combine scientific and practical views on the readability of software test code. We conducted a Systematic Mapping Study (SMS) covering relevant publications from academia to capture the scientific view on the readability of software test code. We complemented the results of the SMS with practical views based on grey literature. Based on the identified influence factors on test code readability, we conducted a controlled experiment in an academic setting with 77 participants to explore the perception of software test code readability.

Unique readability factors in scientific literature. In the scientific literature we identified unique contributions, including readability models and the application of code summaries to test code, which have been proposed and evaluated (see Section 4.2.1).

Individual influence factors have been evaluated in the scientific literature using empirical methods, such as online questionnaires with Likert scales and statistical analysis (see Section 4.2.2).

Differences in scientific and grey literature. In contrast to scientific literature, grey literature provides a wide spectrum of best practices and guidelines concerning the factors that influence the readability of test code. There is a large overlap between scientific and grey literature; however, for the overlapping factors we observed differing views and interpretations (see Section 5.2.2).

Unique readability factors in grey literature. Furthermore, we found five additional factors exclusive to grey literature, e.g., helper structures and test fixtures. Concerning these factors, there exist different views and even conflicting opinions, often related to the applied technology, testing framework, and test level or approach (see Section 5.2.1).

Empirical study of influence factors widely discussed in practice. For half of the investigated modifications (Loops vs. Unrolled Loops; Package Names and If-Structures; Remove Comments; Loops vs. Parameterized; and Split Up Tests), which map to readability factors, we could show a statistically significant influence on test code readability (see Section 6.2.1). The results for the other factors are less clear, which can be attributed to the nature of best practices, which are sometimes only applicable in specific contexts and not in general (e.g., the modification Try Catch vs. AssertThrows, see Section 6.2.1).

7.2 Limitations and Threats to Validity

In this section, we summarize important limitations and threats to validity in context of the literature review (i.e., scientific and grey literature) and the empirical study.

Internal Validity

  • In the context of the Systematic Mapping Study, the keywords, search string, analysis items, and the data extraction and analysis were defined and executed by one of the authors and intensively reviewed and discussed within the author team and with external experts.

  • The controlled experiment setup was initially executed in a pilot run to ensure the consistency of the experiment material. We used a cross-over design of test case samples to avoid bias among the experiment participants.

  • Three unmodified test cases were used as a control group in the A/B testing. The Wilcoxon Rank Sum test does not suggest a significant difference between the ratings provided by participants when comparing groups with the same questionnaire. However, there is a significant effect when comparing control groups of different questionnaires. These results confirm a consistent rating behavior within groups; the significant differences between groups are as expected due to the independent ratings of participants from different groups.

External Validity

  • We conducted a literature review based on the guidelines of Petersen et al. (2015), complemented by a systematic analysis of grey literature (Garousi et al. 2019). Therefore, the analysis identified the most prominent research directions in scientific literature, complemented by practical discussions in non-academic sources (such as blogs). This approach enabled us to identify similar and/or different key topics in academia and industry.

  • Experiment participants were recruited on a voluntary basis from three classes of a master course on software testing at TU Wien. We captured the participants’ background knowledge to assess their experience. Most of the participants work in industry and can be considered “junior professionals”; therefore, the results are also relevant for industrial contexts.

  • We used real-world test cases from open source projects as well as results from software testing exercises to ensure test cases close to industrial practice.

Construct Validity

  • We built on best practices for literature reviews of academic publications (Petersen et al. 2015) and grey literature (Garousi et al. 2019) and followed the experimentation guidelines proposed by Wohlin et al. (2012) for conducting the empirical study.

  • For the controlled experiment, we captured individual test case assessments for A/B tests (i.e., original tests taken from existing projects and slightly modified test cases) based on a 5-point Likert scale.

  • To avoid a bias introduced by the order of questions for the experiment, we reversed the question ordering for half of the experiment groups.

  • To avoid random readability ratings, we asked participants to give reasons for their ratings as free text. Furthermore, the participants were told that their reward (bonus points) was coupled with active participation in the challenge.

  • We tried to select test cases for A/B testing that could be clearly related to individual influence factors. Since the test cases we used were retrieved from real-world projects rather than constructed examples (which could have limited the relevance of our results), we only covered 7 out of the 14 influence factors identified in the literature search. Nevertheless, a certain amount of fuzziness with respect to influence factors may still be present, e.g., as discussed in the results for the modification Try Catch vs. AssertThrows (see Section 6.2.1).

Conclusion Validity

  • We used the Shapiro-Wilk test to check for normality, which would have allowed us to use a parametric statistical test. This approach is also used by Roy et al. (2020a), whose methodology is similar to ours.

  • We used the non-parametric Wilcoxon Rank Sum test, because our groups are unpaired and the Shapiro-Wilk test does not suggest a normal distribution of our result data.

  • We report the effect size with Cliff’s Delta, because it allows an interpretation of the magnitude of the difference between two groups. It is also used by other studies in this field, such as Grano et al. (2018a).

7.3 Implications for Research and Practitioners

This section summarizes the main implications of the SMS (Section 4), the grey literature study (Section 5), and the experiment (Section 6).

Implications for Research

  • For the Software Testing community, we identified influencing factors observed only in grey literature that could initiate additional research with a focus on topics that are of interest to practitioners but have received limited attention from researchers.

  • Researchers in Software Engineering and/or Software Testing can take up the results of the literature review with a focus on replicating and extending the presented research work.

  • The Empirical Research community can build on the SMS protocol, the grey literature protocol, and the study design to replicate and extend the study in different contexts.

  • We selected a representative set of test cases that could be used by researchers (i) to design and develop a method and/or tool to semi-automatically assess the readability of test code and (ii) to apply the test code set for evaluation purposes in different contexts.

  • In the Software Testing community, factors such as setup methods/fixtures, helper methods, and DRYness are widely discussed among practitioners. Considering these factors in test code generation could be useful for generating more readable tests. In a recent study, Panichella et al. (2022) also suggest including capabilities for complex object instantiation in test suite generators.

  • Finally, the findings of the study can serve as input for researchers from the Software Engineering community to improve software maintenance tasks that benefit from readability assessments.

Implications for Practitioners

  • For Software and System Engineering organizations, the results of this work can support software testers and developers in improving test code readability based on guidelines and identified influencing factors.

  • Project and Quality Managers can use the results to set up organization-specific development guidelines that support software development, software testing, and software maintenance and evolution by a team of software experts. Applying best practices might help to improve the quality of test cases and reduce the effort and cost of maintenance activities.

  • Factors with similar views from practitioners and academia include Test Names, Identifier Names, and Test Data. For test and identifier names, both domains agree on the use of naming patterns to achieve consistency across the test suite. For test data, both domains agree on the use of realistic and simple values and on avoiding magic values.

  • However, the experiment results show that applying best practices is no guarantee of improved readability.

7.4 Future Work

The main goal of this article was to investigate the readability of software test code by combining scientific and practical views. We applied a systematic mapping study to analyze the scientific literature, complemented by grey literature. Furthermore, we executed a controlled experiment in a master-level software testing course to investigate the practical implications of a selected but typical set of test cases.

In the future, we plan to replicate the experiment to increase the external validity of the study in academia, complemented by industry participants. Furthermore, we plan to develop and evaluate a maturity model for the readability of test cases that could help quality managers, software test engineers, and software developers in better assessing the quality of test cases (from a readability perspective) to improve software maintenance and evolution.