Empirical Software Engineering, Volume 14, Issue 1, pp 57–92

Assessing IR-based traceability recovery tools through controlled experiments

  • Andrea De Lucia
  • Rocco Oliveto
  • Genoveffa Tortora

Abstract

We report the results of a controlled experiment and a replication performed with different subjects, in which we assessed the usefulness of an Information Retrieval-based traceability recovery tool during the traceability link identification process. The main result achieved in the two experiments is that the use of a traceability recovery tool significantly reduces the time spent by the software engineer with respect to manual tracing. Replication with different subjects allowed us to investigate whether subjects’ experience and ability play any role in the traceability link identification process. In particular, we made some observations concerning the retrieval accuracy achieved by the software engineers with and without the tool support and with different levels of experience and ability.

Keywords

Traceability recovery · Information retrieval · Latent semantic indexing · Singular value decomposition · Program comprehension · Impact analysis

1 Introduction

Traceability refers to the ability to define, capture, and follow the traces left by requirements on other software artefacts and the traces left by those artefacts on requirements (Gotel and Finkelstein 1994; Pinheiro and Goguen 1996). Thus, traceability information helps software engineers to understand the relationships and dependencies among various software artefacts. For this reason, software artefact traceability is widely recognised as an important factor for effectively managing the development and evolution of software systems, as traceability is fundamental in program comprehension, maintenance, impact analysis, and reuse of existing software (Antoniol et al. 2002).

The potential benefits of traceability are well known, as is the impracticability of maintaining traceability links manually. Indeed, the manual management of traceability information is an error prone and time consuming task (Domges and Pohl 1998; Leffingwell 1997). This is the reason why, very often, developers and maintainers do not perform it to an appropriate level of detail, or do not keep traceability information up to date during software development and maintenance (Domges and Pohl 1998; Leffingwell 1997). The need to provide the software engineer with methods and tools supporting traceability recovery has been widely recognised in recent years. In particular, several researchers have recently applied Information Retrieval (IR) (Baeza-Yates and Ribeiro-Neto 1999; Deerwester et al. 1990; Harman 1992) techniques to the problem of recovering traceability links between artefacts of different types (Antoniol et al. 2000a, 2002; Cleland-Huang et al. 2005; De Lucia et al. 2006a, 2007b; Di Penta et al. 2002; Hayes et al. 2003, 2006; Lormans and van Deursen 2006; Marcus and Maletic 2003; Settimi et al. 2004).

Clearly, a research method or tool has a better chance of being transferred to practitioners if its usefulness is investigated through empirical user studies (Pfleeger and Menezes 2000). Unfortunately, until now the evaluation of IR-based traceability recovery methods and tools has been limited to their tracing accuracy, rather than to the analysis of whether such tools actually help the software engineer during traceability recovery. Indeed, user studies are needed in order to analyse how such tools affect the tracing accuracy of software engineers during the link identification process. Preliminary case studies involving users (Antoniol et al. 2002; De Lucia et al. 2007b) confirm the ability of IR-based traceability recovery tools to support the software engineer in discovering untraced links. Unfortunately, determining trends and statistical validity with case studies is often difficult, as is comparing the usefulness of different methods and/or tools (Wohlin et al. 2000).

In De Lucia et al. (2007a) a preliminary controlled experiment was carried out to statistically analyse how the tracing accuracy of the software engineer is affected by the use of an IR-based traceability recovery tool. In particular, the authors evaluated the usefulness of ADAMS Re-Trace (De Lucia et al. 2007b, 2008), the traceability recovery tool of ADAMS (ADvanced Artefact Management System), a fine-grained artefact management system (Bruegge et al. 2006; De Lucia et al. 2004). The experimentation involved 20 first year master students at the University of Salerno, Italy, who had to perform (with and without the tool support) two traceability recovery tasks on a software repository of a completed project. The achieved results demonstrated that the use of a traceability recovery tool significantly improves the tracing accuracy of the software engineer, measured as the harmonic mean of his/her precision and recall (Baeza-Yates and Ribeiro-Neto 1999; Harman 1992). In particular, it significantly reduces the percentage of false positives traced (better precision), although it does not significantly help to recover more correct links (better recall). Moreover, it was observed that the tool significantly reduces the time spent by the software engineer to trace links.

In this paper we present a replication of the same experiment performed with different subjects. In particular, we consider second year master students, who have a different level of experience in performing traceability tasks with respect to first year master students. In this way we were able to analyse the reaction of different categories of users when they use a traceability recovery tool. Moreover, we discriminated subjects according to their level of ability, with the purpose of testing the hypothesis that this is also a relevant influencing factor that should be taken into account when adopting such kinds of tools. In summary, the specific contributions of this work with respect to De Lucia et al. (2007a) are:
  • a replication of the same experiment with other subjects;

  • a comparative analysis of the results of both experiments, in order to analyse the effect of other factors (experience, ability, and traceability task) on the tracing accuracy, in addition to the method used to trace links.

The results achieved in the replication confirm that the tool significantly reduces the time spent by the software engineer to trace links. Moreover, the comparative analysis of the results of both experiments demonstrates that the use of a traceability recovery tool in general improves the tracing accuracy of a software engineer. In particular, the tool helps software engineers with low ability to achieve a tracing accuracy similar to that of software engineers with high ability. Finally, some considerations can also be made with respect to the subjects’ experience.

The paper is organised as follows. Section 2 discusses related work, while Section 3 gives an overview of the tool used in the two experiments. Section 4 describes the design of the controlled experiments, while Section 5 reports the statistical analysis of the achieved results. Sections 6 and 7 discuss the threats to validity and the achieved results, respectively. Finally, conclusions and directions for future work are given in Section 8.

2 Related Work

In the last decade several authors have applied IR (Baeza-Yates and Ribeiro-Neto 1999; Deerwester et al. 1990; Harman 1992) methods to the problem of recovering traceability links between software artefacts. In general, an IR-based traceability recovery tool compares a set of source artefacts against another set of target artefacts and ranks the similarity of all possible pairs of artefacts. It is worth noting that the sets of source and target artefacts might also overlap. The conjecture is that artefacts having a high textual similarity probably share several concepts, so they are likely good candidates to be traced to one another. For this reason, such tools also use some method (e.g., a threshold on the similarity level) to cut the ranked list, presenting to the software engineer only the subset of top links in the ranked list (Antoniol et al. 2002; De Lucia et al. 2007b; Hayes et al. 2006; Lormans and van Deursen 2006; Marcus and Maletic 2003).
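To make this pipeline concrete, the following minimal sketch (ours, not the implementation of any of the cited tools) ranks all source/target pairs by an arbitrary similarity function and cuts the ranked list at a threshold:

```python
from itertools import product

def candidate_links(sources, targets, similarity, threshold):
    """Rank all source/target artefact pairs by textual similarity and
    keep only the pairs scoring at or above the cut threshold."""
    ranked = sorted(
        ((s, t, similarity(s, t)) for s, t in product(sources, targets)),
        key=lambda link: link[2],
        reverse=True,
    )
    return [link for link in ranked if link[2] >= threshold]
```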

2.1 IR-based Traceability Recovery Methods and Tools

The first methods used to recover traceability links between software artefacts were the probabilistic and the vector space models (Antoniol et al. 2002). The former method computes the similarity score as the probability that an artefact is related to another artefact, while the latter method represents artefacts as vectors of terms of a vocabulary extracted from the artefacts themselves and calculates the similarity between two artefacts as the cosine of the angle between the corresponding vectors (Baeza-Yates and Ribeiro-Neto 1999; Harman 1992). In Antoniol et al. (2002) the probabilistic and vector space models were used to recover traceability links between requirements and Java classes and between manual pages and C++ classes. The probabilistic model was also used to recover traceability links between requirements and source code (Antoniol et al. 2000a; Di Penta et al. 2002), as well as between requirements and design artefacts (Cleland-Huang et al. 2005; Zou et al. 2007). Also, the vector space model (Baeza-Yates and Ribeiro-Neto 1999; Harman 1992) was used to recover traceability links between requirements (Hayes et al. 2003, 2006), between maintenance requests and software documents (Antoniol et al. 2000b), between requirements or design documents and defect reports (Yadla et al. 2005), and between several other types of artefacts (e.g., use cases, UML diagrams, code artefacts, test cases) (De Lucia et al. 2006a; Settimi et al. 2004).
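For illustration only (none of the cited studies reduces to this), the vector space similarity between two artefacts can be computed over raw term-frequency vectors; real systems typically add stop-word removal, stemming, and tf-idf weighting:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two artefacts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```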

A common criticism of the vector space model is that it does not take into account relations between terms. Latent Semantic Indexing (LSI) is an extension of the vector space model developed to overcome this problem. Indeed, LSI assumes that there is some underlying or “latent structure” in word usage that is partially obscured by variability in word choice, and uses statistical techniques, namely Singular Value Decomposition (Cullum and Willoughby 1998), to estimate this latent structure. Thus, the retrieval is based on the semantic content of the artefacts rather than their lexical content. As a consequence, a relevant target artefact may be retrieved even if it does not share many literal terms with the source artefact (Deerwester et al. 1990). Promising results have been achieved applying LSI to traceability link recovery. In particular, in Marcus and Maletic (2003) the tracing accuracy of LSI was compared with the vector space and probabilistic models by replicating the case studies presented in Antoniol et al. (2002). The achieved results showed how the latter models require morphological analysis of the text contained in source code and documentation to achieve the same performance as LSI (Marcus and Maletic 2003). LSI was also used to recover traceability links between high level and low level requirements (De Lucia et al. 2006a; Hayes et al. 2006) and between several artefact types (De Lucia et al. 2006a, 2007b; Lormans and van Deursen 2006; Lormans et al. 2006).
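The following numpy sketch illustrates the LSI idea under simplifying assumptions (a precomputed term-by-document matrix and a chosen number k of latent dimensions; the cited tools add term weighting and other refinements):

```python
import numpy as np

def lsi_similarities(term_doc_matrix, k):
    """Project documents into a k-dimensional latent space via truncated SVD
    and return the matrix of pairwise cosine similarities between documents."""
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    docs = (np.diag(s[:k]) @ vt[:k, :]).T             # one row per document
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.where(norms == 0.0, 1.0, norms)  # guard against zero rows
    return docs @ docs.T
```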

Table 1 summarises all the IR-based approaches proposed to address the traceability recovery problem. For each approach the table reports the IR method used, the description of enhancing strategies of the basic IR method, how the approach was evaluated (i.e., by case study and/or user study), and which types of links were recovered. Based on the traceability recovery methods proposed in the literature, several tools have also been implemented. Table 2 classifies them according to the IR method used to recover the links, the enhancing strategies of the recovery technique, and the architecture of the tool. It is important to note that among them only ADAMS Re-Trace (De Lucia et al. 2007a) has been integrated in an artefact management system. A complete survey of all these methods and tools can be found in De Lucia et al. (2007b) and Oliveto (2008).
Table 1

Summary of IR-based traceability recovery methods

| Reference | IR method | Enhancing strategies | Evaluation | Types of links recovered |
|---|---|---|---|---|
| Antoniol et al. (2002) | Probabilistic and vector space models | None | Case study and user study | Links between C++ source code and manual pages and between Java code and functional requirements |
| Antoniol et al. (2000a) | Probabilistic model | Training set | Case study | Links between Java code and functional requirements |
| Di Penta et al. (2002) | Probabilistic model | Training set | Case study | Links between requirements and source code (produced using a RAD development tool) |
| Cleland-Huang et al. (2005) | Probabilistic model | Hierarchical modelling, logical clustering of artefacts, and semi-automated pruning of the probabilistic network | Case study | Links between requirements and code classes |
| Zou et al. (2007) | Probabilistic model | Query term coverage and phrasing | Case study | Links between requirements and design artefacts |
| Duan and Cleland-Huang (2007) | Probabilistic model | Agglomerative hierarchical clustering, K-means clustering, bisecting divisive clustering | Case study | Links between requirements and design artefacts |
| Antoniol et al. (2000b) | Vector space model | None | Case study | Links between maintenance requests and software documents |
| Yadla et al. (2005) | Vector space model | None | Case study | Links between requirements or design documents and defect reports |
| Settimi et al. (2004) | Vector space model | Thesaurus and pivot normalisation weighting score | Case study | Links between requirements and UML artefacts, code, and test cases |
| Hayes et al. (2003) | Vector space model | Key-phrases and thesaurus | Case study | Links between high level and low level requirements |
| Hayes et al. (2006) | Vector space model and LSI | User feedback | Case study | Links between high level and low level requirements |
| De Lucia et al. (2006a) | Vector space model and LSI | User feedback | Case study | Links between high level and low level requirements and between several other artefact types, i.e., use cases, interaction diagrams, code classes, and test cases |
| Marcus and Maletic (2003) | LSI | None | Case study | Links between C++ source code and manual pages and between Java code and functional requirements |
| Lormans and van Deursen (2006), Lormans et al. (2006) | LSI | New strategy for selecting traceability links | Case study | Links between requirements and design artefacts and between requirements and test cases |
| De Lucia et al. (2007a) | LSI | New strategy for selecting traceability links and incremental process | Case study and user study | Links between several artefact types, i.e., use cases, interaction diagrams, code classes, and test cases |

Table 2

Summary of IR-based traceability recovery tools

| Tool name | IR method | Enhancing strategies | Architecture |
|---|---|---|---|
| ADAMS Re-Trace (De Lucia et al. 2008) | Latent semantic indexing | None | Web-based and Eclipse plug-in |
| Poirot:TraceMaker (Lin et al. 2006) | Probabilistic model | Hierarchical modelling | Web-based |
| ReqAnalyst (Lormans and van Deursen 2006) | Latent semantic indexing | None | Web-based |
| RETRO (Hayes et al. 2006) | Vector space model and latent semantic indexing | User feedback | Standalone |
| TraceViz (Marcus et al. 2005) | Latent semantic indexing | None | Eclipse plug-in |

2.2 Evaluation of IR-based Traceability Methods and Tools

All the experiments conducted in these papers were case studies where the links retrieved by the IR-based traceability recovery tool were compared against a traceability matrix provided by the original developers at the end of the process (this matrix is intended to contain the correct links). Indeed, any IR-based traceability recovery tool will fail to retrieve some of the correct links, while on the other hand it will also retrieve links that are not correct. The tracing accuracy of IR-based tools is measured using two IR metrics, namely recall and precision (Baeza-Yates and Ribeiro-Neto 1999; Harman 1992). Both measures take values in [0, 1]. If the recall is 1, all correct links were recovered, though there could be recovered links that are not correct. If the precision is 1, all recovered links were correct, though there could be correct links that were not recovered.

Until now, few user studies have been performed to analyse the actual support given by such tools when used during the traceability recovery process. In Antoniol et al. (2002) the authors presented the results of a preliminary study where they compared the IR-based approaches against a brute-force, “grep”-based traceability link recovery, demonstrating the benefits of a more sophisticated technology, such as an IR method.

De Lucia et al. (2007b) integrated an LSI-based traceability recovery tool in ADAMS (ADvanced Artefact Management System), a fine-grained artefact management system (Bruegge et al. 2006; De Lucia et al. 2004). To validate the traceability recovery method and tool, the authors performed a case study where the tool was used by about 150 users in 17 software development projects. Each project team included undergraduate students with development roles and master students with project and quality management roles. The master students were also in charge of keeping up to date the traceability between the software artefacts produced by their team, using the traceability features of ADAMS. In particular, they had the possibility to trace links manually or with the tool support. At the end of the experimentation the authors analysed the links traced by students to verify whether the traceability recovery tool had been used during the development process. They observed that almost all the links traced by the students were traced with the tool support. Moreover, at the end of the experimentation, students evaluated ADAMS through a questionnaire. The analysis of the answers revealed that students found the tool useful during the traceability recovery process.

The main drawback of these case studies is that determining trends and statistical validity is often difficult, as is comparing the usefulness of different methods and/or tools (Wohlin et al. 2000). In particular, it is necessary to statistically analyse how the tracing accuracy of the software engineer is affected by the use of a traceability recovery tool. For this reason we have conducted controlled experiments aimed at achieving such statistical evidence.

3 The ADAMS Traceability Recovery Tool

In this section we present the traceability recovery tool of ADAMS used in the experimentation, focusing only on its recovery functionality. More details on the tool can be found in De Lucia et al. (2007b, 2008). ADAMS (ADvanced Artefact Management System) is a fine-grained artefact management system (Bruegge et al. 2006; De Lucia et al. 2004) that also stores traceability links useful for impact analysis and change management during software evolution. In particular, ADAMS uses the traceability layer for the propagation of events, such as the modification of the state of an artefact or the creation of a new version of it. ADAMS enables software engineers to manually manage traceability links between artefacts. Obviously, when the number of project artefacts is high, traceability management tends to be a difficult task. For this reason, an LSI-based traceability recovery tool, called ADAMS Re-Trace (De Lucia et al. 2007b, 2008), has been integrated in ADAMS, aiming at supporting the software engineer during traceability link identification.

ADAMS Re-Trace allows the software engineer to recover traceability links between a set of source and target artefacts through a three-step wizard (see Fig. 1). In the first step of the traceability recovery wizard the software engineer selects the source and target categories (i.e., types) and can filter on the names of artefacts, while in the second step he/she selects the artefacts (belonging to the selected categories) he/she is interested in. In the third and final step the software engineer selects the process to use for analysing the candidate links. In particular, he/she can visualise the full ranked list of candidate links (“one-shot” process), or define a similarity threshold to cut the list of candidate links and consider as candidates only the pairs of artefacts with similarity above such a threshold.
Fig. 1

Traceability recovery wizard in ADAMS. Selection of source and target artefact categories (a); selection of source and target artefacts (b); and selection of the approach used to show the list of candidate links (c)

It is worth noting that the lower the similarity threshold used, the higher the number of correct links as well as the number of false positives retrieved. Indeed, an IR-based tool is able to suggest correct links with good precision only in the upper part of the ranked list, where the density of correct links is higher. In the lower part of the list the density of correct links is very low and there is a great predominance of false positives. Therefore, identifying correct links in the lower part of the ranked list is similar to deleting (unclassified) spam messages from the incoming e-mail box: besides being very tedious, the risk of also deleting correct messages is likely to increase with the number of spam messages. For this reason, the traceability recovery tool of ADAMS allows the software engineer to incrementally decrease the similarity threshold, keeping under control the number of validated correct links and the number of discarded false positives. In particular, the process should start with a high threshold that is decreased at each iteration. The links suggested by the tool can be analysed and classified step by step, and the process can be stopped when the effort to discard false positives becomes much higher than the effort to identify new correct links.
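The incremental process can be outlined as the following loop; this is a hypothetical sketch, in which `suggest_links` stands for the tool's retrieval step and `classify` for the software engineer's manual validation of a batch of suggestions:

```python
def incremental_recovery(suggest_links, classify, start=0.95, step=0.05):
    """Decrease the similarity threshold step by step, letting the engineer
    classify each new batch of suggestions until he/she decides to stop."""
    traced, discarded = set(), set()
    threshold = start
    while threshold >= 0.0:
        # only links not classified in earlier iterations are shown
        batch = [link for link in suggest_links(threshold)
                 if link not in traced and link not in discarded]
        correct, false_positives, stop = classify(batch)
        traced |= correct
        discarded |= false_positives
        if stop:  # discarding false positives now costs more than it gains
            break
        threshold -= step
    return traced
```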

The traceability recovery tool of ADAMS maintains for each pair of artefact types the lowest and the last thresholds used by the software engineer in previous traceability sessions. In particular, in the third step of the traceability recovery wizard the software engineer has the possibility to select one of the following thresholds to cut the ranked list (see Fig. 1):
  • the default threshold, i.e. 95%;

  • the lowest threshold used by the software engineer in previous sessions;

  • the last threshold used in the previous session.

Once the source and target artefacts have been selected and the similarity threshold has been fixed, the software engineer can start the traceability recovery session. The tool compares the links retrieved by using LSI (whose similarity values are greater than or equal to the threshold) with the links traced by the software engineer. In this way the tool shows the software engineer only the retrieved links that have not been traced yet. Figure 2 shows the list of links suggested by the tool. As said before, the software engineer can analyse such a list, trace the correct links, and discard the false positives. Moreover, the software engineer can also decrease the threshold and analyse new candidate links.
Fig. 2

Analysis of suggested links in ADAMS

As we can see in Fig. 2, ADAMS Re-Trace also maintains cumulative information concerning the previous recovery iterations (where a higher threshold was used) of the same session, in terms of the number of suggested links and the number of links classified as correct links or false positives (see the top right corner of Fig. 2). In the scenario shown in Fig. 2, the software engineer is performing a traceability recovery session using the incremental process and a threshold equal to 80% to cut the list of candidate links. As we can see, the software engineer previously performed another traceability recovery iteration using a higher similarity threshold, i.e., 90%. In the previous iteration the tool proposed 2 links overall (Suggested links in Fig. 2), and both were classified as correct links by the software engineer (Traced links in Fig. 2). Moreover, the time spent by the software engineer to analyse the list of candidate links is also reported. It is important to note that by analysing this information the software engineer can decide to stop the recovery session when he/she perceives that the cost of discarding false positives is becoming too high compared with the benefits of identifying new correct links.

4 Experiment Design

This section describes in detail the definition, design, and settings of the proposed experimentation following the guidelines by Wohlin et al. (2000) and Juristo and Moreno (2001). According to the two-dimensional classification scheme by Basili et al. (1986), we performed replicated-project studies, as we examined objects across a set of teams and a single project.

4.1 Experiment Definition and Context

The goal of the experiments was to analyse the tracing accuracy of the software engineer, in terms of time spent and links traced, with the purpose of evaluating the usefulness of ADAMS Re-Trace with respect to manual tracing. The quality focus was ensuring better tracing accuracy, while the perspective was both (i) of a researcher, who wants to evaluate how a traceability recovery tool based on an IR technique helps the software engineer during the link identification; and (ii) of a project manager, who wants to evaluate the possibility of adopting the tool within his/her own organisation, depending on the skills of the involved human resources.

4.1.1 Subjects

The study was executed twice at the University of Salerno, Italy, with different subjects. The subjects participating in the two experiments were 20 first year master students attending an advanced course of software engineering (Exp I) and 12 second year master students attending a course of software project management (Exp II). Within each experiment, all the students were from the same class with a comparable level of background, but different levels of ability. All the students had knowledge of both software development and software documentation, as well as of software artefact traceability. Moreover, students involved in the second experiment had participated in software projects with management roles including traceability management responsibilities.

4.1.2 Object

In the context of the experiment, subjects had to perform two traceability recovery tasks over an artefact repository of a completed software project, called EasyClinic, carried out by final year master students at the University of Salerno. The project aimed at developing a software system implementing all the operations required to manage a medical ambulatory. The artefact repository was composed of 30 use cases, 20 interaction diagrams, 63 test cases, and 37 code classes. The traceability matrix provided by the original developers was used to evaluate the results achieved by each subject, i.e. to define the number of links correctly and erroneously traced.

4.1.3 Treatments

The experiment was performed in a controlled laboratory setting and the two traceability recovery tasks that students had to perform were:
T1: recovering traceability links between 30 use cases and 37 code classes (the number of all correct links is 93);

T2: recovering traceability links between 20 interaction diagrams and 63 test cases (the number of all correct links is 83).

The experiment was organised in two laboratory sessions. In each session, subjects used two different methods to perform the traceability recovery tasks: with the first method, they performed the task manually by filling in an empty traceability matrix, while with the second method they performed the task using ADAMS Re-Trace.

4.2 Hypothesis Formulation

The main objective of our study was to analyse how the use of ADAMS Re-Trace affects the time spent and the tracing accuracy of subjects during traceability link identification. When the null hypothesis can be rejected with relatively high confidence, it is possible to formulate an alternative hypothesis, which typically admits a positive effect of ADAMS Re-Trace on the time spent and/or the tracing accuracy of subjects (Wohlin et al. 2000). Thus, we formulated the following null hypotheses:
\({\rm H}_{0_{t}}\): the use of ADAMS Re-Trace does not significantly affect the time spent by the software engineer to trace the links;

\({\rm H}_{0_{ta}}\): the use of ADAMS Re-Trace does not significantly affect the tracing accuracy of the software engineer.

Consequently, the alternative hypotheses are:
\({\rm H}_{a_{t}}\): the use of ADAMS Re-Trace significantly affects the time spent by the software engineer to trace the links;

\({\rm H}_{a_{ta}}\): the use of ADAMS Re-Trace significantly affects the tracing accuracy of the software engineer.

Our study was also devoted to investigating how subjects’ experience (first year vs. second year master students) and ability interact with the use of a traceability recovery tool and affect the time spent and the tracing accuracy of subjects. This required formulating six further null hypotheses:
\({\rm H}_{0_{et}}\): subjects’ experience does not significantly interact with the use of ADAMS Re-Trace in affecting the time spent by the software engineer to trace the links;

\({\rm H}_{0_{at}}\): subjects’ ability does not significantly interact with the use of ADAMS Re-Trace in affecting the time spent by the software engineer to trace the links;

\({\rm H}_{0_{eat}}\): subjects’ ability and experience do not significantly interact with the use of ADAMS Re-Trace in affecting the time spent by the software engineer to trace the links;

\({\rm H}_{0_{eta}}\): subjects’ experience does not significantly interact with the use of ADAMS Re-Trace in affecting the tracing accuracy achieved by the software engineer;

\({\rm H}_{0_{ata}}\): subjects’ ability does not significantly interact with the use of ADAMS Re-Trace in affecting the tracing accuracy achieved by the software engineer;

\({\rm H}_{0_{eata}}\): subjects’ ability and experience do not significantly interact with the use of ADAMS Re-Trace in affecting the tracing accuracy achieved by the software engineer.

The related alternative hypotheses can be easily derived.

4.3 Variable Selection

In the context of the proposed experimentation, we identify the following independent variables, also called factors:
  • Method: the main factor of our study, i.e., performing a traceability recovery task using ADAMS Re-Trace (AT) or manually (MT).

  • Task: as described in Section 4.1.3, the experiment involved two traceability recovery tasks on the EasyClinic artefact repository, indicated with T1 and T2.

  • Lab: the two laboratory sessions, indicated with Lab1 and Lab2.

Other than Method, the experimental hypotheses were defined in terms of two other factors:
  • Experience: the experience level of subjects involved in the first experiment was classified as Low Experience (Low), as these students had knowledge of software artefact traceability but no experience with traceability tasks. On the other hand, we classified as High Experience (High) the subjects involved in the second experiment, because they also had experience with traceability tasks in software projects.

  • Ability: a quantitative assessment of the ability level of each involved subject was obtained by considering the average grades obtained in previous exams. Subjects with average grades below a fixed threshold, i.e., 24/30, were classified as Low Ability (Low), while the remaining ones as High Ability (High).

The main outcomes observed in the study were the time spent and the tracing accuracy of subjects. In order to test the first null hypothesis, i.e. \(H_{0_{t}}\), we considered as dependent variable the time spent by the subject to complete the traceability task. Concerning the hypothesis \(H_{0_{ta}}\), the tracing accuracy of the subjects in each task was assessed using well known IR metrics (Baeza-Yates and Ribeiro-Neto 1999; Harman 1992):
$$\mathit{recall} = \frac{|\mathit{traced} \cap \mathit{correct}|}{|\mathit{correct}|}\,\% \qquad \mathit{precision} = \frac{|\mathit{traced} \cap \mathit{correct}|}{|\mathit{traced}|}\,\%$$
where correct is the set of correct links in the original traceability matrix for the given task and traced is the set of links traced by the subject. Recall measures the percentage of correct links traced by a software engineer, while precision measures the percentage of traced links that are actually correct. Since the two above metrics measure two different concepts, we assessed the global tracing accuracy of a software engineer using the F-measure (the harmonic mean of precision and recall):
$$\textit{F-measure} = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}\,\%$$
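In code, these metrics follow directly from the set definitions; a small sketch, assuming links are represented as hashable pairs of artefact identifiers:

```python
def tracing_accuracy(traced, correct):
    """Recall, precision, and F-measure (in %) of the links traced by a
    subject against the correct links of the original traceability matrix."""
    hits = len(traced & correct)
    recall = 100.0 * hits / len(correct) if correct else 0.0
    precision = 100.0 * hits / len(traced) if traced else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure
```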

4.4 Experiment Design and Procedure

The assignment given to each group of subjects in each experimental session (Lab 1 and Lab 2) followed the counter-balanced experimental design in Table 3. Such a design ensured that each subject worked on different tasks in the two laboratory sessions, using a different method each time. Also, the design permitted considering different combinations of Task and Method treatments in different order across laboratory sessions. It is important to note that the chosen design permitted the use of statistical tests (Two-Way and Three-Way Analysis of Variance (Devore and Farnum 1999)) to analyse the effects of multiple factors.
Table 3

Experiment design

| | Group A | Group B | Group C | Group D |
|---|---|---|---|---|
| Lab1 | T1 / AT | T1 / MT | T2 / AT | T2 / MT |
| Lab2 | T2 / MT | T2 / AT | T1 / MT | T1 / AT |

Subjects performed the tasks individually. In particular, we organised subjects in two sets taking into account the ability level and then randomly distributed them among the laboratory groups, making sure that High and Low ability subjects were equally distributed across groups. Before the experiments, we presented the traceability recovery tool to the subjects, showing how to trace links with the tool support both with and without the incremental approach (De Lucia et al. 2007b). Moreover, we let subjects gain confidence with the tool by performing some simple traceability recovery tasks on a software artefact repository not related to EasyClinic, to avoid biasing the experiment. Finally, right before the experiments, we showed the students a presentation with detailed instructions on the tasks to be performed.

It is important to note that students were not familiar with the EasyClinic project upfront. Such a situation could influence our experiments, because it is very difficult to reach an agreement on what should and should not be a link when there is no domain knowledge. This is the reason why, a month before the experiment execution, we gave the students both the system documentation and the source code. Moreover, meetings between students and system experts were scheduled twice a week to enrich the students’ domain knowledge.

During the experiment each student was provided with the following material:
  • handouts of the introductory presentation and the user guide of the traceability recovery tool;

  • an electronic version of the software documentation and the source code of EasyClinic;

  • ADAMS Re-Trace or an empty traceability matrix to fill in (depending on the treatment);

  • a survey questionnaire (shown in Table 4) to be filled-in after each laboratory session.

The survey questionnaire was filled in by subjects at the end of each laboratory session. Thus, each subject filled in two survey questionnaires. Each questionnaire was composed of questions expecting closed answers according to the Likert scale (Oppenheim 1992) – from 1 (strongly agree) to 5 (strongly disagree) – to assess whether the system’s domain and task were clear, whether subjects had enough time, and other related questions (Q1 to Q5). In addition, for subjects using the tool, the survey investigated the usefulness and usability of the tool (Q6 to Q8).
Table 4

Post-experiment questionnaire

| Id | Question |
|---|---|
| Q1 | I had enough time to perform the lab task |
| Q2 | The domain of the system was perfectly clear to me |
| Q3 | The objectives of the lab were perfectly clear to me |
| Q4 | The task I had to perform was perfectly clear to me |
| Q5 | I experienced no major difficulties in performing the task |
| Q6 | The use of ADAMS Re-Trace was clear to me |
| Q7 | I found the suggestions of the tool (proposed links) useful |
| Q8 | I prefer to apply the incremental approach during the link identification process |

For each Lab, subjects had 3 h available to perform the required Task. After the task was completed, they returned the filled-in survey questionnaire to us. Moreover, subjects not using the traceability recovery tool sent us the filled-in traceability matrix by email, while the links traced by subjects using the tool were automatically stored in the ADAMS database.

5 Experiment Results

After the experiment execution, we collected the links traced by each subject in each task and used the original traceability matrix of EasyClinic to compute the F-measure of each subject. Table 5 reports descriptive statistics of the dependent variables, i.e., time and F-measure, grouped by Method and Task.
Table 5

Descriptive statistics

| Exp | Task | Method | Time (min) | F-measure (%) | Recall (%) | Precision (%) |
|---|---|---|---|---|---|---|
| Exp I | All | MT | 120.00 / 125.20 / 27.22 | 53.41 / 52.38 / 16.06 | 58.44 / 59.72 / 20.33 | 54.55 / 55.64 / 26.06 |
| Exp I | All | AT | 75.00 / 83.75 / 23.78 | 65.47 / 61.36 / 17.16 | 60.84 / 58.81 / 21.77 | 69.81 / 68.54 / 15.97 |
| Exp I | T1 | MT | 115.00 / 115.50 / 22.54 | 49.47 / 47.22 / 17.79 | 74.19 / 67.74 / 22.19 | 39.63 / 38.77 / 16.93 |
| Exp I | T1 | AT | 72.50 / 80.00 / 22.24 | 52.66 / 53.61 / 14.62 | 51.62 / 52.69 / 20.60 | 61.22 / 59.07 / 12.22 |
| Exp I | T2 | MT | 122.50 / 135.00 / 29.06 | 59.91 / 57.52 / 13.01 | 50.00 / 51.69 / 11.39 | 71.97 / 72.51 / 22.70 |
| Exp I | T2 | AT | 87.50 / 87.50 / 25.85 | 71.93 / 69.10 / 16.58 | 62.66 / 64.94 / 22.20 | 82.45 / 78.00 / 13.78 |
| Exp II | All | MT | 142.00 / 138.90 / 14.58 | 75.46 / 71.59 / 14.50 | 64.19 / 67.70 / 16.98 | 77.99 / 78.66 / 16.76 |
| Exp II | All | AT | 87.50 / 85.25 / 13.40 | 75.35 / 75.39 / 13.46 | 65.06 / 65.22 / 11.89 | 74.84 / 73.35 / 19.51 |
| Exp II | T1 | MT | 140.00 / 134.50 / 19.01 | 79.44 / 73.75 / 14.42 | 80.96 / 71.20 / 18.63 | 77.99 / 80.07 / 15.69 |
| Exp II | T1 | AT | 92.50 / 92.17 / 12.36 | 65.13 / 64.98 / 9.76 | 75.35 / 72.24 / 12.40 | 61.02 / 64.19 / 22.49 |
| Exp II | T2 | MT | 142.00 / 143.30 / 7.71 | 66.76 / 69.43 / 15.60 | 62.80 / 64.19 / 16.06 | 77.56 / 77.24 / 19.15 |
| Exp II | T2 | AT | 78.50 / 78.33 / 11.33 | 78.07 / 79.25 / 7.91 | 77.46 / 78.54 / 14.86 | 84.48 / 82.50 / 11.44 |

Each cell reports Median / Mean / Std. Dev. Task: T1 (tracing use cases onto code classes) or T2 (tracing interaction diagrams onto test cases). Method: MT (manual tracing) or AT (ADAMS Re-Trace).

The next sections report the results achieved in the two experiments, analysing the effect on the dependent variables of the main factor (Method) and of other factors. However, we also analysed the effects of the independent variables on the precision and recall achieved by subjects, so Table 5 also shows the descriptive statistics of these variables. Finally, results from the analysis of survey questionnaires and the discussion of threats to validity are also reported.

5.1 Influence of Method

In order to test the first two hypotheses, i.e. \(H_{0_{t}}\) and \(H_{0_{ta}}\), we analysed the effect of Method on the two dependent variables, i.e., time and F-measure. Since the experiments were organised as longitudinal studies, where each subject performed two different traceability recovery tasks with the two possible treatments (with and without the tool support), it was possible to use a paired Wilcoxon one-tailed test (Conover 1998) to analyse the differences exhibited by each subject under the two treatments. It is important to note that the results were intended as statistically significant at α = 0.05. The results of the tests (i.e., p-values) are reported in Table 6. The table also reports descriptive statistics of the differences achieved by subjects and the percentage of positive differences (i.e., % of Positive Effect), obtained by counting the number of subjects that achieved better results performing the task with the tool support.
Table 6

Wilcoxon paired test p-values and descriptive statistics of differences (by subject)

| Exp | Variable | Median | Mean | Std. Dev. | p-value | % of Positive effect |
|---|---|---|---|---|---|---|
| Exp I | Time | −36.00 | −46.30 | 33.78 | 7.1e-05 | 95.00 |
| Exp I | F-measure | 11.50 | 8.92 | 20.37 | 0.048 | 65.00 |
| Exp I | Recall | −7.95 | −0.90 | 26.47 | 0.820 | 40.00 |
| Exp I | Precision | 18.57 | 13.10 | 32.17 | 0.045 | 65.00 |
| Exp II | Time | −60.25 | −49.00 | 19.06 | 0.001 | 100.00 |
| Exp II | F-measure | −5.09 | 0.53 | 14.26 | 0.484 | 33.33 |
| Exp II | Recall | 12.34 | 7.69 | 20.05 | 0.098 | 58.33 |
| Exp II | Precision | −5.62 | −5.31 | 19.50 | 0.867 | 41.67 |
| All | Time | −44.00 | −49.06 | 29.01 | 6.1e-07 | 96.88 |
| All | F-measure | 5.69 | 5.74 | 18.54 | 0.064 | 53.13 |
| All | Recall | −3.30 | 2.32 | 24.29 | 0.430 | 46.88 |
| All | Precision | 4.62 | 6.20 | 29.18 | 0.129 | 56.25 |

p-values below 0.05 denote statistically significant results.
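For reference, a paired one-tailed Wilcoxon test of this kind can be run with a reasonably recent version of scipy; the per-subject times below are placeholders, not the experiment's data:

```python
from scipy.stats import wilcoxon

# per-subject times (minutes) under the two treatments, in the same order
time_mt = [120, 130, 115, 140, 125]  # manual tracing (placeholder values)
time_at = [80, 90, 70, 95, 85]       # ADAMS Re-Trace (placeholder values)

# one-tailed test: does the tool reduce the time (time_at - time_mt < 0)?
stat, p_value = wilcoxon(time_at, time_mt, alternative="less")
print(f"W = {stat}, p = {p_value:.4f}")
```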

Analysing the results of the first experiment (Exp I) we observe that both the null hypotheses can be rejected. This means that the use of ADAMS Re-Trace significantly affects the time spent by the software engineer to trace links as well as his/her retrieval accuracy. Moreover, we observed that the use of the tool significantly affects the precision achieved by the software engineer (p-value = 0.045), although it did not help to trace more correct links (De Lucia et al. 2007a).

Considering the second experiment (Exp II), only the first null hypothesis \((H_{0_{t}})\) can be rejected. This result confirms that the tool significantly reduces the time spent by the software engineer to trace links. Unfortunately, we cannot reject the second null hypothesis \((H_{0_{ta}})\). In particular, the median difference of F-measure is close to zero, indicating that subjects achieved comparable tracing accuracy with or without the tool support. Nevertheless, subjects improved the recall (the mean difference is 7.69) when they performed the task with the tool support, but they achieved a better precision performing the task manually (the mean difference is −5.31). This result is in contrast with the result achieved in the first experiment, where subjects performing the task with the tool support improved the precision (the mean difference is 13.10), but did not improve the recall (the mean difference is −0.90).

The analysis of all data revealed that only the null hypothesis \(H_{0_{t}}\) can be rejected. Moreover, as we can see in Table 6, the mean differences are positive for all the other dependent variables (i.e., F-measure, recall, and precision), revealing an improvement (though not statistically significant) of such metrics when subjects performed the task with the tool support.

5.2 Influence of Experience and Ability

In this subsection we analyse the effect of experience and ability on the time spent and the tracing accuracy of subjects. The analysis was performed on the whole data set, similarly to Ricca et al. (2007) and Wohlin et al. (2000). This permitted the use of parametric statistics and was possible since (i) the experiment design, material, and procedure were exactly the same, and (ii) the way subjects’ ability was evaluated was the same. It is worth noting that the only difference between the students of Exp I and Exp II, i.e., students’ experience, was considered as an experimental factor.

Concerning the time, ANOVA did not reveal any significant interaction between Method and Experience (p-value = 0.644) or between Method and Ability (p-value = 0.472). This means that we can reject neither \({\rm H}_{0_{et}}\) nor \({\rm H}_{0_{at}}\). Moreover, the analysis did not reveal any significant effect of either Experience or Ability on the time (the p-values are 0.313 and 0.488, respectively). Three-way ANOVA by Method, Experience, and Ability also did not reveal any three-way interaction (p-value = 0.273). Thus, \({\rm H}_{0_{eat}}\) cannot be rejected either. In conclusion, the only factor affecting the time was the method used.
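A two-way ANOVA with interaction, like the ones reported in this section, can be expressed with statsmodels; the data frame below is an illustrative placeholder (one row per observation) and the column names are our own assumptions:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# placeholder observations, not the experiment's data
df = pd.DataFrame({
    "time_min": [120, 85, 130, 90, 110, 78, 125, 95],
    "method":   ["MT", "AT", "MT", "AT", "MT", "AT", "MT", "AT"],
    "ability":  ["Low", "Low", "High", "High", "Low", "Low", "High", "High"],
})

# two-way ANOVA of time by Method and Ability, including their interaction
model = ols("time_min ~ C(method) * C(ability)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```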

As shown in the previous subsection, different subjects achieved different tracing accuracy when they performed the task with the tool support. This suggests that, possibly, other factors (e.g., Ability and/or Experience) could have influenced the F-measure or interacted with the main factor of our study (Method). Descriptive statistics for the F-measure, classified according to Ability and Experience, are shown in Table 7. Tables 8 and 9 show the results of the two-way ANOVA by Method & Experience and by Method & Ability, respectively. Other than confirming the absence of a significant effect of the Method factor, ANOVA indicated a significant effect of both Experience (p-value = 1.0e-04) and Ability (p-value = 7.5e-05). Moreover, ANOVA also revealed a significant interaction between Ability and the main factor, i.e., Method (p-value = 0.003), while no significant interaction was revealed between Method and Experience (p-value = 0.225). This means that we cannot reject the hypothesis \({\rm H}_{0_{eta}}\), while we can reject the hypothesis \({\rm H}_{0_{ata}}\).
Table 7

F-measure by Ability and Experience: descriptive statistics

| Experience/ability (# obs.) | MT Median | MT Mean | MT Std. dev. | AT Median | AT Mean | AT Std. dev. |
|---|---|---|---|---|---|---|
| Any/low (32) | 54.20 | 48.43 | 13.73 | 64.00 | 64.08 | 9.72 |
| Any/high (32) | 73.09 | 71.66 | 11.97 | 72.22 | 67.55 | 14.29 |
| Low/any (40) | 53.41 | 52.38 | 16.06 | 65.47 | 61.36 | 17.16 |
| High/any (24) | 75.46 | 71.59 | 14.50 | 87.50 | 85.25 | 13.40 |
| Low/low (22) | 50.63 | 44.79 | 15.28 | 62.86 | 62.30 | 8.49 |
| Low/high (18) | 63.13 | 63.29 | 7.00 | 60.00 | 61.72 | 14.53 |
| High/low (10) | 55.00 | 56.44 | 2.52 | 73.54 | 68.00 | 12.10 |
| High/high (14) | 80.84 | 82.41 | 7.30 | 73.48 | 75.06 | 10.57 |

Table 8

ANOVA table of F-measure by Method and Experience

| Source | DF | Sum of squares | Mean square | F value | p-value |
|---|---|---|---|---|---|
| Method | 1 | 533.300 | 533.300 | 3.034 | 0.087 |
| Experience | 1 | 3055.600 | 3055.600 | 17.387 | 1.0e-04 |
| Interaction | 1 | 264.300 | 264.300 | 1.504 | 0.225 |
| Residual | 60 | 10544.400 | 175.700 | | |
| Total | 63 | 14397.600 | | | |

p-values below 0.05 denote statistically significant results.

Table 9

ANOVA table of F-measure by Method and Ability

| Source | DF | Sum of squares | Mean square | F value | p-value |
|---|---|---|---|---|---|
| Method | 1 | 533.300 | 533.300 | 3.384 | 0.071 |
| Ability | 1 | 2850.500 | 2850.500 | 18.090 | 7.5e-05 |
| Interaction | 1 | 1550.900 | 1550.900 | 9.900 | 0.003 |
| Residual | 60 | 9453.900 | 157.600 | | |
| Total | 63 | 14397.600 | | | |

p-values below 0.05 denote statistically significant results.

The significant interaction between Method and Ability can be better analysed by looking at the interaction plot shown in Fig. 3a. The figure indicates that High ability subjects achieved, on average, a better F-measure level performing the task manually, while Low ability subjects achieved, on average, better tracing accuracy performing the task with the tool support.
Fig. 3

Interaction between Method and Ability: effect on F-measure (a) and on precision (b)

In order to better assess the role of the Ability during the traceability recovery process, we also analysed the effect of such a factor on both recall and precision. The results confirmed the absence of a significant effect of the Method on both recall (p-value = 0.584) and precision (p-value = 0.193). Moreover, ANOVA indicated a direct significant effect of Ability on both recall (p-value = 0.023) and precision (p-value = 0.002). In the latter case, there is also a significant interaction of Method with Ability (p-value = 0.021). Such a result can be better analysed by looking at the interaction plot shown in Fig. 3b. The figure shows that the interaction between Method and Ability affects the precision level in the same way it affects the F-measure.

The role of Experience during the traceability recovery process is similar to the role played by Ability. Even though ANOVA did not reveal any statistical interaction between Method and Experience on the F-measure, we also analysed the effect of this factor on both recall and precision. The results revealed the absence of a significant effect of the Method on both recall (p-value = 0.581) and precision (p-value = 0.208). ANOVA also indicated a direct significant effect due to Experience on both recall (p-value = 0.006) and precision (p-value = 0.007). In the latter case, there is also a significant interaction of Method with Experience (p-value = 0.049). Analysing the interaction, we observed that, similarly to Ability, High experience subjects achieved, on average, a better precision level performing the task manually, while Low experience subjects increased their precision level performing the task with the tool support.

Three-way ANOVA by Method, Experience, and Ability indicated no three-way interaction (p-value = 0.988), while confirming the presence of a two-way interaction between Method and Ability (p-value = 6.7e-04). Therefore, we could not find any statistical support to reject \(H_{0_{eata}}\) either.

5.3 Influence of Other Factors

We considered all the data, focusing on the effect of the other independent variables, i.e., Lab and Task, on both time and F-measure. Concerning the first dependent variable, i.e., time, ANOVA revealed no significant effect of Lab (p-value = 0.067) or Task (p-value = 0.499), and no significant interaction between the factors.

Regarding the tracing accuracy of the software engineer, i.e., F-measure, ANOVA indicated no significant effect of the Lab (p-value = 0.948). Considering the effect of Task (see Table 10), ANOVA revealed a significant effect of such a factor on the F-measure (p-value = 0.008). As we can see, the analysis also confirmed the absence of a statistical effect of the Method.
Table 10

ANOVA table of F-measure by Method and Task

| Source | DF | Sum of squares | Mean square | F value | p-value |
|---|---|---|---|---|---|
| Method | 1 | 533.300 | 533.300 | 2.673 | 0.107 |
| Task | 1 | 1507.400 | 1507.400 | 7.555 | 0.008 |
| Interaction | 1 | 385.900 | 385.900 | 1.934 | 0.169 |
| Residual | 60 | 11971.000 | 199.500 | | |
| Total | 63 | 14397.600 | | | |

p-values below 0.05 denote statistically significant results.

In order to better understand the effect of Task on the F-measure, we also analysed the effect of Task on both recall and precision. Concerning the recall, ANOVA revealed no significant effect of either Method (p-value = 0.586) or Task (p-value = 0.756). However, ANOVA indicated a significant interaction of Method with Task (p-value = 0.010). Figure 4 shows the interaction plot. As we can see, the figure highlights a strong interaction between Method and Task, indicating that in task T1 subjects achieved, on average, a better recall level without the tool support, while in task T2 subjects achieved, on average, a better recall level using the tool. Finally, focusing on the precision, ANOVA confirmed the absence of a statistical effect due to Method (p-value = 0.191) and indicated a significant effect due to Task (p-value = 1.2e-04). Moreover, ANOVA did not reveal any interaction of Method with Task (p-value = 0.903).
Fig. 4

Effect of Method & Task on recall

5.4 Survey Questionnaire Results

We analysed the feedback provided by subjects after each Lab aiming at better understanding the experimental results. Figures 5, 6, and 7 show the boxplots of the results grouped by Task, Experience, and Ability, respectively. Statistical significance of differences has been tested using a paired Wilcoxon one-tailed test (to analyse the effect of Task) and a Mann-Whitney one-tailed test (Conover 1998) (to analyse the effect of Experience and Ability).
Fig. 5

Answers of subjects by Task

Fig. 6

Answers of subjects by Experience

Fig. 7

Answers of subjects by Ability

By looking at the more general questions, we noticed that, overall, subjects had enough time to perform the task (Q1). However, as expected, first year master students experienced more difficulty in performing the tasks (Q5) than second year master students (p-value = 0.007). The same also held for low ability subjects (p-value = 0.010), while no significant difference emerged between tasks (p-value = 0.384). The objectives (Q3) and the laboratory tasks (Q4) were clear, even if, as expected, second year master students better understood both the objectives (p-value = 0.034) and the laboratory tasks (p-value = 0.031). Concerning these aspects, no significant difference emerged between subjects with different levels of ability. Regarding the system domain knowledge (Q2), the results show that during the experiment subjects perceived their domain knowledge as acceptable, and no significant difference emerged between subjects or between tasks in this regard.

The questions related to the use of the tool revealed that in general the use of the tool was clear (Q6). No significant difference emerged between subjects or tasks. Regarding the usefulness of the proposed links (Q7), subjects generally found the suggestions of the tool useful. However, subjects considered the tool more useful when performing task T2 (p-value = 0.005). Finally, subjects generally preferred to apply the incremental approach in both tasks (Q8), and no significant difference emerged between subjects.

6 Threats to Validity

In this section we discuss the threats that can affect the validity of our results, focusing on conclusion, construct, internal, and external validity.

6.1 Conclusion Validity

Conclusion validity concerns the relationship between the treatment and the outcome. Attention was paid not to violate the assumptions made by the statistical tests. Whenever the conditions necessary to use parametric statistics did not hold (e.g., in the analysis of each experiment's data), we used non-parametric tests, in particular the Wilcoxon test for paired analyses. We dealt with the random heterogeneity of subjects by introducing the Ability and Experience factors, and by analysing their interaction with Method, as well as the three-way interaction among the three factors.

Finally, the survey questionnaires, mainly intended to get qualitative insights, were designed using standard approaches and scales (Oppenheim 1992). This allowed us to use statistical tests, i.e., the Mann-Whitney one-tailed test, to also analyse differences in the feedback.

6.2 Construct Validity

Construct validity threats concern the relationship between theory and observation. Being an aggregate measure of precision and recall, the F-measure well reflects the global retrieval accuracy of the software engineer. It is worth noting that in some cases, especially in industry (Hayes et al. 2006; Lormans et al. 2006; Zou et al. 2007), a high recall may be desirable. Even if recall is very important, we also wanted to see the differences in precision between subjects in the different cases, because the IR-based traceability recovery tools tend to balance a high recall with a low precision. By analysing the F-measure we were also able to take into account false positives traced as correct links by the software engineers (i.e., the tracing precision) with the two approaches and to get an overall balance of recall and precision.

It is important to note that the point of view of our definition of precision and recall is not that of a tool proposing candidate links, but that of a software engineer tracing links. Thus, while some of these links might be correctly traced, others can be wrongly traced. Moreover, we need to distinguish between manual and tool-based traceability recovery. In the first case (manual tracing) there is no tool support and thus no candidate links: here we just evaluate the tracing accuracy of the software engineer. In the latter case, the tool suggests some candidate links (retrieved links) and the software engineer classifies them: links considered correct are traced, while links considered false positives are discarded. Besides the fact that some correct links might not be recovered and suggested by the tool, it is worth noting that the suggestions of the tool might affect the behaviour of the software engineer, for example inducing him/her to trace false positives. Moreover, the software engineer can also discard some correct links. So, in this case we evaluate how the support of a tool affects the tracing accuracy of the software engineer.

Another point is related to our definition of tracing errors. In general, tracing errors include both errors of inclusion and errors of exclusion, whereas by tracing errors we refer only to errors of inclusion (of false positives). Indeed, while the software engineer using the tool can discard some correct links proposed by it (an exclusion error), this cannot happen when conducting the task manually, as there are no candidate links. Thus, it is not possible to compare exclusion errors between the two approaches; rather, we simply use recall to evaluate how many correct links were not traced, either with the tool support or manually.
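
A minimal sketch of how these measures can be computed from the software engineer's point of view is shown below; all link sets are hypothetical and serve only to make the distinction between inclusion and exclusion errors concrete:

```python
# A minimal sketch of the tracing accuracy measures discussed above, from
# the software engineer's point of view. All link sets are hypothetical;
# links are modelled as (source artefact, target artefact) pairs.
correct_links = {("UC1", "ClassA"), ("UC2", "ClassB"), ("UC3", "ClassC")}

# Tool-supported tracing: the engineer classifies the candidate links
# suggested by the tool, tracing some and discarding the others.
suggested_by_tool = {("UC1", "ClassA"), ("UC2", "ClassB"), ("UC2", "ClassC")}
traced_with_tool  = {("UC1", "ClassA"), ("UC2", "ClassC")}  # one false positive

true_positives = traced_with_tool & correct_links
precision = len(true_positives) / len(traced_with_tool)
recall    = len(true_positives) / len(correct_links)

# Exclusion errors exist only with the tool: correct links that were
# suggested but discarded. In manual tracing there are no candidate links,
# so only recall captures the correct links left untraced.
exclusion_errors = (suggested_by_tool & correct_links) - traced_with_tool

print(f"precision={precision:.2f}, recall={recall:.2f}")
print(f"exclusion errors: {exclusion_errors}")
```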

Concerning the interactions between different treatments, these were mitigated by the chosen experimental design. Regarding the Ability factor, we used only two levels, Low and High, discriminating students according to the average grades they obtained in previous exams. Clearly, more than two levels could have been used; nevertheless, analyses performed with more levels did not yield different or contrasting results. To avoid social threats due to evaluation apprehension, students were not evaluated on their tracing accuracy in the lab. Moreover, subjects were not aware of the experimental hypotheses.

Another important threat is related to the process adopted during traceability link identification with the tool support. In particular, we did not impose the use of a particular approach, i.e., either the “one-shot” process (where the full ranked list of candidate links is proposed at once) or the incremental process (where a similarity threshold is used and the links are classified step by step). During the training phase we presented both traceability recovery processes. Moreover, although we knew the distribution of the links in this specific case, we did not suggest any guidance to the students, as this would have biased the experiment. Furthermore, previous studies have shown that such a distribution changes depending on the project and on the type of artefacts, so it is not possible to define general guidance (De Lucia et al. 2007b). Thus, the subjects involved in our experimentation only knew that IR tools produce a high density of correct links in the upper part of the ranked list and a low density of correct links in the lower part.
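
To make the two processes concrete, the following is a minimal sketch of both; the candidate links, similarity values, and function names are hypothetical and not taken from the tool's actual implementation:

```python
# A minimal sketch of the two traceability recovery processes described
# above. The candidate links and similarity values are hypothetical.
candidate_links = [  # (source, target, similarity), ranked by the IR engine
    ("UC1", "ClassA", 0.91),
    ("UC2", "ClassB", 0.84),
    ("UC3", "ClassC", 0.62),
    ("UC1", "ClassB", 0.41),
    ("UC4", "ClassD", 0.23),
]

def one_shot(links):
    """'One-shot' process: the full ranked list is proposed at once and the
    software engineer classifies it from top to bottom."""
    return list(links)

def incremental(links, start=0.9, step=0.1):
    """Incremental process: the similarity threshold is lowered step by
    step; at each step only the newly admitted links are shown. The
    engineer stops when too many false positives start to appear."""
    threshold, shown = start, 0
    while shown < len(links):
        batch = [link for link in links[shown:] if link[2] >= threshold]
        if batch:
            yield threshold, batch
            shown += len(batch)
        threshold -= step

for threshold, batch in incremental(candidate_links):
    print(f"threshold {threshold:.1f}: classify {batch}")
```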

Finally, our experimentation required that the traceability recovery task be completed without interruption, to keep the variables under control. In practice, however, traceability management would be conducted over several sessions, in which the software engineers and the experts of the application domain might incrementally recover links and also change their minds after discussion. Better results (in particular, better recall) could probably be achieved if the traceability recovery tasks were performed across several sessions.

6.3 Internal Validity

Internal validity concerns external factors that may affect the independent variables. An important threat is the learning effect experienced by subjects between labs. This is mitigated by the experimental design: over the two labs, subjects worked on different traceability tasks and with the two different levels of the main factor (tracing links with and without the tool support). Concerning the traceability recovery tasks, subjects had to recover traceability links between use cases and code classes and between interaction diagrams and test cases. Admittedly, tracing use cases onto test cases would have been closer to real practice; however, since we had only four types of artefacts, we assigned two types to one task and the other two to the other task to avoid a learning effect. Even though the experimental design mitigates the learning effect, there is still the risk that, during the labs, subjects learned how to improve their tracing accuracy. We tried to limit this effect by means of a preliminary training phase. Moreover, Lab was included as a factor in the analysis of the results, and the ANOVA showed no significant effect of Lab. Finally, no subject abandoned the experiment, and the survey showed that everything was clear.

6.4 External Validity

External validity concerns the generalisation of the findings. This kind of threat is always present when experimenting with students. The selected subjects represent a population of students specifically trained in software development technologies and software engineering methods. Moreover, all of them were master students who either had some professional experience or had worked on industrial projects during their Bachelor thesis. This makes them comparable to junior developers in industry.

Another important point is that the students were not familiar with the application domain and the artefacts of the software system: it was not their project, but rather a practical task assigned as an exercise. We tried to limit this threat by organising meetings before the experiment aimed at giving the students acceptable knowledge of the system domain, so that they did not have to spend extra time comprehending the documentation during the traceability recovery task. Unfortunately, we did not perform a pre-experiment assessment of the subjects' knowledge of the application domain; we only derived a subjective evaluation of it from the post-experiment questionnaire (see question Q2 in Table 4). Of course, this evaluation also reflected the training that the students received to become familiar with the application domain and the software system. In other words, the answers to the questionnaire were a self-evaluation that students made with respect to the training they received. For this reason these answers have mainly a subjective relevance (a positive or negative answer also depends on a student's self-esteem and personality), and we therefore did not consider this information in our statistical analysis.

The experience of the students involved in the experimentation could be another important threat. In particular, we observed that a limited increase in experience (first-year vs second-year master students) impacts tracing accuracy. However, this result cannot be straightforwardly generalised to claim that any increase in experience, or differences in workplace experience, would have the same impact. Moreover, the working pressure and the overall environment in industry are different; thus, replicating the study in industry is highly desirable.

Concerning the artefact repository used in the experiment, it is worth noting that it is not comparable to industrial projects, but it has previously been used to evaluate IR methods for recovering traceability links (De Lucia et al. 2007b). However, this type of experimentation has to be conducted in a controlled way and within a limited amount of time. For this reason, it is not easy to use repositories of larger size; and even if larger repositories were used, there would still be the need to select a subset of source and target artefacts when designing the traceability recovery tasks.

7 Discussion and Lessons Learned

We conducted two experiments with master students aimed at analysing how the use of ADAMS Re-Trace affects the time spent and the tracing accuracy of subjects during traceability link identification. The achieved results provided us with a number of lessons learned:
  • the tool reduces the time spent to perform the task: the Wilcoxon test reveals that the tool significantly affects the time in both experiments. This is also confirmed by the ANOVA analysis, which reveals a significant influence of the method used to trace links on time (p-value = 8.0e-15). Further ANOVA analyses reveal no influence on time of the other independent variables, i.e., Task and Lab. All these results strongly support the rejection of the null hypothesis H0t. Although this result is not really surprising, the reduction of the time spent by the software engineer using the tool is an important result, as the tediousness of link identification is one of the main factors contributing to poor traceability link management (Domges and Pohl 1998; Leffingwell 1997);

  • in general, the tool affects the tracing accuracy: the tool positively affects the tracing accuracy of software engineers (F-measure). However, the effect of Method is statistically evident only in the experiment performed with subjects with a low level of experience (Exp I). Nevertheless, the mean difference of F-measure and recall is positive also in Exp II, indicating a positive (though not statistically evident) effect of the method on the accuracy of the software engineer. We also observed that the recall in the manual experimental tasks was low. In our opinion, this might be due to two factors: (i) the software engineers were not familiar with the application domain, and (ii) the traceability recovery task had to be completed in a limited amount of time without interruption. However, both factors also held when the subjects performed the task with the tool support. With more traceability recovery sessions, the students would probably have conducted a link coverage analysis and more focused traceability recovery sessions on subsets of source and target artefacts. From our results, the main benefit of using the tool is a significant effort reduction with comparable tracing accuracy. Obviously, given an unlimited amount of time the software engineer could potentially recover all links manually, but usually that much time cannot be dedicated to this task. Thus, if the tool is able to recover about 60–70% of the links in much less time and with good precision, the software engineer can then use some link coverage analysis and start looking for missing links in a more focused way. This second step might also be done without the tool support, or using the tool on a reduced and focused subset of source and target artefacts. In particular, we observed that the set of correct links traced with the tool support and the set of correct links traced manually are largely disjoint (on average, the overlap is about 60% of the traced links; see the sketch after this list). Thus, some correct links traced with the tool support were missed when performing the task manually, and vice versa. On average, about 30% of the links traced manually were not recovered with the tool; as the subjects used the incremental process, these links were probably in the lower part of the ranked list. This suggests that tracing accuracy might be improved by performing several traceability recovery sessions combining manual and semi-automatic tracing;

  • experts are able to trace more correct links: in general, experts are able to trace more correct links using the tool, but they also make more tracing errors than software engineers with a low level of experience (see Table 6). ANOVA revealed that the better recall achieved by subjects in the second experiment was due to their different level of experience (p-value = 0.006). In particular, the subjects involved in the second experiment had previous experience with traceability tasks, so they knew the density of links in a traceability matrix; for this reason, they traced more links than subjects with a low level of experience. They also analysed a higher number of links suggested by the tool, as they used a lower threshold during the traceability recovery process (p-value = 0.006). On one hand, this approach increases the probability of tracing more links (better recall); on the other hand, it also increases the probability of tracing false positives (worse precision). In particular, we observed that when subjects received a suggestion from the tool and had doubts about a link, they tended to trace it. It is important to note that this result is not completely negative: especially for impact analysis, it is better to have more correct links plus some false positives than fewer correct links and fewer false positives;

  • ability is an influencing factor: subjects with high ability achieved better tracing accuracy than subjects with low ability. This result confirms that traceability tasks are difficult and that high ability is required to build an acceptable traceability matrix. However, we observed that with tool support low-ability subjects are able to achieve a tracing accuracy similar to that of high-ability subjects. Thus, a traceability recovery tool reduces the gap between low- and high-ability subjects;

  • the approach used to trace links with the tool support might be an influencing factor: even though we did not impose the use of a particular approach, almost all the students used the incremental one. With this approach, students incrementally classified the proposed links and stopped the traceability link recovery process when they perceived that the number of false positives was becoming too high with respect to the number of correct links traced. In particular, we observed that subjects stopping the traceability recovery process at a threshold lower than or equal to 50% (see footnote 2) achieved higher recall than subjects stopping the process at a threshold higher than 50% (p-value = 0.006). This suggests that better recall could be achieved by providing the software engineer with the full ranked list of candidate links (ordered by decreasing similarity). It is worth noting that, in general, the whole ranked list contains a high density of correct links in its upper part and a low density of such links in its lower part. Accordingly, we expect that the full ranked list approach might result in better recall but also in worse precision with respect to the incremental approach, where the software engineer tends to concentrate only on the upper part of the list. This expectation is supported by the fact that subjects who used a high threshold achieved better precision than subjects who used a low threshold (p-value = 0.003);

  • the retrieval accuracy of the IR engine is an influencing factor: ANOVA reveals a significant effect of Task on the retrieval accuracy. Even though the subjects considered the two traceability tasks to be of comparable difficulty (in general, they did not experience major difficulties performing either task), the results achieved in task T2 were better than those achieved in task T1 (p-value = 7.2e-04). A one-way ANOVA by Task revealed that the effect of the task is statistically evident when subjects performed the task with the tool support (p-value = 2.0e-04), while it is not statistically evident when subjects traced links manually (p-value = 0.442). We also observed an interaction between Method and Task: subjects performing task T1 achieved, on average, a better recall without the tool support, while subjects performing task T2 achieved, on average, a better recall using the tool. All these considerations suggest that the influence of the task is due to the retrieval accuracy of the IR engine of the traceability recovery tool. In particular, the accuracy of the IR engine is better for task T2 (tracing interaction diagrams onto test cases) than for task T1 (tracing use cases onto code classes), as discussed in De Lucia et al. (2007b). This is also reflected in the perceived usefulness of the tool: analysing the survey questionnaire, we observed that subjects considered the tool more useful when performing task T2 than task T1. Unfortunately, the accuracy of IR-based tools is not good enough when the retrieval task involves source code artefacts, owing to the poor verbosity of this artefact category. The retrieval accuracy of IR-based traceability recovery tools could probably be improved by encouraging programmers to write source code of better quality (in terms of identifiers and comments), with the aim of improving the textual similarity between source code and high-level artefacts (De Lucia et al. 2006b). In this way, IR-based traceability recovery tools should be able to provide more effective support.
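
As an illustration of the overlap analysis referenced in the second lesson above, the following is a minimal sketch of how the overlap between tool-supported and manual tracing can be quantified; the link sets are hypothetical:

```python
# A minimal sketch of the overlap analysis between the correct links traced
# with the tool support and those traced manually. Link sets are hypothetical.
correct_with_tool = {("UC1", "C1"), ("UC2", "C2"), ("UC3", "C3"), ("UC4", "C4")}
correct_manual    = {("UC2", "C2"), ("UC3", "C3"), ("UC5", "C5")}

overlap = correct_with_tool & correct_manual
union   = correct_with_tool | correct_manual

# Share of all correctly traced links that both approaches found.
overlap_ratio = len(overlap) / len(union)
# Share of manually traced links that the tool-supported process missed.
missed_by_tool = len(correct_manual - correct_with_tool) / len(correct_manual)

print(f"overlap: {overlap_ratio:.0%} of all correctly traced links")
print(f"missed by tool-supported tracing: {missed_by_tool:.0%} of manual links")
```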

8 Conclusion and Future Work

In the last decade, several IR-based tools have been proposed to support the software engineer during link identification. Unfortunately, previous work has concentrated on evaluating the accuracy of the automated tools on software repositories of completed projects. Since a typical tracing process requires a human analyst to make the final decisions, studying human decision-making when working with automated traceability tools is needed in order to predict the actual benefits of IR-based traceability recovery tools.

This paper reported the results of two controlled experiments aimed at investigating the benefits of the LSI-based traceability recovery tool of ADAMS (De Lucia et al. 2007b, 2008). The experimentation involved master students with different levels of experience in two traceability tasks (with and without the use of the tool). Having different groups of subjects allowed us to analyse how different categories of users react when using a traceability recovery tool. We also discriminated the subjects of both experiments according to their level of ability, with the purpose of testing the hypothesis that ability is also a relevant influencing factor to be taken into account when adopting such tools.

The achieved results demonstrate that the tool significantly reduces the time spent by the software engineer to trace links. Moreover, the comparative analysis of the results of both experiments shows that the use of a traceability recovery tool in general improves the tracing accuracy of a software engineer. In particular, the tool helps software engineers with low ability to achieve a tracing accuracy similar to that of software engineers with high ability. Finally, we also observed that experts are able to trace more correct links using the tool, but they also make more tracing errors than software engineers with a low level of experience. The analysis of the results also provided a number of lessons learned that can be used to establish baseline expectations and a roadmap for studying the work of human experts with traceability tools, and that highlight possible outside influences (such as specific analyst characteristics) on the results of analysts' work. Summarising, the achieved results could be used by researchers to improve the support given by IR-based traceability recovery tools during link identification, and by project managers to evaluate the possibility of adopting such a tool within their own organisations, depending on the skills of the involved human resources.

As always happens with empirical studies, replication in different contexts, with different subjects and objects, is the only way to corroborate our findings. Replicating the experiment with professional software engineers and adding new treatments that explicitly consider different approaches to tracing links with the tool support are part of our future work agenda. In particular, we plan to compare different traceability recovery processes, i.e., the incremental and the “one-shot” processes. Moreover, replicating the experiment while giving subjects the possibility to perform a link coverage analysis and more focused traceability recovery sessions would also be useful to analyse whether such an approach actually improves the tracing accuracy of the software engineer.

Footnotes

  1. We decided to select such a threshold as it represents the median of the possible grades for any exam to be passed by a student in an Italian university (min 18/30, max 30/30).

  2. We decided to use such a threshold to discriminate between low and high thresholds as it represents the median of the possible thresholds used to cut the ranked list.


Acknowledgements

We would like to thank the anonymous reviewers for their detailed, constructive, and thoughtful comments that helped us to improve the presentation of the results in this paper. We are very grateful to Dr. Massimiliano Di Penta of University of Sannio, Italy, for his constructive comments that helped us to improve the presentation of the experimental results in this paper. Special thanks are also due to the students who were involved in the experiment as subjects. The work described in this paper is supported by the project METAMORPHOS (MEthods and Tools for migrAting software systeMs towards web and service Oriented aRchitectures: exPerimental evaluation, usability, and tecHnOlogy tranSfer), funded by MiUR (Ministero dell’Università e della Ricerca) under grant PRIN-2006-2006098097.

References

  1. Antoniol G, Casazza G, Cimitile A (2000a) Traceability recovery by modelling programmer behaviour. In: Proceedings of 7th working conference on reverse engineering. IEEE CS, Brisbane, pp 240–247
  2. Antoniol G, Canfora G, Casazza G, De Lucia A (2000b) Identifying the starting impact set of a maintenance request. In: Proceedings of 4th European conference on software maintenance and reengineering. IEEE CS, Zurich, pp 227–230
  3. Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983
  4. Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley, Reading
  5. Basili VR, Selby RW, Hutchens DH (1986) Experimentation in software engineering. IEEE Trans Softw Eng 12(7):758–773
  6. Bruegge B, De Lucia A, Fasano F, Tortora G (2006) Supporting distributed software development with fine-grained artefact management. In: Proceedings of 2nd international conference on global software engineering. Florianopolis, 16–19 October 2006, pp 213–222
  7. Cleland-Huang J, Settimi R, Duan C, Zou X (2005) Utilizing supporting evidence to improve dynamic requirements traceability. In: Proceedings of 13th IEEE international requirements engineering conference. IEEE CS, Paris, pp 135–144
  8. Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley, New York
  9. Cullum JK, Willoughby RA (1998) Lanczos algorithms for large symmetric eigenvalue computations, vol 1, chapter real rectangular matrices. Birkhauser, Boston
  10. De Lucia A, Oliveto R, Sgueglia P (2006a) Incremental approach and user feedbacks: a silver bullet for traceability recovery. In: Proceedings of 22nd IEEE international conference on software maintenance. IEEE CS, Philadelphia, pp 299–309
  11. De Lucia A, Di Penta M, Oliveto R, Zurolo F (2006b) Improving comprehensibility of source code via traceability information: a controlled experiment. In: Proceedings of 14th IEEE international conference on program comprehension. IEEE CS, Athens, pp 317–326
  12. De Lucia A, Fasano F, Francese R, Tortora G (2004) ADAMS: an artefact-based process support system. In: Proceedings of 16th international conference on software engineering and knowledge engineering. KSI, Banff, pp 31–36
  13. De Lucia A, Oliveto R, Tortora G (2007a) Recovering traceability links using information retrieval tools: a controlled experiment. In: Proceedings of international symposium on grand challenges in traceability. ACM, Lexington, pp 46–55
  14. De Lucia A, Fasano F, Oliveto R, Tortora G (2007b) Recovering traceability links in software artefact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4):13
  15. De Lucia A, Oliveto R, Tortora G (2008) ADAMS re-trace: traceability link recovery via latent semantic indexing. In: Proceedings of 30th IEEE/ACM international conference on software engineering. ACM, Leipzig, pp 839–842
  16. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
  17. Devore JL, Farnum N (1999) Applied statistics for engineers and scientists. Brooks/Cole, Duxbury
  18. Di Penta M, Gradara S, Antoniol G (2002) Traceability recovery in RAD software systems. In: Proceedings of 10th international workshop on program comprehension. IEEE CS, Paris, pp 207–216
  19. Domges R, Pohl K (1998) Adapting traceability environments to project specific needs. Commun ACM 41(12):55–62
  20. Duan C, Cleland-Huang J (2007) Clustering support for automated tracing. In: Proceedings of 22nd IEEE/ACM international conference on automated software engineering. ACM, Atlanta, pp 244–253
  21. Dumais ST (1991) Improving the retrieval of information from external sources. Behav Res Meth Instrum Comput 23:229–236
  22. Dumais ST (1993) LSI meets TREC: a status report. In: Proceedings of the first text retrieval conference (TREC-1). NIST Special Publication, pp 137–152
  23. Gotel O, Finkelstein A (1994) An analysis of the requirements traceability problem. In: Proceedings of 1st international conference on requirements engineering. IEEE CS, Colorado Springs, pp 94–101
  24. Harman D (1992) Information retrieval: data structures and algorithms, chapter ranking algorithms. Prentice-Hall, Englewood Cliffs, pp 363–392
  25. Hayes JH, Dekhtyar A, Osborne J (2003) Improving requirements tracing via information retrieval. In: Proceedings of 11th IEEE international requirements engineering conference. IEEE CS, Monterey, pp 138–147
  26. Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: the study of methods. IEEE Trans Softw Eng 32(1):4–19
  27. Juristo N, Moreno A (2001) Basics of software engineering experimentation. Kluwer Academic, Dordrecht
  28. Leffingwell D (1997) Calculating your return on investment from more effective requirements management. Technical report, Rational Software Corporation
  29. Lin J, Lin CC, Cleland-Huang J, Settimi R, Amaya J, Bedford G, Berenbach B, Khadra OB, Duan C, Zou X (2006) Poirot: a distributed tool supporting enterprise-wide automated traceability. In: Proceedings of 14th IEEE international requirements engineering conference. IEEE CS, Minneapolis, pp 356–357
  30. Lormans M, van Deursen A (2006) Can LSI help reconstructing requirements traceability in design and test? In: Proceedings of 10th European conference on software maintenance and reengineering. IEEE CS, Bari, pp 45–54
  31. Lormans M, Gross H, van Deursen A, van Solingen R, Stehouwer A (2006) Monitoring requirements coverage using reconstructed views: an industrial case study. In: Proceedings of 13th working conference on reverse engineering. IEEE CS, Benevento, pp 275–284
  32. Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of 25th international conference on software engineering. IEEE CS, Portland, pp 125–135
  33. Marcus A, Xie X, Poshyvanyk D (2005) When and how to visualize traceability links? In: Proceedings of 3rd international workshop on traceability in emerging forms of software engineering. ACM, Long Beach, pp 56–61
  34. Oliveto R (2008) Traceability management meets information retrieval methods: strengths and limitations. PhD thesis, University of Salerno, March. www.sesa.dmi.unisa.it/thesis/oliveto.pdf
  35. Oppenheim AN (1992) Questionnaire design, interviewing and attitude measurement. Pinter, London
  36. Pfleeger SL, Menezes W (2000) Marketing technology to software practitioners. IEEE Softw 17(1):27–33
  37. Pinhero FAC, Goguen JA (1996) An object-oriented tool for tracing requirements. IEEE Softw 13(2):52–64
  38. Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
  39. Ricca F, Di Penta M, Torchiano M, Tonella P, Ceccato M (2007) The role of experience and ability in comprehension tasks supported by UML stereotypes. In: Proceedings of 29th international conference on software engineering. IEEE CS, Minneapolis, pp 375–384
  40. Settimi R, Cleland-Huang J, Ben Khadra O, Mody J, Lukasik W, De Palma C (2004) Supporting software evolution through dynamically retrieving traces to UML artifacts. In: Proceedings of 7th IEEE international workshop on principles of software evolution. IEEE CS, Kyoto, pp 49–54
  41. Wohlin C, Runeson P, Host M, Ohlsson MC, Regnell B, Wesslen A (2000) Experimentation in software engineering: an introduction. Kluwer, Deventer
  42. Yadla S, Huffman Hayes J, Dekhtyar A (2005) Tracing requirements to defect reports: an application of information retrieval techniques. Innov Syst Softw Eng NASA J 1(2):116–124
  43. Zou X, Settimi R, Cleland-Huang J (2007) Term-based enhancement factors for improving automated requirement trace retrieval. In: Proceedings of international symposium on grand challenges in traceability. ACM, Lexington, pp 40–45

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Andrea De Lucia (1)
  • Rocco Oliveto (1)
  • Genoveffa Tortora (1)

  1. Department of Mathematics and Informatics, University of Salerno, Fisciano (SA), Italy
