Introduction

The incremental and iterative software life cycle is based on the idea of developing an initial system implementation and evolving it through several releases in a cyclic way [1]. Release planning (RP) addresses all decisions related to the selection of requirements and their assignment to a consecutive sequence of releases [2]. As stated by Ruhe and Saliu [3], good RP practices ensure that the software is built to provide maximum business value by offering the best possible blend of features in the right sequence of releases. On the other hand, poor RP decisions can result in the following: (i) unsatisfied customers; (ii) release plans that are unlikely to be delivered within given schedule, quality, and effort specifications; and (iii) release plans that do not offer the best business value.

Given the cognitive effort involved in dealing with RP, defining a “suitable” set of releases is inherently challenging. By “suitable,” we mean one that properly deals with variables that have complex relations, such as stakeholders’ preferences, technical constraints, limited resources, and subjective business aspects. In addition, this process can be time-consuming, requiring one to analyze an exhaustive list of possible combinations, which tends to be extremely large when the number of requirements grows. A number of existing approaches are based on the premise that RP can be formalized as an optimization problem, a perspective widely explored by search-based software engineering (SBSE). In summary, SBSE proposes applying search techniques to solve complex software engineering problems [4].

However, RP is a wicked problem with a large focus on human intuition, communication, and human capabilities. For a wicked problem, it is not clear what the problem is and, therefore, what the solution is [3]. Additionally, turning the decision maker’s (DM) feelings into a useful part of the resolution process may help avoid resistance to, or low confidence in, the final result [5]. Instead of just providing a simple weight factor for each requirement, for example, we emphasize the importance of providing a refined mechanism to efficiently capture human preferences and, consequently, guide the search process. This mechanism must intuitively enable the DM to express his/her preferences in a broad scope, focusing time on essential subjective aspects. By subjective aspects, we refer to questions that are hard to define without human interaction, especially implicit information. For instance, the DM may want to allocate specific features to different releases or establish precedence or coupling relations between features according to his/her subjective knowledge.

In other words, we have to integrate computational intelligence with human expertise to obtain more realistic and acceptable solutions for some wicked problems. In general, two main benefits arise from this perspective: providing meaningful insights to the DM and increasing human engagement [6]. As discussed by Marculescu et al. [7], intuitive interaction with domain specialists is a key factor in industrial applicability, since it makes the system more usable and more easily accepted in an industrial setting. Despite the promising outlook, defining which types of preferences to capture and how optimization algorithms should exploit them remains a relevant challenge that has attracted attention in recent years.

Recently, SBSE approaches based on this assumption have been discussed in the requirements engineering context. Araújo et al. [8] propose an architecture for the next release problem (NRP) based on the usage of an interactive genetic algorithm alongside a machine learning model. In that work, the preferences are gathered through a subjective mark provided by the DM to each solution, while a machine learning model learns his/her evaluation profile. After a certain number of subjective evaluations, the machine learning model replaces the DM and evaluates the remainder of the solutions. Also considering the NRP, Ferreira et al. [9] propose an interactive ant colony optimization, where the DM is asked to specify which requirements he/she expects to be present or not in the next release. While these previous studies are focused on requirements selection, Tonella, Susi, and Palma [10] present an interactive approach to requirements prioritization. The information elicited from the user consists of pairwise comparisons between requirements that are ordered differently in equally scored prioritizations.

Regarding planning more than one release, an initial proposal of the present work was described by Dantas et al. [11]. It was designed as a single-objective model, which allows the DM to express different types of preferences concerning the requirements allocation. Such preferences are stored in a preference base, whose main purpose is to influence the search process. The performed experiment verified an unavoidable trade-off between the problem’s metrics (score and risk) and the subjective preference. This conflict occurs because the solutions sometimes have to lose some value in score or risk to satisfy the DM’s preferences.

Extending Dantas et al.’s [11] proposal, an earlier version of this work investigated treating the DM’s preferences as another objective to be maximized in a multi-objective model [12]. In addition, the proposed approach included a strategy called the reference point method [13] to mitigate the usual cognitive effort of selecting a solution from the Pareto front. Empirical results demonstrated the feasibility of the approach in an artificial environment.

Therefore, this paper significantly extends the previous work in two major aspects: (a) besides extending the automatic experiment with two more search techniques, one large artificial dataset, and additional results, we also conduct a participant-based experiment to observe the behavior of the approach in a real-world context and (b) a prototype tool was developed and made available to enable a novel way to incorporate human preferences during release planning. The primary contributions of this paper can be summarized as follows:

  • Experimental analyses considering both simulated and real human evaluations

  • The presentation of a prototype tool for the release planning process

The remainder of this paper is organized as follows. The “Background” section presents the approach background, whereas the “Mathematical formulation” section details the mathematical model. The “Empirical study” section discusses the empirical study, and finally, the “Conclusions” section presents some conclusions and directions for future works.

Background

Search-based software engineering

Software engineers often face problems associated with balancing competing constraints, trade-offs between concerns, and requirement imprecision. Software engineering is typically concerned with near-optimal solutions or those that fall within a specified acceptable tolerance [14]. In these situations, automated optimization techniques are natural candidates. Thus, search-based software engineering (SBSE) seeks to reformulate software engineering problems as search problems. A search problem is one in which optimal or near optimal solutions are sought in a search space of candidate solutions [4].

As highlighted by Harman [15], there are only two key ingredients for the application of search-based optimization to software engineering problems:

  • The choice of the problem representation amenable to symbolic manipulation

  • The definition of the fitness function to characterize what is considered to be a good solution

SBSE has been applied to many fields within the general area of software engineering, such as requirements engineering [16], software design [17], and testing [18]. A wide range of different optimization and search techniques have been used by SBSE, with evolutionary algorithms (EAs) being the most popular ones [19]. EAs are generic and stochastic population-based algorithms that are inspired by biological and natural evolution concepts, such as reproduction, mutation, recombination, and selection [20]. In this work, we evaluated four different EAs, namely NSGA-II [21], MOCell [22], IBEA [23], and SPEA-II [24].
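
To make the evolutionary template concrete, the sketch below shows a minimal generational EA loop in Python illustrating the selection, recombination, and mutation steps mentioned above. It is a generic, single-objective illustration only, not the implementation used in this work; the fitness function and all parameter values are placeholders.

import random

def evolve(fitness, genome_len, pop_size=20, generations=100,
           crossover_rate=0.9, mutation_rate=0.01):
    """Minimal generational EA over binary genomes (illustrative only)."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]                      # selection (truncation)
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(elite, 2)
            if random.random() < crossover_rate:             # recombination (1-point)
                cut = random.randrange(1, genome_len)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # mutation: flip each gene with a small probability
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Example: maximize the number of ones in a 30-bit string.
best = evolve(fitness=sum, genome_len=30)
print(sum(best), best)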

Modeling preferences for the search-based release planning

This work follows the concepts proposed by Dantas et al. [11] and includes the human’s preferences in the release planning using search-based techniques. As shown in Fig. 1, such an approach is composed of three components: interactions manager, preference base, and optimization process.

Fig. 1

Approach proposed by Dantas et al. [11]

The interactions manager is responsible for the user’s interactions such as adding, modifying, or removing preferences; visualizing the best solutions; and initializing or finalizing the search process. The preference base stores every preference to facilitate the acquisition of relevant information, for instance, the number of preferences in the base that are satisfied by a solution. Finally, the optimization process is responsible for providing a solution through the search technique guided by the preference base.

An important aspect to be highlighted is how the DM can express his/her preferences about the requirements allocation throughout the releases. The authors have formalized eight types of preferences with different purposes, respectively named as coupling joint, coupling disjoint, positioning precede, positioning follow, positioning before, positioning after, positioning in, and positioning no.

The next section presents further details about the DM’s preferences and mathematical formalization proposed by Dantas et al. [11].

Mathematical formulation

Consider that \(R=\{r_i \mid i=1,2,3,\dots,N\}\) is the set of requirements available to be allocated to a set of releases \(K=\{k_q \mid q=0,1,2,\dots,P\}\), where N and P are the number of requirements and releases, respectively. The vector \(S=\{x_1,x_2,\dots,x_N\}\) represents the solution, where \(x_i \in \{0,1,2,\dots,P\}\) stores the release \(k_q\) in which the requirement \(r_i\) is allocated, and \(x_i = 0\) means that such a requirement was not allocated. In addition, consider \(C=\{c_j \mid j=1,2,3,\dots,M\}\), where M is the number of clients and each client \(c_j\) has a weight \(w_j\) to estimate his/her importance to the company that develops the software. The function Value(i), which represents how valuable requirement \(r_i\) is, returns the weighted sum of the scores that each client \(c_j\) assigned to the requirement \(r_i\) as follows:

$$ \text{Value}(i)= \sum_{j=1}^{M}w_{j} \times \text{Score}(c_{j},r_{i}), $$
(1)

where Score(\(c_j, r_i\)) quantifies the perceived importance that a client \(c_j\) associates with a requirement \(r_i\), assigning a value ranging from 0 (no importance) to 10 (the highest importance). Thus, the value of the objective related to the overall client satisfaction is given by

$$ \text{Satisfaction}(S)= \sum_{i=1}^{N}(P-x_{i}+1) \times \text{Value}(i) \times y_{i}, $$
(2)

where \(y_i \in \{0,1\}\) is a decision variable that has a value of 1 if the requirement \(r_i\) is allocated to some release and 0 otherwise. This binary variable is necessary to avoid a requirement \(r_i\) being counted when it is not allocated. As suggested by Baker et al. [25], the clients are usually satisfied when the requirements they most prefer are implemented. Therefore, the factor \((P - x_i + 1)\) makes Satisfaction(S) higher when the requirements with a high Value(i) are allocated to the first releases, i.e., maximizing the overall clients’ satisfaction.
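
Equations 1 and 2 translate directly into code. The sketch below is a minimal illustration (not the authors’ implementation), assuming the scores are given as a clients × requirements matrix and that release indices follow the encoding defined above.

def value(i, weights, scores):
    """Eq. 1: weighted sum of the clients' scores for requirement i."""
    return sum(w * scores[j][i] for j, w in enumerate(weights))

def satisfaction(solution, weights, scores, num_releases):
    """Eq. 2: requirements with high value allocated early increase satisfaction."""
    total = 0.0
    for i, x_i in enumerate(solution):
        if x_i != 0:  # y_i = 1 only for allocated requirements
            total += (num_releases - x_i + 1) * value(i, weights, scores)
    return total

# Toy example: 3 requirements, 2 clients, 2 releases.
weights = [0.7, 0.3]                      # client importance w_j
scores = [[10, 3, 5],                     # Score(c_1, r_i)
          [2, 8, 6]]                      # Score(c_2, r_i)
print(satisfaction([1, 2, 0], weights, scores, num_releases=2))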

In addition to maximizing the clients’ satisfaction, another relevant aspect of the project is the minimization of the implementation risk of each requirement. We consider the “implementation risk” as the quantification of the danger to the project caused by the possible postponement of the implementation of a given requirement. The bigger the risk, the bigger the probability of project failure. Identifying and dealing with risks early in development lessens long-term costs and helps prevent software disasters [26]. According to Brasil et al. [27], the risk may be defined in terms of the impact analysis for the client’s business and its probability of occurrence. As seen in Table 1, each value assumes a range from 1 to 9 considering the level of impact (low, medium, and high) and the level of probability of occurrence (low, medium, and high). For example, consider a requirement \(r_1\) having a high negative impact on the business with a low chance of happening. This requirement will present a risk value of 7, whereas a requirement \(r_2\) presents a risk value of 9 because it has a high negative impact with a maximum chance of happening.

Table 1 Impact analysis versus probability of occurrence

Thus, consider \(D=\{d_1,d_2,\dots,d_N\}\), where each \(d_i\) is the risk value associated with the requirement \(r_i\). The value of the objective related to the risk of a solution is defined as follows:

$$ \text{Risk}(S) = \sum_{i=1}^{N}x_{i} \times d_{i}, $$
(3)

where the value of Risk(S) is smaller when the requirements with the highest risk are allocated to the first releases and, consequently, the overall project risk is minimized.
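
Analogously, a minimal sketch of Eq. 3 (again an illustration only, with made-up risk values taken from Table 1’s 1–9 range) is:

def risk(solution, risks):
    """Eq. 3: postponing high-risk requirements (large x_i times large d_i) increases Risk(S)."""
    return sum(x_i * d_i for x_i, d_i in zip(solution, risks))

# Same toy solution as before, with hypothetical risk values d_i.
print(risk([1, 2, 0], risks=[9, 4, 7]))   # 9*1 + 4*2 + 0*7 = 17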

We assume that there are relations among releases and requirements which may be defined by the DM according to his/her subjective solution analysis. Consequently, he/she may obtain some insights about the problem and even adapt the decision criteria. As formalized by Dantas et al. [11], these relations may be expressed by the DM through defining a set of preferences. The process of defining these preferences follows the formalization below:

  1. Coupling joint

    • Representation: coupling_joint(\(r_i, r_j\)).

    • Parameters: Requirements \(r_i \in R\) and \(r_j \in R\).

    • Interpretation: It is used to express that a DM wishes a requirement \(r_i\) to be placed together with a requirement \(r_j\).

    • Formal interpretation: It is satisfied if, and only if, \(x_i = x_j\).

  2. Coupling disjoint

    • Representation: coupling_disjoint(\(r_i, r_j\)).

    • Parameters: Requirements \(r_i \in R\) and \(r_j \in R\).

    • Interpretation: It allows a DM to allocate \(r_i\) and \(r_j\) to different releases.

    • Formal interpretation: It is satisfied if, and only if, \(x_i \neq x_j\).

  3. Positioning precede

    • Representation: positioning_precede(\(r_i, r_j\), [dist]).

    • Parameters: Requirements \(r_i \in R\), \(r_j \in R\), and a minimum distance dist between the requirements, with a value always greater than zero.

    • Interpretation: It enables a DM to specify that a requirement \(r_i\) must be positioned at least dist releases before a requirement \(r_j\).

    • Formal interpretation: It is satisfied if at least one of the following conditions is fulfilled:

      (\(x_i, x_j \neq 0\) and \(x_j - x_i \geq \text{dist}\)) OR

      (\(x_i \neq 0\) and \(x_j = 0\)).

  4. Positioning follow

    • Representation: positioning_follow(\(r_i, r_j\), [dist]).

    • Parameters: Requirements \(r_i \in R\), \(r_j \in R\), and a minimum distance dist between the requirements, with a value always greater than zero.

    • Interpretation: It expresses that a requirement \(r_i\) must be positioned at least dist releases after another requirement \(r_j\).

    • Formal interpretation: It is satisfied if at least one of the following conditions is met:

      (\(x_i, x_j \neq 0\) and \(x_i - x_j \geq \text{dist}\)) OR

      (\(x_i = 0\) and \(x_j \neq 0\)).

  5. Positioning before

    • Representation: positioning_before(\(r_i, k_q\)).

    • Parameters: Requirement \(r_i \in R\) and a release \(k_q\).

    • Interpretation: It defines that a requirement \(r_i\) may be assigned to any release before a specific release \(k_q\).

    • Formal interpretation: It is satisfied when the following conditions are met:

      (\(x_i \neq 0\)) AND (\(k_q - x_i \geq 1\)).

  6. Positioning after

    • Representation: positioning_after(\(r_i, k_q\)).

    • Parameters: Requirement \(r_i \in R\) and a release \(k_q\).

    • Interpretation: It defines that a requirement \(r_i\) may be assigned to any release after a specific release \(k_q\).

    • Formal interpretation: It is satisfied when the following conditions are met:

      (\(x_i \neq 0\)) AND (\(x_i - k_q \geq 1\)).

  7. Positioning in

    • Representation: positioning_in(\(r_i, k_q\)).

    • Parameters: Requirement \(r_i \in R\) and a release \(k_q\).

    • Interpretation: It allows the DM to place a requirement \(r_i\) in a specific release \(k_q\).

    • Formal interpretation: It is satisfied if, and only if, \(x_i = k_q\).

  8. Positioning no

    • Representation: positioning_no(\(r_i, k_q\)).

    • Parameters: Requirement \(r_i \in R\) and a release \(k_q\).

    • Interpretation: It defines that a requirement \(r_i\) should not be assigned to the release \(k_q\).

    • Formal interpretation: It is satisfied if, and only if, \(x_i \neq k_q\).

Therefore, the main contribution of this model is the inclusion of the DM’s preferences as one of the objectives to be optimized. Consider that \(T=\{t_1,t_2,\dots,t_Z\}\) is the set that represents the preference base, where Z is the number of preferences. Each preference \(t_k\) is a pair composed of a preference, based on one of the types previously presented, and an importance level, ranging from 1 to 10, which represents how valuable the preference is to the DM and distinguishes the preferences in terms of relevance. For instance, \(t_1 = \langle \text{positioning\_before}(1,2), 8 \rangle\) denotes that the DM wishes requirement 1 to be positioned before release 2, with an importance level of 8. Therefore, the value of the objective related to the subjective preferences is measured as follows:

$$ {\text{Pref}(S,T) =} \left\{\begin{array}{ll} \left(\frac{{\sum\nolimits}_{i=1}^{Z} L_{i} \times \text{satisfy}(S,t_{i})}{{\sum\nolimits}_{i=1}^{Z}L_{i}} \right) & if\ T \neq \emptyset\\ 0, & \text{otherwise,} \end{array}\right. $$
(4)

where \(L_i\) models the importance level defined by the DM for each respective preference \(t_i\). The function satisfy(\(S, t_i\)) returns 1 if the solution S satisfies the preference \(t_i\) and 0 otherwise. The objective Pref(S,T) is the fraction of the importance levels of the satisfied preferences over the total importance levels of all preferences in the preference base. Thus, this metric measures how well the solution S satisfies the user’s preferences.
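
A minimal sketch of Eq. 4, covering three of the eight preference types as an illustration (the remaining checks follow the formal interpretations listed above; this is not the authors’ code, and 0-based requirement indices are assumed):

def satisfy(solution, pref):
    """Return 1 if the solution satisfies the preference, 0 otherwise (subset of types)."""
    kind, args = pref
    if kind == "coupling_joint":              # satisfied when x_i == x_j
        i, j = args
        return int(solution[i] == solution[j])
    if kind == "positioning_in":              # satisfied when x_i == k_q
        i, k = args
        return int(solution[i] == k)
    if kind == "positioning_precede":         # r_i at least dist releases before r_j
        i, j, dist = args
        x_i, x_j = solution[i], solution[j]
        return int((x_i != 0 and x_j != 0 and x_j - x_i >= dist) or (x_i != 0 and x_j == 0))
    raise ValueError(f"unsupported preference type: {kind}")

def pref_objective(solution, base):
    """Eq. 4: importance-weighted fraction of satisfied preferences."""
    if not base:
        return 0.0
    total = sum(level for _, level in base)
    satisfied = sum(level * satisfy(solution, p) for p, level in base)
    return satisfied / total

# Preference base: (preference, importance level) pairs.
base = [(("coupling_joint", (0, 1)), 5),
        (("positioning_precede", (0, 2, 1)), 8),
        (("positioning_in", (2, 2)), 3)]
print(pref_objective([1, 1, 2], base))        # all three satisfied -> 1.0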

It is important to highlight that the previous modeling does not define constraints capable of invalidating solutions that do not satisfy the DM’s preferences; instead, it provides soft constraints that guide the search process toward regions of the solution space that he/she prefers.

Regarding the hard constraints, that is, the ones that delimit feasible solutions, we have considered three types, which are described below. First, we considered the technical interdependence relations between the requirements, which are known a priori from the requirements specification document. The constraint Precedence(S) deals with precedence and coupling relations between the requirements through a binary matrix \(DEP_{N \times N}\), as follows:

$$ x_{i} \geq x_{j}, \forall i,j | DEP_{ij} = 1, $$
(5)

where \(DEP_{ij} = 1\) if the requirement \(r_i\) depends on the requirement \(r_j\), and 0 otherwise; when \(DEP_{ij} = DEP_{ji} = 1\), the requirements \(r_i\) and \(r_j\) must be implemented in the same release. The remainder of the matrix is filled with 0, indicating that there is no relation between the requirements and no technical limitation on their position assignment.

Furthermore, the Budget(S) constraint treats the resources available for each release. Thus, considering that each requirement \(r_i\) has a cost \(\text{cost}_i\) and each release \(k_j\) has a budget value \(b_j\), this constraint guarantees that the sum of the costs of all requirements allocated to each release does not exceed the corresponding budget:

$$\begin{array}{*{20}l} \sum\limits_{i=1}^{N}\text{inRelease}(x_{i},j) \times \text{cost}_{i} < b_{j},\\ \forall j \in \{1,2,3,\ldots,P \}, \\ \text{inRelease}(x_{i},j) = \left\{\begin{array}{ll} 1, & \text{if}\ x_{i} = j\\ 0, & \text{otherwise.} \end{array}\right. \end{array} $$

Finally, the constraint ReqForRelease(S) guarantees that each release j has at least one allocated requirement. This constraint is described below:

$$ \sum_{i=1}^{N} \text{inRelease}(x_{i},j) > 0, \quad \forall j \in \{1,2,3,\ldots,P\}. $$
(6)
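
A feasibility check combining the three hard constraints could be sketched as follows (illustrative only; DEP, costs, and budgets are toy values, and the precedence rule follows Eq. 5 literally):

def feasible(solution, dep, costs, budgets):
    """Check Precedence(S), Budget(S), and ReqForRelease(S) for a candidate plan."""
    n, p = len(solution), len(budgets)
    # Precedence(S): if r_i depends on r_j, then x_i >= x_j (Eq. 5).
    for i in range(n):
        for j in range(n):
            if dep[i][j] == 1 and solution[i] < solution[j]:
                return False
    # Budget(S): the cost allocated to each release must stay below its budget.
    for release in range(1, p + 1):
        spent = sum(costs[i] for i in range(n) if solution[i] == release)
        if spent >= budgets[release - 1]:
            return False
    # ReqForRelease(S): every release receives at least one requirement (Eq. 6).
    for release in range(1, p + 1):
        if not any(x == release for x in solution):
            return False
    return True

dep = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]        # requirement index 1 depends on index 0 (0-based)
print(feasible([1, 2, 2], dep, costs=[4, 3, 2], budgets=[6, 6]))   # True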

Therefore, our multi-objective formulation of the release planning consists of:

$$ \begin{aligned} & \text{maximize} & & {\text{Satisfaction}}(S), \\ & \text{maximize} & & {\text{Pref}}(S,T), \\ & \text{minimize} & & \text{Risk}(S), \\ & \text{subject to:} & &1\text{)} \text{Precedence}(S), \\ & & & 2\text{)} \text{ReqForRelease}(S), \\ & & & 3\text{)} \text{Budget}(S).\\ \end{aligned} $$
(7)

Reference point method

A usual and challenging task associated with multi-objective problems is deciding which solution from the Pareto front will be chosen. The Pareto front is a set composed of the non-dominated solutions that represent the best trade-offs among the objectives to be optimized [28]. Consequently, requesting the DM to analyze and choose a specific solution from this set can induce an excessive additional cognitive effort. If there are more than two criteria in the problem, it may be difficult for the DM to analyze the large amount of information [29]. The first version of this work [12] proposed the use of the reference point method [13]. This method enables the DM to assign different “aspiration levels” to each objective based on his/her preferences. These weights help the multi-objective technique to achieve solutions from the Pareto front suitable to the DM’s needs.

Given G objectives, Eq. 8 is used to normalize the values reached by the solution S for each objective i in the range between 0 and 1.

$$ \begin{aligned} \Delta f_{i}(S) = \frac{f_{i}(S)-o_{i}^{*}}{o_{i}^{\text{nad}} - o_{i}^{*}},\\ \end{aligned} $$
(8)

where the vectors \(O^{\text {nad}} = \left \{o_{1}^{\text {nad}},\ldots,o_{G}^{\text {nad}}\right \}\) and \(O^{*} = \left \{o_{1}^{*},\ldots,o_{G}^{*}\right \}\) express, respectively, the highest and lowest values reached by the Pareto front and f i (S) represents the fitness function value for each objective i.

The normalization in Eq. 8 is performed only within the reference point method, which is used solely when selecting a solution from the Pareto front; the solutions themselves can be properly inspected through the visualization interface (Fig. 4). We highlight that the normalization maps values to the interval [0,1], where a \(\Delta f_i\) close to 0 indicates that the solution S is close to the best value achieved for objective i.

The DM defines the aspiration level \(a_i\) for each objective i. In our case, there are three objectives (Satisfaction, Pref, and Risk). The aspiration level is a weight that the DM specifies for each objective in order to subjectively differentiate each one. Supposing that the DM has 100 points available to distribute, he/she must decide how to allocate these points among the \(a_i\) according to their importance. The \(a_i\) values are used in the function MaxValue(S), which can be defined as

$$ \begin{aligned} \text{MaxValue}(S) = \max_{i=1,\ldots,G} \Delta f_{i}(S) \times \Delta q_{i},\\ \end{aligned} $$
(9)

where \(\Delta q_i = a_i/100\). MaxValue(S) generates a balance between \(\Delta f_i(S)\) and the DM’s opinion. Considering that the DM’s opinion is represented by \(\Delta q_i\) and acts inversely to \(\Delta f_i(S)\), a solution that fulfills the aspiration levels generates a low MaxValue. On the other hand, when the solution S does not satisfy the DM in one specific objective, it will have a \(\Delta f_i\) close to 1, which, multiplied by \(\Delta q_i\), generates a high MaxValue. Recall that a high MaxValue implies that the aspiration level of some objective was not properly fulfilled.

As a hypothetical release planning scenario, consider that the DM has to distribute 100 points among the three objectives (Satisfaction(S), Pref(S,T), and Risk(S)). He/she assigned 34 points to the aspiration level \(a_1\) and 33 points each to \(a_2\) and \(a_3\). The multi-objective algorithm generates a Pareto front with E solutions, and a vector is created with E positions, where each position represents a solution from the Pareto front and is associated with the respective MaxValue defined by Eq. 9. Consequently, the solution that has the lowest MaxValue will be considered the one that best meets the aspiration levels initially expressed by the DM. Finally, Eq. 10, also known as the scalarizing function, represents the process of selecting a solution from the Pareto front:

$$ \begin{aligned} & \text{minimize} & & \text{MaxValue}(S),\\ & \text{subject to:} & & S \in E. \\ \end{aligned} $$
(10)
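
Putting Eqs. 8–10 together, selecting a solution from a Pareto front with the reference point method could be sketched as follows. This is an illustration only, assuming, as in Eq. 8, that for each objective the lowest value on the front is the best one (maximization objectives can be negated beforehand); the front values below are made up.

def pick_by_reference_point(front, aspiration):
    """Return the solution with the smallest MaxValue (Eqs. 8-10)."""
    num_obj = len(aspiration)
    q = [a / 100.0 for a in aspiration]                        # Delta q_i = a_i / 100
    best = [min(s[i] for s in front) for i in range(num_obj)]  # o_i^*
    worst = [max(s[i] for s in front) for i in range(num_obj)] # o_i^nad
    def max_value(s):
        deltas = [(s[i] - best[i]) / (worst[i] - best[i]) if worst[i] != best[i] else 0.0
                  for i in range(num_obj)]                     # Eq. 8
        return max(d * w for d, w in zip(deltas, q))           # Eq. 9
    return min(front, key=max_value)                           # Eq. 10

# Hypothetical front of (Satisfaction, Pref, Risk) values, with the two
# maximization objectives negated so that lower is better for all of them.
front = [(-120, -0.9, 40), (-150, -0.4, 55), (-100, -1.0, 30)]
print(pick_by_reference_point(front, aspiration=[34, 33, 33]))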

Empirical study

The following sections present all of the details regarding the empirical study in which we followed some empirical software engineering guidelines, such as data collection procedure and quantitative results presentation [30, 31]. First, the experimental design specifications as well as research questions are presented. Then, the analysis and discussion of the achieved results are explained. Finally, the threats that may affect the validity of the experiments are emphasized.

Experimental design

The empirical study was divided into two different experiments, (a) automatic experiment and (b) participant-based experiment. Essentially, the first one aims to analyze the approach using different search-based algorithms, over artificial and real-world instances, while the second one aims to evaluate the use of the proposal in a real scenario composed of human evaluations.

Automatic experiment

Three instances were used in this experiment, named dataset-1, dataset-2, and dataset-3. Dataset-1 and dataset-2 are based on real-world projects extracted from [32]. The first is based on word processor software and is composed of 50 requirements and 4 clients. The second is based on a decision support system and has 25 requirements and 9 clients. Due to the limited size of these instances, we artificially generated dataset-3 with 600 requirements and 5 clients. After preliminary experiments, we defined the budget for each dataset as 60% of the maximum release cost.

In addition, we evaluated two scenarios to analyze the impact of different numbers of preferences on the optimization process. In the first one, called the LowPrefs scenario, we randomly generated 10, 5, and 120 preferences for dataset-1, dataset-2, and dataset-3, respectively. For the HighPrefs scenario, we generated 50, 25, and 600 preferences, which is equivalent to the number of requirements of each corresponding dataset.

Regarding the optimization techniques, we evaluated four of the most used evolutionary algorithms in the literature (NSGA-II, MOCell, IBEA, and SPEA-II) and a random search as a sanity check. All parameters were empirically obtained through preliminary tests and configured identically for all evolutionary techniques: 256 individuals, 400 generations, a crossover rate of 90%, and a 1% mutation rate. The Pareto front returned by the random search was generated after 102,400 solution evaluations. As suggested by Arcuri and Briand [33], we also executed each technique 30 times to deal with the stochastic nature of the meta-heuristics, collecting the quality metrics and their respective averages from the obtained results.

We used an off-line procedure presented by Zhang [34] to generate a reference Pareto front (PFref), since the true (optimal) Pareto front (PFtrue) is unknown to the evaluated datasets. Consisting of the best solutions of each technique, the PFref denotes the best available approximation to the PFtrue. For each instance and each scenario, we executed each evolutionary technique 30 times considering 256 individuals and 1200 generations. Thus, we reached almost 9,216,000 solutions evaluated by each evolutionary algorithm, as well as another 9,216,000 solutions evaluated by the random search, achieving more than 46,000,000 evaluations. Finally, we considered PFref the best non-dominated solutions generated by all search techniques for each instance and each scenario.
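
The PFref construction boils down to keeping the non-dominated solutions from the union of all fronts. A minimal sketch of that filtering step is given below (illustrative only, not the off-line procedure of [34] itself), assuming all objectives are expressed as minimization.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def reference_front(fronts):
    """Union of all fronts, filtered down to its non-dominated solutions."""
    pool = {s for front in fronts for s in front}   # dedupe identical objective vectors
    return [s for s in pool if not any(dominates(o, s) for o in pool if o != s)]

# Toy objective vectors (all minimized) coming from two different algorithms.
print(reference_front([[(1, 5), (3, 3)], [(2, 4), (4, 4), (1, 5)]]))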

The quality metrics collected and analyzed in this experiment were the hypervolume, spread, and generational distance. The hypervolume (HV) calculates the region of the objective space dominated by the Pareto front, bounded by a distant point W [35]. Such a point is the worst for all objectives when compared to the solutions of all Pareto fronts being evaluated.

$$ \text{HV = volume} \left(\bigcup_{i=1}^{E} v_{i} \right), $$
(11)

where E is the set of solutions from the Pareto front to be evaluated and \(v_i\) is the hypercube formed between each solution \(s_i\) and the far point W dominated by all solutions. Thus, the volume function calculates the volume occupied in the objective space by the union of all the hypercubes \(v_i\). In summary, HV reflects the convergence and dispersion of the solutions regarding the PFref. Thus, the higher the value of this metric, the closer the known Pareto front is to the PFref.

Spread (SP) denotes the diversity of the known Pareto front. The closer to 0 this value is, the more uniformly distributed the set of non-dominated solutions of the known Pareto front is.

$$ \text{SP} = \frac{\sum_{g=1}^{G}h_{g}^{e} + \sum_{i=1}^{|E|}|h_{i} - \overline{h}|}{\sum_{g=1}^{G}h_{g}^{e} + |E|\,\overline{h}}, $$
(12)

where G indicates the number of objectives of the problem, \(h_i\) can be any distance measure between neighboring solutions, and \(\overline{h}\) is the mean value of these distance measures. \(h_{g}^{e}\) is the distance between the extreme solutions of PFref and E corresponding to the gth objective function.

Finally, the generational distance (GD) measures the distance between the known Pareto front obtained by the optimization technique and the PFref.

$$ \text{GD} = \frac{\sqrt{\sum_{i=1}^{|E|}\text{euc}_{i}^{2}}}{|E|}, $$
(13)

where \(\text{euc}_i\) is the smallest Euclidean distance from a solution \(i \in E\) to a solution from PFref.
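
As an illustration of Eq. 13 (a direct transcription, not the measurement code used in the experiments, and with made-up fronts), GD can be computed as:

import math

def generational_distance(front, reference):
    """Eq. 13: root of the summed squared nearest distances to PFref, averaged over |E|."""
    def nearest(s):
        return min(math.dist(s, r) for r in reference)
    return math.sqrt(sum(nearest(s) ** 2 for s in front)) / len(front)

# Toy 2-objective fronts.
print(generational_distance(front=[(1.0, 2.0), (2.0, 1.5)],
                            reference=[(0.5, 2.0), (2.0, 1.0)]))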

Participant-based experiment

As previously mentioned, this experiment aims at observing the behavior and feasibility of the approach when it is used by software engineering practitioners. We chose NSGA-II from the automatic experiment due to its ability to generate solutions with good diversity, and it was employed with the same configurations used in the automatic experiment. To test our approach, we invited 10 participants to act as decision makers (DMs). First, a questionnaire with four simple questions was conducted to identify the general profile of each participant:

  • Q 1: What is your current professional occupation?

  • Q 2: How much experience do you have in the IT area?

  • Q 3: On a scale of low, medium, and high, how would you rate your experience as a Software Developer?

  • Q 4: On a scale of low, medium and high, how would you rate your experience with release planning?

From Table 2, all participants worked as System Analysts or Developers. The participants had between 1 and 21 years of experience in the software industry, resulting in a total of 71 years and an average of 7.1 years of experience. In relation to IT experience, 50% of the participants selected “High” and no one selected “Low.” Regarding experience with the release planning process, 30% of them assigned “High,” while 60% assigned “Medium,” and only one assigned “Low.” Consequently, we may assume that these results suggest confidence in the evaluations and feedback provided by the participants.

Table 2 Questionnaire answers from each participant

The participant-based experiment consists of four major stages. In the first stage, named “Context Guidelines,” each participant was briefed about the task and the scenario to be analyzed. Initially, we asked the participants (i) to perform the requirements engineer’s role in a company whose software to be developed is a word processor (described in dataset-1), and (ii) we explained all details regarding the use of our tool and how their preferences could be expressed. Subsequently, we presented a simple requirements specification document about dataset-1 (see Fig. 2), including all requirement descriptions, the budget constraint, the weight values given by the clients to each requirement, and, finally, the relevance of each client to the company.

Fig. 2

A sample piece of the requirements specification document

After concluding the general explanations, we carried out the second stage (“Non-preferences”), which presents a solution without any preferences previously defined by the DM about the requirement allocation, i.e., considering just the Value and Risk objectives in the optimization process. Additionally, we asked the DM to weigh the objectives used in the reference point method to suggest a solution from the Pareto front. As illustrated in Fig. 3, a participant can adjust this weight configuration with a slider as he/she likes, see the requirements allocation throughout the releases, and click “Optimize” to see other solutions or “Stop” when he/she is satisfied with the release plan. We also offer the opportunity to visualize the neighbor solutions of the one suggested by the reference point method. To obtain such a view, depicted in Fig. 4, the DM just has to click on “View” on the top menu.

Fig. 3

GUI used in the “Non-preferences” stage of participant-based experiment

Fig. 4

Interface for visualizing solutions from the Pareto front

After deciding which solution is suitable to his/her needs, the DM initiates the third stage, called “Preferences Set.” In this phase, we included the DM’s preferences as an objective to be optimized. As Fig. 5 shows, the participants find information about the preference base on the right side of the window, including how to manage their preferences and how to check which ones were satisfied by the suggested solution, considering the weight configuration. Figure 6 exemplifies and shows the specifications required by the tool for the DM to express his/her preferences, as well as the importance level of each preference. Similar to the previous stage, the participant continuously interacts with the system until a solution is considered satisfactory. However, the main difference concerns the possibility for the DM to insert and manipulate his/her preferences.

Fig. 5

GUI used in the “Preferences Set” stage of participant-based experiment

Fig. 6

Example of preferences defined by the DM. a Coupling Joint. b Positioning Before. c Positioning Follow

Lastly, in the fourth stage (“Feedbacks”), we obtained feedback about how convinced the participants were of the subjective satisfaction provided by the solutions selected in stages 2 and 3, respectively called the non-preference-based and preference-based solutions. As seen in Fig. 7, the participant simply rates each solution on a scale of “Very ineffective,” “Ineffective,” “Indifferent,” “Effective,” and “Very effective.”

Fig. 7

GUI presented to the DM in the fourth stage, i.e., “Feedbacks”

Research questions

Three research questions were defined to assess and analyze the behavior of our approach. They are presented as follows:

  • RQ 1: Which search-based techniques, among the evaluated ones, produce better solutions?

  • RQ 2: What is the subjective benefit when considering the DM’s preferences as an objective to be optimized?

  • RQ 3: What is the relation between the inclusion of the preferences and the DM’s subjective evaluation of the final solution?

Results and analysis

The results of the empirical study are presented in this section by analyzing the previous three research questions.

  • RQ 1: Which search-based techniques, among the evaluated ones, produce better solutions?

Aiming to answer this question, we analyzed the results produced by all algorithms for each instance and scenario previously presented in the automatic experiment design.

Table 3 presents the values of statistical tests considering the metrics hypervolume (HV), spread (SP), and generational distance (GD) for each dataset. The Wilcoxon test (WC) was applied, using the Bonferroni correction, to identify the occurrence of statistical differences among the samples considering a confidence level of 95%. The Vargha-Delaney Â12 test was used to measure the effect size, in other words, the relative number of times that one algorithm produced higher values than another. Further details about each statistical test may be seen in the work of Arcuri and Briand [33]. Â12 measures the probability that a technique (table row) with a particular parameter setting yields a higher result than another technique (table column). For instance, considering MOCell (1) and IBEA (2), the probability of MOCell returning higher GD values than IBEA for dataset-1 in HighPrefs is 82.7%, given that the corresponding Â12 is 0.827.
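
For reference, the two statistics could be computed as in the sketch below, assuming SciPy is available and using the rank-sum variant of the Wilcoxon test for the independent 30-run samples; the data are made up and this is not the authors’ analysis script.

from scipy.stats import ranksums

def a12(xs, ys):
    """Vargha-Delaney effect size: probability that a value from xs exceeds one from ys."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

# Hypothetical GD samples from runs of two algorithms (lower GD is better).
gd_mocell = [0.012, 0.015, 0.011, 0.014, 0.013, 0.016]
gd_ibea = [0.008, 0.009, 0.007, 0.010, 0.009, 0.008]
stat, p_value = ranksums(gd_mocell, gd_ibea)
print(f"p = {p_value:.4f}, A12 = {a12(gd_mocell, gd_ibea):.3f}")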

Table 3 Wilcoxon and Vargha-Delaney statistical values from all search algorithms, considering all metrics on each scenario

Observing the data in Table 3, it is noticeable that there was a statistical difference in most of the comparisons; only in 14.4% of the cases was there no statistical difference (values in italics).

Before analyzing the results obtained for the GD metric, it is important to highlight that a value close to 0 is desirable because it indicates that the Pareto front is closer to the PFref. Thus, looking at dataset-1 in the LowPrefs scenario, we can note that IBEA achieved better GD results, since Â12 is 1 when the other algorithms are compared with IBEA. This means that 100% of the time, the other techniques produced higher GD values than IBEA. Such behavior is verified in almost all of the scenarios. However, in dataset-1, when the number of preferences is high, IBEA lost to NSGA-II. Therefore, in general, IBEA outperforms all the other algorithms in terms of GD, and NSGA-II is the second best search technique among the evaluated ones.

As noted in the “Automatic experiment” section, the SP metric measures the dispersion of the Pareto front solutions. In practice, lower values for this metric indicate more uniformly distributed solutions along the front. Thus, observing the statistical test results for SP, we can see that IBEA achieves the worst results 100% of the time in comparison with all the other techniques for each scenario of dataset-1 and dataset-2. This observation indicates that, even though the solutions returned by IBEA are close to the reference Pareto front, they are not well distributed in the search space. On the other hand, SPEA-II obtained the best SP results for all instances, although no statistical difference was verified in dataset-1 and dataset-2 with low preferences when compared with MOCell.

Because it captures both approximation to the PFref (convergence) and diversity, the hypervolume (HV) is essential for evaluating multi-objective algorithms. The closer the HV value is to 1, the better the result. Thus, analyzing the results of this metric for dataset-1 and dataset-2, which are based on real data, NSGA-II shows a better performance in both datasets. In almost all of the scenarios, NSGA-II and SPEA-II achieved the highest HV values. Only in the dataset-1 LowPrefs scenario did MOCell reach higher HV values than SPEA-II, with no statistical difference from NSGA-II, while for dataset-3 in both the LowPrefs and HighPrefs scenarios, IBEA outperformed all algorithms in more than 90% of the runs.

Figure 8 shows a comparison between the best Pareto fronts obtained by each search technique for dataset-3 in the HighPrefs scenario, taking into account the hypervolume metric over 30 runs.

Fig. 8

The best Pareto fronts produced by each algorithm for dataset-3 and the HighPrefs scenario over 30 runs, considering only the HV metric

Notice in Fig. 8 that all of the algorithms present different distributions of solutions. Regarding the HV metric, IBEA returns solutions that are more concentrated and closer to the PFref. Among the evaluated algorithms, MOCell returns solutions that are nearer to the extreme points of the PFref and thus has good diversity. In addition, in this case, although SPEA-II provides solutions with better convergence than NSGA-II and MOCell, its front does not have good dispersion and consequently presents poor diversity. Finally, it is evident that the random search was inferior to all of the evolutionary techniques investigated in this work.

Figure 9 shows the comparison of all the investigated search techniques considering the average calculated from each evaluated metric for all the scenarios and datasets. Because the metrics have different ranges of values, their results were normalized to the [0,1] interval.

Fig. 9

Comparison between all search techniques considering metrics HV, SP, and GD

We noticed that, on average, NSGA-II obtained the best HV results. However, the difference from the other evolutionary techniques was not large. Regarding SP, SPEA-II considerably outperformed all of the other algorithms, followed by the random search and MOCell. Finally, observing the GD results, the best algorithm was IBEA, followed by NSGA-II and SPEA-II.

Regarding the execution time of the meta-heuristics, NSGA-II and MOCell obtained the smallest execution times, with little difference between them. To determine which algorithm presents a better time performance, a statistical test was performed for these two algorithms, as shown in Table 4. NSGA-II achieved better time results for dataset-3 in both scenarios and for dataset-1 in the HighPrefs scenario, while MOCell was superior for dataset-2 in both scenarios and for dataset-1 in the LowPrefs scenario. However, even though NSGA-II is better in some scenarios and MOCell in others, the magnitude of the time differences between these algorithms was small. The greatest difference among all the meta-heuristics for all instances is 151,344 ms. Thus, due to this magnitude, we considered the execution time irrelevant in comparison with the Pareto front quality metrics.

Table 4 Wilcoxon and Vargha-Delaney statistical values from NSGA-II and MOCell, considering the execution times

In summary, IBEA has proven to be a good choice because it presents high-quality solutions for all the datasets, as demonstrated by GD. However, its solutions do not present a wide coverage of the search space for instances with a low number of requirements. If diversity in the Pareto front is a required aspect, NSGA-II is more recommended for scenarios composed of a small number of requirements.

To support replication, all the datasets, results, and source code are available online for public access1. In addition, an interactive 3D chart version of Fig. 8 is available to provide a better visualization of Pareto front for all the scenarios.

  • RQ 2: What is the subjective benefit when considering the DM’s preferences as an objective to be optimized?

To evaluate such a subjective benefit, we analyzed the subjective evaluation provided by each participant for the non-preference-based and preference-based solutions in the participant-based experiment. Figure 10 shows that 7 out of 10 participants evaluated the preference-based solution as more satisfactory than the non-preference-based solution. In addition, more than half of the participants considered the preference-based solution “Effective” or “Very effective,” while 5 participants judged the non-preference-based solution “Ineffective” regarding their subjective interests. Only participant #2 negatively evaluated the preference-based solution as “Ineffective.” The relation between the inclusion of the preferences and the subjective evaluation is investigated in the next research question.

Fig. 10

Subjective evaluations given by the participants to the non-preference-based and preference-based solutions

Therefore, answering RQ 2, this analysis suggests a considerable benefit of considering the DM’s preferences as an objective of the optimization process. In addition, Table 5 provides more information, namely the number of preferences added and the time spent to perform the experiment by each participant.

Table 5 Number of preferences and time for each participant on the experiment

On average, there were 5 preferences per participant. As can be seen, participant #3 added the smallest number of preferences, only one, while participant #9 added 13 preferences, the greatest quantity. In relation to the time taken to perform the experiment, each participant took on average 23 min. Participant #7 took the greatest amount of time, 35 min, while #8 was the one who took the smallest amount of time, 5 min.

  • RQ 3: What is the relation between the inclusion of the preferences and the DM’s subjective evaluation of the final solution?

The subjective evaluation is provided by each participant for the preference-based solutions, while the value of the Pref(S,T) objective is obtained by Eq. (4). As seen in Fig. 11, we trace a fitted line through the solutions, indicating that there is a correlation between these variables. However, a metric was necessary to evaluate the intensity of this relation because some solutions were visually far apart.

Fig. 11

Distribution of preference-based solutions based on subjective evaluation and Pref

To evaluate the correlation between the subjective evaluation and Pref, we used Spearman’s rank coefficient [36]. This coefficient takes values between −1 and 1 and indicates no correlation when it is equal to 0; the farther from 0 the value is, the more correlated the series are. The value \(r_s\) calculated for the data series is 0.74, indicating that these values are directly proportional. The two-tailed p value of 0.0144 also suggests that this correlation is statistically significant at a significance level of α = 0.05.
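
As an illustration (with made-up data, not the study’s measurements), the coefficient and its two-tailed p value can be obtained with SciPy:

from scipy.stats import spearmanr

# Hypothetical per-participant pairs: subjective evaluation (1-5 scale) and Pref value.
subjective = [4, 5, 2, 4, 3, 5, 4, 3, 5, 4]
pref = [0.8, 0.9, 0.3, 0.7, 0.5, 1.0, 0.6, 0.4, 0.9, 0.7]
rho, p_value = spearmanr(subjective, pref)
print(f"r_s = {rho:.2f}, two-tailed p = {p_value:.4f}")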

In Fig. 12, the subjective evaluation values are often close to the corresponding Pref ones. The cases in which they are not may suggest that there are other aspects that influence the DM’s satisfaction and are not covered by the available types of preferences in this work.

Fig. 12

Relation between subjective evaluation and the Pref value

Regarding the analysis presented above, we can conclude that there is a directly proportional relation between the subjective evaluation and the objective Pref. However, some samples suggest that the preference types adopted may not cover all of the DM’s wishes.

Participants’ feedback

At the end of the participant-based experiment, each participant was invited to answer a feedback questionnaire about the experience using the tool. We asked four questions covering different aspects of usability (three objective questions and one subjective question):

  • Q 1: How effective do you judge the experience of interactively assisting the tool to plan the releases?

  • Q 2: How easy was it to express your opinions considering the available preferences?

  • Q 3: Would you use this tool in your workplace?

  • Q 4: What changes would you suggest regarding the tool interface?

First, for Q 1, 80% of the participants selected “Effective” or “Very effective” on a scale of “Very ineffective,” “Ineffective,” “Indifferent,” “Effective,” and “Very effective.” Complementing this result, for Q 2, 50% considered it “Easy” to express the preferences on a scale of “Very hard,” “Hard,” “Indifferent,” “Easy,” and “Very easy.” These answers reinforce the conclusions reached in RQ 2 about the subjective benefit of considering the DM’s preferences in the optimization process.

We used a scale from 1 (“No way”) to 5 (“Certainly”) for Q 3. Four participants rated it 5, another four rated it 3, and only one rated it 2. This feedback encourages the investigation of the presented tool in a real-world release planning scenario.

Regarding the subjective question (Q 4), the answers were generally divided between improving the requirements allocation visualization and providing a better way to adjust the weight configuration for each one of the objectives.

Threats to validity

Below, we discuss the threats to the validity of our empirical evaluation, classifying them into internal, external, construct, and conclusion validity [37].

Taking into account the internal characteristics of the experiments, we note that preliminary tests were carried out to define the search technique parametrization. However, specific settings of a given algorithm may obtain better results for some instances. Despite the fact that two datasets were based on real data, some information had to be randomly generated (risk values and number of releases); that is, they do not represent a fully real-world scenario. The risk of implementing each requirement, which did not originally exist, was manually defined by a Developer and appended to the instances. The number of releases was changed from 3 to 5 and 8 for dataset-1 and dataset-2, respectively. This choice was made to increase the variation of the DM’s preferences.

Regarding the participant-based experiment, the participants may have changed their behavior because they knew that they were under evaluation, consistent with the Hawthorne effect [38]. To mitigate this problem, the participants received an explanation about the approach but not about the assumptions that were under investigation.

We believe that our empirical study has a weakness regarding the generalization of the achieved results. For instance, the datasets based on real information have few requirements, which makes it hard to conclude that the results would be similar for large-scale instances. Such a circumstance was the motivation to generate and use the artificial dataset with a large number of requirements. A similar problem is encountered in the participant-based experiment because the number of participants was not high enough to represent expressive scenarios.

Concerning the construct validity threats, the metrics that we used to estimate the clients’ satisfaction and the overall risk are based on values that are defined a priori by the development team. These estimated values may vary as the project goes on, which requires rerunning the approach to adapt to these changes. Despite this limitation, this strategy is widely used in the literature, such as in [3, 39, 40]. In addition, it is known that meta-heuristics can vary their final solutions and execution time according to the instance. Unfortunately, we did not investigate time concerns. We also did not provide an extended explanation of the evaluated metrics; nevertheless, all of them have been widely used in related works as well as in the multi-objective optimization literature. Still considering the participant-based experiment, the main metric used to measure the satisfaction of each participant was a subjective evaluation provided for the final solution on a scale of “Very ineffective,” “Ineffective,” “Indifferent,” “Effective,” and “Very effective.” This feedback may not properly represent the DM’s feelings.

Finally, the threats to the validity of the experiment’s conclusions are mainly related to the characteristics of the algorithms that were investigated. Meta-heuristics present a stochastic behavior, and thus distinct runs may produce different results for the same problem. Aiming at minimizing such a weakness, the search algorithms were executed 30 times in the automatic experiment for each combination of dataset, scenario, and the DM’s preferences. Given all the obtained results, we conducted statistical analyses as recommended by Arcuri and Briand [41]. Regarding the conclusions of the participant-based experiment, some of them may be affected by each participant’s level of understanding after receiving the explanation about the study, as well as by their experience with release planning using automated tools.

Conclusions

Release planning is one of the most complex and relevant activities performed in the iterative and incremental software development process. Recently, SBSE approaches have been discussed that combine the strength of computational intelligence with human expertise. In other words, they allow the search process to be guided by the human’s knowledge and, consequently, provide valuable solutions to the decision maker (DM). Thus, we claim the importance of providing a mechanism to capture the DM’s preferences in a broader scope, instead of just requiring a weight factor, for instance. Besides increasing the human’s engagement, the DM progressively gains more awareness of how feasible the preferences are.

The evaluated multi-objective approach consists of treating the human’s preferences as another objective to be maximized, as well as maximizing the overall client satisfaction and minimizing the project risk. In sum, the DM defines a set of preferences about the requirements allocation, which are stored in a preference base responsible for influencing the search process.

Therefore, we have significantly extended our previous work through new experimental analyses considering both simulated and real human evaluations. The automatic experiment points out that NSGA-II obtained overall superiority in two of the three datasets investigated, positioning itself as a good search technique for smaller scenarios, while IBEA showed a better performance for the large dataset, since the algorithm’s loss of initial diversity decreases as the number of requirements increases. Regarding the participant-based experiment, it was found that 7 out of 10 participants evaluated the preference-based solution better than the non-preference-based one, encouraging the investigation of the presented tool in a real-world release planning scenario. In addition, we made available a novel tool for the release planning process that is able to incorporate human preferences during the optimization process1.

As future work, we intend to evolve our GUI to provide more intuitive interaction, solution visualization, and preference specification by the DM. We also intend to compare our approach with other search-based proposals that explore human preferences in the optimization process.

Endnote

1 Webpage: http://goes.uece.br/raphaelsaraiva/multi4rp/en/.