An Autonomous Performance Testing Framework using Self-Adaptive Fuzzy Reinforcement Learning

Test automation can reduce cost and human effort. If the optimal policy (the course of actions taken) for the intended objective in a testing process can be learnt by the testing system (e.g., a smart tester agent), then it can be reused in similar situations, leading to higher efficiency, i.e., less computational time. Automating stress testing to find performance breaking points remains a challenge for complex software systems. Common approaches are mainly based on source code or system model analysis, or on use-case-based techniques. However, source code or system models might not be available at testing time. In this paper, we propose a self-adaptive fuzzy reinforcement learning-based performance (stress) testing framework (SaFReL) that enables the tester agent to learn the optimal policy for generating stress test cases leading to the performance breaking point, without access to a performance model of the system under test. SaFReL learns the optimal policy through an initial learning phase, then reuses it during a transfer learning phase, while keeping the learning running in the long term. Through multiple experiments in a simulated environment, we demonstrate that our approach generates stress test cases for different programs efficiently and adaptively without access to performance models.

Q-learning is the main learning procedure. To improve the efficacy of the learning in terms of efficiency and adaptivity to various software systems, we formulated the state space of the systems as a fuzzy state space and used an adaptive action selection strategy. Fuzzy state modelling addresses the uncertainty in defining sharp values and boundaries in state modelling, and the adaptive action selection strategy makes the learning adaptive to varying conditions. The proposed test framework assumes two phases of learning, i.e., initial and transfer learning. In the initial learning phase, it learns the optimal policy on the first SUT; in the transfer learning phase, it reuses the learnt policy for newly observed SUTs with performance sensitivity analogous to already observed ones, while still keeping the learning running in the long term. We demonstrated the efficacy of our approach on homogeneous and heterogeneous sets of SUTs with respect to the performance sensitivity of the software programs. The SUTs used for experimental evaluation were a range of well-known benchmarks, run as single-thread programs. The experimentation was based on simulating the performance behavior of the benchmarks as SUTs characterized by various initial amounts of resources and different response time requirements according to different application contexts. In particular, this approach reduced the cost of generating stress test conditions (test case generation) by reusing the learnt policy on SUTs with detected similarity to already observed ones. Moreover, it could adapt its operational strategy to SUTs with different performance sensitivity effectively while preserving efficiency.
The rest of the paper is organized as follows: Section 2 discusses the background concepts and the motivation for proposing the smart performance testing framework based on an adaptive learning technique. Section 3 presents an overview of the architecture of the proposed test framework, while the technical details of the constituent parts are described in Sections 4 and 5. Section 6 explains the self-adaptive learning-based functionality of the framework. Section 7 presents the experimental evaluation and the results. A discussion of the results and lessons learned is presented in Section 8. Section 9 provides a review of the related work, and finally Section 10 concludes the paper and discusses some future directions.

Motivation and Background
Performance analysis, which can be realized through modeling and testing, is of great importance for performance-critical software systems in various domains. The main targets of performance analysis are assessing the performance of the system against the performance requirements, detecting functional problems occurring under certain execution conditions, and detecting violations of non-functional requirements under normal and stress execution conditions. Verifying the performance behavior of software systems under stress conditions, which is called stress testing, in order to assess robustness and find performance breaking points of the system, is one of the main activities involved in performance testing.
Violations of performance requirements are considered anomalies in the performance behavior of a software system. A performance anomaly or performance requirement violation occurs due to a performance bottleneck [25]. A component acts as a bottleneck because of limitations associated with it, such as saturation and contention. A system or resource component saturates upon full utilization of its capacity or when its utilization exceeds a usage threshold [25]. Capacity expresses the maximum processing power, service rate, or storage size. Contention occurs when multiple processes contend for access to a limited number of shared components, including resource components such as CPU cycles, memory and disk, or software (application-level) components.
Ibidunmoye et al. [22] described the causes of performance bottleneck emergence in terms of primary and secondary causes. Primary causes express what leads to the emergence of a bottleneck component, and secondary causes describe the limitations (critical conditions, i.e., saturation, contention and failure) associated with bottleneck components. Their categorization of the primary causes can be summarized as three groups, i.e., application-wise, platform-wise and workload-wise cause-effect relationships. Application-wise cause-effects express issues such as buggy code and system architecture faults. Platform-wise cause-effects describe issues related to hardware resources, the operating system and the execution environment. High deviations from the expected workload intensity and similar issues, described as workload burstiness, are labeled workload-wise cause-effects.
Regardless of the causes of performance bottlenecks, detecting performance requirement violations and finding performance breaking points are challenging, particularly for complex software systems. To address this challenge, we need to find how to provide the critical execution conditions that make performance bottlenecks happen. The focus of performance testing, in terms of stress testing in our case, is verifying the robustness of the system and finding the performance breaking point by manipulating the external causes of performance bottlenecks, including platform-wise and workload-wise ones. The effects of internal causes, i.e., application/architecture-wise ones, are different: they may vary due to continuous changes and updates of the software system (Continuous Integration/Continuous Delivery) and may even differ across platforms and execution environments. Therefore, the complexity of the SUT and the internal affecting factors make it hard to build a precise performance model expressing the effects of the causes. This is a major barrier motivating the use of model-free learning-based approaches, in which the optimal policy for addressing the problem can be learnt indirectly through interaction with the system, without having a model of the system. Using a model-free learning-based approach, the agent can learn how to manipulate the external causes for various states (architectures/variants) of the SUT. The learnt policy can also be reused for other SUTs with similar performance sensitivity to resources.

Reinforcement Learning
The adaptive model-free learning-based stress testing proposed in our smart framework is an interactive testing approach: the agent first performs an initial learning, then reuses the learnt policy during a transfer learning phase for further SUTs with similar performance sensitivity. In the proposed framework, the tester agent interacts with one SUT and learns how to apply stress test conditions, in terms of manipulating resource availability, to find the performance breaking point; it then reuses the learnt policy for other SUTs with a similar type of performance sensitivity.
Reinforcement learning (RL) [39] is an interactive semi-supervised learning paradigm which has frequently been applied to building self-adaptive smart systems. It involves continuous interaction between the agent (learner) and the controlled system; the system under control in our case is the SUT. The agent continuously senses the state of the system, which is modelled in terms of response time and resource utilization in this paper, then selects an action, i.e., modifying the available resources, to be applied to the system. It receives a reinforcement signal as a scalar reward which shows the effectiveness of the applied action. The final objective of the agent in an RL problem is to find a policy which maximizes the total cumulative reward over time. The agent uses an action selection strategy based on a combination of trying out the available actions, i.e., exploration, and relying on previously gained experience to select highly-valued actions, which is called exploitation of experience. For applying an RL approach to a problem, it is generally assumed that the environment is non-deterministic and stationary with respect to transitions between the states of the system. Q-learning [39] is a well-known family of model-free RL algorithms in which the utility value of the long-term cumulative reward associated with pairs of states and actions is learnt. It is off-policy, which means that the agent learns the optimal policy regardless of how it explores the environment. After learning the optimal policy, the agent replays the learnt policy while still occasionally exploring the action space and trying out unexperienced actions.
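The exploration/exploitation mechanics and the off-policy update described above can be sketched in a few lines. This is a generic tabular Q-learning sketch, not SaFReL's implementation (the paper gives no code); state and action names and parameter values are chosen for illustration.

```python
import random

def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy temporal-difference update: the target uses the best
    q-value in the next state, regardless of which action is taken next."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def select_action(q, s, actions, epsilon=0.2):
    """Epsilon-greedy: explore a random action with probability epsilon,
    otherwise exploit the highest-valued action for state s."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda b: q.get((s, b), 0.0))
```

The `q` dictionary maps (state, action) pairs to utility values and plays the role of the agent's experience base.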

System Architecture
This section provides a general overview of the architecture of our smart stress test case generation framework for different types of software systems. Fig. 1 shows the architecture of the proposed self-adaptive fuzzy reinforcement learning-based (SaFReL) stress testing framework. It contains the following main parts:

Fig. 1 SaFReL architecture

1. Fuzzy State Detection. State detection, which is conducted by the green components in Fig. 1, is one of the main steps of reinforcement learning. The agent observes the current state of the controlled system, i.e., the SUT in this case, at discrete time steps. The agent measures the values of four quality metrics, i.e., 1) response time, and utilization improvements of 2) CPU, 3) memory, and 4) disk, to estimate the status of the SUT. They show how much the agent has been able to put the system in a stress condition. These measurements, combined with fuzzy rules in a fuzzy inference engine, are used to classify the state of the system into fuzzy states. The ordinary approach in RL problems is dividing the state space of the system into multiple mutually exclusive states; at each time, the system must be in one distinct state. The challenges of such a crisp categorization include knowing how suitable a value is as a threshold between categories of a metric, and how to treat boundary values between categories. Fuzzy logic, using fuzzy rules and fuzzy operations, can address these challenges. We used fuzzy classification as a soft labeling technique for presenting the values of the metrics expressing the state of the system. Detecting the fuzzy state of the SUT is done by the fuzzification, fuzzy inference and rule base modules in our framework for smart stress testing.
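The soft-labeling idea can be illustrated with trapezoidal membership functions: a boundary value belongs partly to two adjacent categories instead of being forced into one. The fuzzy sets and breakpoints below are assumptions for illustration, not the ones used by the framework.

```python
def trap(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c],
    linear shoulders in between."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def fuzzy_label(x):
    """Soft labels for a normalized metric value in [0, 1]; the
    breakpoints (0.3, 0.4, 0.6, 0.7) are illustrative assumptions."""
    return {"Low": trap(x, 0.0, 0.0, 0.3, 0.6),
            "High": trap(x, 0.4, 0.7, 1.0, 1.0)}
```

For example, a mid-range value such as 0.5 receives non-zero membership in both "Low" and "High", which is exactly the boundary case a crisp threshold handles poorly.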
2. Action Applying and Strategy Adaptation. The action selection, stored experience, and strategy adaptation modules, shown as yellow components in Fig. 1, form the action applier and strategy adapter of our framework. The smart agent uses reinforcement learning as a semi-supervised learning procedure to modify the platform-wise affecting factors and thereby provide the test conditions. After each step of state detection, the agent recommends an action, e.g., adding or removing a fraction of the available resources. The detected fuzzy state and the computed reward of the action applied in the previous step are the inputs to the action applier part of the framework, which is composed of the action selection and stored experience modules. The tester agent selects actions based on an action selection strategy providing a trade-off between exploration of the available actions and exploitation of the stored experience. In other words, the action selection strategy determines to which degree the agent should try out the available actions versus select high-valued actions based on the stored experience. The smart tester agent is augmented with a strategy adaptation module which is responsible for adapting the degree of exploration and exploitation in the action selection strategy to the various types of observed SUTs. Software programs have different levels of sensitivity to the resources, one of the main factors affecting performance. SUTs with different performance sensitivity to resources, i.e., CPU-intensive, memory-intensive or disk-intensive, react to the modification of resources differently. Therefore, when the agent observes a SUT instance which differs from the previously observed instances in terms of performance sensitivity, the strategy adaptation module puts more weight on exploration rather than exploitation.
An indication of the type of resource sensitivity of the SUT, regarding being CPU-intensive, memory-intensive or disk-intensive, is the input to the strategy adaptation module.
3. Reward Computation. This part, shown as a red block in Fig. 1, is responsible for calculating the reward (reinforcement) signal of the applied actions. It shows the desirability and effectiveness of the actions in moving towards the learning goal.

Fuzzy State Detection
Detecting the state of the system during the learning is based on measuring four quality metrics. The resource utilization improvements and the SUT response time were chosen as representatives to indicate the state of the system. As shown in Fig. 2, a three-dimensional space of parameters representing the resource usage improvements, together with the response time of the SUT, models the state space of the system. Due to the uncertainty in defining sharp boundaries in the state space, fuzzy classification was applied to specify the categories. Therefore, the system can be in one or more states at the same time, with various degrees of certainty. We applied fuzzy inference [40, 41] to detect the state of the system at each time step, and defined the required steps and operations, including input fuzzification, fuzzy rules, fuzzy operators and the implication method, in Section 4.1 for the fuzzy state inference process.

Fuzzy State Space Modelling
As described in the previous section, a set of quality measurements, consisting of the CPU, memory and disk utilization improvements and the response time of the SUT, indicates the state of the system. The values of these measurements are not bounded, so to simplify the inference and facilitate the exploration of the state space, we normalized the values of these parameters to the interval [0, 1].

Fuzzy Rules. After fuzzification of the inputs, fuzzy rules, which are built based on domain knowledge, facilitate the fuzzy inferencing, i.e., reasoning about the possible states that the system assumes. The fuzzy rules, as shown in Eq. 3, consist of two parts, i.e., antecedent and consequent. The former is a combination of linguistic terms of the input parameters, i.e., the normalized quality measurement parameters, and the consequent is a fuzzy set with a membership function showing to what extent the system is in the associated state.
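As an illustration of the normalization step described at the start of this section (the paper's exact normalization functions are not reproduced here), a common bounded map such as x/(x + 1) can squash an unbounded non-negative measurement into [0, 1); the function below is an assumed stand-in, not the paper's formula.

```python
def normalize(x):
    """Map an unbounded non-negative measurement into [0, 1).
    Assumed form; the paper's exact normalization functions are not shown."""
    return x / (x + 1.0)
```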
Rule R1: If CUI is High AND MUI is High AND DUI is Low AND RT is Acceptable, then State is HHLA. (Eq. 3)

Fuzzy Operators. If the antecedent of a rule is made of multiple linguistic terms, then fuzzy operators are applied to obtain one number showing the support or activation degree of the rule. Two well-known methods for the fuzzy AND operator are minimum (min) and product (prod). In our case, we used the min method for the fuzzy AND operation: given a set of input parameters A, the support degree of rule R_i is μ_{R_i}(A) = min_{u ∈ A} μ_{F_u}(u), where u is an input parameter in A and F_u is its associated fuzzy set in rule R_i.

Implication Method. After obtaining the membership degree of the antecedent, the membership function of the consequent is reshaped using an implication method. There are two well-known methods for the implication process, minimum (min) and product (prod), which truncate and scale the membership function of the output fuzzy set, respectively. The membership degree of the antecedent is given as input to the implication method; we used the min method. In our case, based on the number of fuzzy sets specified over the domains of the quality parameters, we had 24 rules in total. Then, given the output of the inference system as a set of state-degree pairs {(s_k, μ_k)}, the rule with the maximum activation degree is selected to decide the final fuzzy state of the system. Fig. 3 shows the representation of the fuzzy states of the system.
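The rule evaluation with the min AND operator and selection of the most activated rule can be sketched as follows; the data layout (dictionaries of memberships and rules) and the state labels in the usage below are illustrative, not the framework's internal representation.

```python
def rule_support(memberships, antecedent):
    """Fuzzy AND via min: activation degree of one rule.
    memberships: {parameter: {linguistic term: degree}}
    antecedent:  {parameter: linguistic term}"""
    return min(memberships[p][t] for p, t in antecedent.items())

def infer_state(memberships, rules):
    """Evaluate all rules and return the state of the most activated
    rule together with its activation degree."""
    scored = [(rule_support(memberships, ant), state)
              for ant, state in rules]
    degree, state = max(scored)
    return state, degree
```

For instance, with memberships CUI:High=0.8, MUI:High=0.7, DUI:Low=0.9, RT:Acceptable=0.6, rule R1 above fires with degree min(0.8, 0.7, 0.9, 0.6) = 0.6.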

Adaptive Action Selection and Reward Computation
Actions. The set of applied actions involves operations changing the platform-wise factors affecting the performance, i.e., the available resources such as compute (CPU), memory and disk capacity. In the current prototype, the set of actions consists of operations reducing the available resource capacity, where A_n, C_n, M_n and D_n represent the set of actions and the currently available compute (CPU), memory and disk capacity at time step n, respectively. The list of actions is shown in Table 1. ε-greedy and Softmax are well-known methods used for action selection in RL algorithms. They are intended to provide trade-offs between exploration and exploitation of the gained experience over the state-action space. In SaFReL, we used ε-greedy as the basic action selection strategy, and the proposed strategy adaptation feature acts as a simple meta-learning mechanism intended to improve the performance of the learning by adaptively changing the action selection strategy based on detected differences between the observed SUTs. Upon observing a SUT instance with a performance sensitivity different from the already observed ones, it adjusts the value of the parameter ε to make the agent select actions more through exploration, while it makes the agent strive for more exploitation of the stored experience upon detecting SUT instances that are similar to the previous ones.
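A resource-reduction action of the kind described above can be sketched as removing a fraction of one resource's currently available capacity; the resource names and the fraction in the usage below are illustrative assumptions (the concrete actions are listed in Table 1).

```python
def apply_action(capacity, action):
    """Apply a resource-reduction action: remove a given fraction of the
    current available capacity of one resource. The original capacity
    mapping is left untouched."""
    resource, fraction = action
    new = dict(capacity)  # copy, so the previous state stays intact
    new[resource] = capacity[resource] * (1.0 - fraction)
    return new
```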
It recognizes the similarity between SUT instances by calculating the cosine similarity between the performance sensitivity vectors of the SUT instances, using Eq. 7.
where Sen^k represents the sensitivity vector of the kth SUT instance and Sen_i^k represents the ith component of vector Sen^k. The sensitivity vector contains the values of the sensitivity indications of the SUT instance with respect to the resources.
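The cosine similarity of Eq. 7 over two sensitivity vectors is the standard dot-product formulation:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sensitivity vectors (Eq. 7):
    dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Two SUTs sensitive to the same resource (e.g., both purely CPU-bound) yield similarity 1, while SUTs sensitive to disjoint resources yield 0.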
Reward Signal. The agent receives a reward signal indicating the effectiveness and desirability of the applied action in each learning step. We derived a utility function as a weighted linear combination of two functions expressing the response time deviation and the resource usage, R = β·R^{rt} + (1 − β)·R^{u}, where R^{rt} shows the deviation of the response time from the response time requirement, R^{u} indicates the efficiency of the resource usage, and β, 0 ≤ β ≤ 1, is a parameter allowing to prioritize different aspects of the target stress conditions, i.e., response time deviation or limited resource availability. R^{rt}, showing the compliance of the response time with the requirement, was defined relative to RT^{req} and B, the response time requirement and the breaking point threshold of the SUT, respectively. R^{u}, representing the resource utilization in the reward signal, is a weighted combination of the resource utilization values.
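Since the exact reward equations are not legible in this copy, the sketch below is one assumed concrete instance of the described weighted combination: a response-time term that grows from 0 at the requirement to 1 at the breaking point, and a weighted resource-utilization term, mixed by β. All numeric shapes here are assumptions, not the paper's definitions.

```python
def reward(rt, rt_req, breaking_factor, usage, beta=0.5):
    """Assumed concrete form of the weighted reward signal.
    rt, rt_req:      measured response time and its requirement
    breaking_factor: breaking point as a multiple of the requirement
    usage:           list of (weight, utilization) pairs per resource"""
    # Response-time term: 0 at the requirement, clipped to 1 at the
    # breaking point (rt_req * breaking_factor)
    r_rt = (rt - rt_req) / ((breaking_factor - 1.0) * rt_req)
    r_rt = min(max(r_rt, 0.0), 1.0)
    # Resource term: normalized weighted combination of utilizations
    r_u = sum(w * u for w, u in usage) / sum(w for w, _ in usage)
    return beta * r_rt + (1.0 - beta) * r_u
```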

Performance Testing using Self-Adaptive Fuzzy Reinforcement Learning
In this section, we describe the details of the operational procedure of our framework, SaFReL, for providing stress test conditions for various types of SUTs. A tester agent with the SaFReL architecture learns how to generate stress test conditions for different types of software systems automatically. The main steps of a learning trial for a SUT instance are as follows. The agent measures the quality parameters and receives the state-membership degree pair (s_n, μ_n) from the fuzzy state detection part, where s_n is the fuzzy state of the system and μ_n shows to what extent the system has assumed that state. The output of the state detection is the fuzzy state with the maximum membership degree. Then, the agent selects one action, a_n ∈ A_n, based on the stored experience or by exploring the set of available actions. The agent applies the selected action to the system and re-executes the SUT. In the next step, after the re-execution of the SUT, the agent detects the new state of the SUT, (s_{n+1}, μ_{n+1}), and receives a reward signal, r_{n+1} ∈ ℝ, showing the effectiveness of the applied action. Then, after detecting the new state and receiving the reward, it updates the stored experience. The whole procedure is repeated until the stopping criteria are met. In SaFReL, as a smart stress test case generator, the stopping condition is reaching the performance breaking point threshold of the SUT, in our case 1.5 times the response time requirement. The experience of the agent is defined as the policy which the agent learns. A policy is a mapping between states and actions and specifies the probability of taking action a in a given state s. Thus, the final objective of the learning is to find a policy maximizing the expected long-term cumulative reward achieved over the further learning trials, E[Σ_n γ^n r_{n+1}] [39], where γ is a discount factor specifying to what extent the agent weights future rewards compared to the immediate one.
We used Q-learning as the model-free RL algorithm in our framework. In Q-learning, a utility value Q(s, a) is allocated to each pair of state and action [39]. The q-values Q(s, a) form the experience base of the agent, on which the agent relies for action selection. The q-values are updated incrementally during the learning via temporal differencing. Given the fuzzy state detection, we made a slight change to the typical updating of q-values, using the following heuristic to include the membership degree μ_n of the detected state:

Q(s_n, a_n) ← Q(s_n, a_n) + α μ_n (r_{n+1} + γ max_a Q(s_{n+1}, a) − Q(s_n, a_n))

where α, 0 ≤ α ≤ 1, is the learning rate, which adjusts to what extent the new utility values affect the previous q-values. Finally, the agent finds the optimal policy, which suggests the action maximizing the utility value for a given state s: π*(s) = argmax_a Q(s, a) (Eq. 14). The agent selects the action based on Eq. 14 when it is supposed to exploit the stored experience. SaFReL assumes two learning phases, i.e., initial learning and transfer learning.

Initial learning. Initial learning occurs during the interaction with the first observed SUT instance. The initial convergence of the experience base takes place over the learning episodes of the initial learning. The agent stores the obtained experience and repeats the learning episode multiple times on the first SUT instance to achieve the initial convergence of its experience base.

Transfer learning. SaFReL goes through the transfer learning phase after the initial convergence of the experience base (initial learning). During this phase it keeps the learning running, uses the learnt policy upon observing SUT instances with similar performance sensitivity, and updates the experience base upon detecting new SUT instances with performance sensitivity different from the previously observed ones. Strategy adaptation is used in the transfer learning phase and helps the agent adapt to varying SUT instances.
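The membership-weighted heuristic can be sketched as follows; scaling the temporal-difference term by the membership degree μ of the previous fuzzy state is the placement assumed in this sketch.

```python
def fuzzy_q_update(q, s, mu, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-update scaled by the membership degree mu of the detected fuzzy
    state: the more certainly the system was in state s, the more the
    temporal-difference term moves the stored q-value."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * mu * (r + gamma * best_next - old)
```

With mu = 1 this reduces to the standard Q-learning update; with mu = 0 (the system was not in that state at all) the q-value is left unchanged.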
Listing 1 and 2 depict the entire operational procedure of SaFReL including both initial learning and transfer learning phases.
Performance of learning. Different action selection strategies, like ε-greedy with various ε values and Softmax methods based on the Boltzmann distribution, could be used as the action selection strategy. They provide different types of trade-offs between exploration and exploitation of experience, which can impact the efficiency of the learning, e.g., in terms of convergence speed. Moreover, choosing different values for the learning parameters, including the learning rate α and the discount factor γ, can affect the performance of the learning differently.

Algorithm 1. SaFReL: Self-adaptive Fuzzy Reinforcement Learning-based Stress Testing
Required: S, A, α, γ
1. Initialize q-values, Q(s, a) = 0 ∀s ∈ S, ∀a ∈ A, and ε = υ, 0 < υ < 1
2. Select an action using the action selection strategy (e.g., ε-greedy: select a_n = argmax_{a ∈ A} Q(s_n, a) with probability (1 − ε), or a random a' ∈ A with probability ε)
3. Apply the selected action and re-execute the SUT
4. Detect the new fuzzy state-degree (s_{n+1}, μ_{n+1}) of the system resulting from the applied action
5. Receive the reward signal, r_{n+1}
6. Update the q-value of the pair of previous state and applied action
7. Repeat until the end of the learning episode, i.e., until meeting the stopping criteria (violation of the performance requirement or finding the performance breaking point)
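The steps of Algorithm 1 can be expressed as an episode skeleton in which the framework modules (state detection, action selection, execution, reward, q-update, stopping check) are passed in as callables; all names and the wiring are assumptions of this sketch, not SaFReL's actual interfaces.

```python
def stress_episode(q, detect, select, apply_and_rerun, reward_of, update,
                   breaking_point_reached, max_trials=100):
    """One learning episode: detect -> select -> apply -> detect -> reward
    -> update, repeated until the stopping criterion (or a trial cap)."""
    s, mu = detect()                        # step: current fuzzy state-degree
    trials = 0
    while not breaking_point_reached() and trials < max_trials:
        a = select(q, s)                    # step 2: action selection
        apply_and_rerun(a)                  # step 3: modify resources, re-run SUT
        s_next, mu_next = detect()          # step 4: new fuzzy state-degree
        r = reward_of()                     # step 5: reward signal
        update(q, s, mu, a, r, s_next)      # step 6: q-value update
        s, mu = s_next, mu_next
        trials += 1
    return trials
```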

Evaluation
In this section, we present the experimental evaluation of the proposed self-adaptive fuzzy reinforcement learning-based stress testing framework, SaFReL. We assessed the efficacy and applicability of SaFReL on different types of SUT programs. The goal of this experimental evaluation is to assess how efficiently SaFReL can conduct stress testing on different software programs without having a performance model of the system. The following subsections describe the experimental setup, the evaluation metrics, and the two analysis scenarios designed to evaluate the efficacy of SaFReL.

Experiments Setup
In this study, all experiments were done in an experimental testbed environment simulating the performance behavior of different software programs. The testbed environment is based on an analysis of the resource usage signatures of different software programs, obtained from [42]. The analysis proposed by Taheri et al. [42] quantifies the sensitivity of programs to three different resources, i.e., CPU, memory and disk, using a profiling method. We refer to these sensitivity values as Sen_c, Sen_m and Sen_d, respectively. For example, a value of 1 for Sen_c shows a major dependency on CPU resources, i.e., any reduction of CPU availability would considerably decrease the performance of the program. Taheri et al. originally used the sensitivity analysis to model and predict performance changes of different programs running on co-located virtual machines (VMs) in virtualized data centers, i.e., clouds. In cloud infrastructures, the performance of services is highly dependent on the available resources and the sharing of physical resources among VMs. We used the sensitivity values reported by Taheri et al. for 12 well-known benchmarks and applied their performance prediction approach in our experimental testbed environment.
In their performance prediction approach, the throughput, and consequently the response time, of a program is estimated based on the current/actual CPU, memory and disk utilization in proportion to the corresponding initial utilization values in an isolated execution run. They proposed an accurate estimation (≈90% accuracy compared to actual values [42]) of the program throughput upon applying resource limiting conditions, in which Sen_c^j, Sen_m^j and Sen_d^j represent the CPU, memory and disk sensitivity values of program j, and Th_j represents the nominal throughput of program j in an isolated, contention- and limitation-free environment. The response time of program j is assumed to be RT_j = 1/Th_j in our testbed environment. Table 2 shows the used benchmarks and the corresponding sensitivity values. The collection listed in Table 2 includes CPU-intensive, memory-intensive and disk-intensive types of programs, as well as programs characterized by different combinations of these resource sensitivities. Our experimental testbed simulated the performance behavior of different instances of the benchmarks, characterized by various initial amounts of resources and different response time requirements according to different application contexts. The proposed SaFReL was implemented and incorporated into the testbed environment to provide smart self-adaptive stress testing. Two analysis scenarios were designed to evaluate the proposed framework from different perspectives. The first analysis scenario focused on the efficacy and applicability of the framework on the benchmark/SUT programs. The second analysis scenario studied the sensitivity of the learning efficacy to the learning parameters.
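An assumed reconstruction of the throughput estimate (the exact equation of Taheri et al. [42] is not reproduced here): nominal throughput scaled by a sensitivity-weighted average of current-to-initial utilization ratios, with response time taken as the inverse of throughput, as stated above. The weighting scheme is an illustrative assumption.

```python
def estimated_throughput(th_nominal, sens, current_util, initial_util):
    """Assumed model: scale the nominal (isolated-run) throughput by a
    sensitivity-weighted average of current/initial utilization ratios."""
    total = sum(sens.values())
    factor = sum(sens[r] * (current_util[r] / initial_util[r])
                 for r in sens) / total
    return th_nominal * factor

def estimated_response_time(th_nominal, sens, current_util, initial_util):
    # Response time taken as the inverse of the estimated throughput
    return 1.0 / estimated_throughput(th_nominal, sens, current_util,
                                      initial_util)
```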

Evaluation Metrics
The results of the efficacy and sensitivity analysis are described in terms of the following evaluation metrics: 1. Efficiency: The average number of learning trials needed by the tester agent to achieve the target, i.e., finding the performance breaking point. This metric shows the effort required to generate the proper test condition (test case) for finding the performance breaking point of the program.

2. Adaptivity: The average number of extra learning trials required, after the initial experience convergence in the initial learning, to achieve the target upon observing a SUT instance.

Efficacy Analysis
In this analysis, the efficacy of the proposed SaFReL was evaluated based on its efficiency in generating stress test conditions to find the performance breaking point of different SUTs, and on its adaptation capability upon observing SUTs with performance sensitivity different from previously observed ones. In the efficacy analysis, the behavior of SaFReL during both the initial and transfer learning phases was analyzed. During the efficacy analysis of the transfer learning phase, we considered both a set of SUT instances similar in their performance sensitivity to resources, i.e., similar in the main demanded resource (homogeneous SUTs), and a set of SUT instances differing in performance sensitivity (heterogeneous SUTs). The simulated SUT instances in our simulation environment received different amounts of CPU, memory and disk resources at the start of their executions. The resources, i.e., the number of CPU cores, memory and disk, were initialized with values in the ranges [1, 10], [1, 50] GB and [100, 1000] GB, respectively. The SUT instances in our analysis assumed various predefined response time requirements ranging from 500 to 3000 ms. The performance breaking points of the SUT instances were defined as 1.5 times their response time requirements. Different variants of the ε-greedy algorithm were studied as the action selection strategy during the efficacy analysis. The efficacy of SaFReL was analyzed using four variants of ε-greedy, i.e., ε = 0.2, ε = 0.5, ε = 0.85 and decaying ε, as well as our proposed adaptive ε selection method.
Initial learning. Fig. 4 shows the efficiency of SaFReL in terms of the number of learning trials required per learning episode during the initial learning, for four different configurations of ε. In Fig. 4, the blue plot shows the required number of learning trials per learning episode, and the red plot shows the average number of required learning trials. Table 3 presents, as a summarized view, the average efficiency of SaFReL using different variants of the action selection strategy in the initial learning. As shown in Fig. 4 and Table 3, using ε-greedy with ε = 0.2 results in the highest efficiency, i.e., the fewest required learning trials, while keeping the efficiency very close to the average value for more than 90% of the learning episodes.

Fig. 4 Efficiency of SaFReL in learning episodes during the initial learning

Transfer learning. After the initial learning, SaFReL tries to reuse the policy learnt during the initial learning in the interactions with further SUT instances, while still keeping the learning running. We used ε-greedy with ε = 0.2, as an efficient variant for the initial learning, during the rest of the experiments for the efficacy analysis of the transfer learning phase. The optimal policy achieved in the initial learning is not influenced by the action selection strategy used, since Q-learning is an off-policy algorithm, i.e., it eventually learns the optimal policy regardless of the action selection strategy. In the following sections, we investigate the efficacy of SaFReL in terms of the efficiency and adaptivity metrics when acting on homogeneous and heterogeneous sets of SUT instances.
I. Homogeneous set of SUTs. We chose CPU-intensive programs as a homogeneous set of SUT instances for this step of the analysis. We simulated the performance behavior of 50 instances of our CPU-intensive benchmarks, i.e., Build-apache, n-queens, John-the-ripper, Apache, Dcraw, Build-php, and X264, varying both the amount of initially granted resources and the response time requirements. Fig. 5 shows the efficacy of SaFReL acting on this homogeneous set of CPU-intensive SUT instances for four different configurations of ε. In Fig. 5, as in the previous plots, the blue plot shows the learning trials required by SaFReL in its interactions with the SUT instances, the red plot shows the average number of required learning trials in the transfer learning, and the grey plot shows the average number of learning trials in the initial learning. Table 4 presents the improvement in the average number of required learning trials (efficiency) of SaFReL in the transfer learning phase compared to the initial learning. This improvement shows the sufficiency of the learnt policy, which lets the agent perform fewer trials to find the performance breaking points of new cases.

II. Heterogeneous set of SUTs. In this part of the efficacy analysis, we studied the efficiency and adaptivity of SaFReL on a heterogeneous set of SUT instances containing various CPU-intensive, memory-intensive, and disk-intensive programs. We simulated the performance behavior of 50 SUT instances drawn from all our benchmarks. We then investigated the efficiency of SaFReL on this heterogeneous set using ε-greedy with ε=0.2, 0.5, 0.85, and decaying ε, as shown in Fig. 6.
When the smart agent acts on a heterogeneous set of SUTs, replaying the learnt policy is not sufficient for all observations; the agent needs more exploration upon observing new cases. Consequently, with the previous, typical configurations of ε, the average number of learning trials in the transfer learning phase is no longer less than or equal to that of the initial learning phase. Table 5 presents the difference between the average number of learning trials in the initial and transfer learning phases, i.e., the extra learning trials required in the transfer learning phase, when SaFReL acted on a heterogeneous set of SUTs. As described in Section 5, to improve the efficacy of SaFReL on a heterogeneous set of SUTs, we augmented it with a simple meta-learning feature that detects the heterogeneity of the SUT instances and adjusts the value of the parameter ε adaptively. In general, when the smart tester agent observes a SUT instance different from the previously observed ones, this feature steers the action selection strategy towards more exploration; upon detecting a SUT instance of the same type as the previous ones, it steers the strategy towards more exploitation.
As illustrated before, this added feature works by measuring the similarity between SUTs at two levels of observation and, based on the measured values, adjusting the value of the parameter ε. The threshold values of the similarity measures and the corresponding adjustments of ε used in this experimental analysis are described in Algorithm 3.

Algorithm 3. Adaptive ε selection used in the efficacy analysis
Fig. 7 shows the efficiency of SaFReL on a heterogeneous set of SUTs when using similarity detection together with the adaptive ε-greedy action selection strategy. Using adaptive ε selection decreased the number of extra learning trials to approximately 4, resulting in an efficiency improvement in the transfer learning phase. To complete the efficacy analysis of SaFReL on a heterogeneous set of SUTs under the different variants of the action selection strategy, including adaptive ε selection, we also studied its adaptivity. Fig. 8 depicts the adaptivity of SaFReL on a heterogeneous set of SUTs by plotting the required learning trials against the similarity values for the different configurations of ε; it shows how many learning trials are required for each detected degree of similarity between the current observation and the previously observed ones.
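The idea behind the adaptive ε selection can be sketched as below. The actual threshold values and ε levels are defined in Algorithm 3; the numbers used here are illustrative assumptions that reuse the studied ε values:

```python
def adaptive_epsilon(similarity, high_thr=0.7, low_thr=0.4,
                     eps_exploit=0.2, eps_mid=0.5, eps_explore=0.85):
    """Map the measured similarity between the current SUT and the
    previously observed ones to an exploration rate: high similarity
    means mostly exploiting the learnt policy, low similarity means
    exploring more. Thresholds and epsilon levels are illustrative."""
    if similarity >= high_thr:
        return eps_exploit      # familiar SUT: replay learnt policy
    if similarity >= low_thr:
        return eps_mid          # partly similar: mixed strategy
    return eps_explore          # novel SUT: explore more
```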

Sensitivity Analysis of Learning Efficacy
In this part of our analysis, we studied the impact of the learning parameters, i.e., the learning rate (α) and the discount factor (γ), on the efficacy of SaFReL on both homogeneous and heterogeneous sets of SUTs. For the sensitivity analysis, we implemented two sets of experiments that change one learning parameter while keeping the other constant. For the sensitivity analysis on a homogeneous set of SUTs, we used ε-greedy with ε=0.2 for both the initial and transfer learning phases; on a heterogeneous set of SUTs, we used ε-greedy with ε=0.2 for the initial learning and adaptive ε selection for the transfer learning. To study the impact of the learning rate, we set the discount factor to 0.5; to examine the impact of the discount factor, we kept the learning rate fixed at 0.1. Fig. 9 shows the sensitivity of SaFReL acting on a homogeneous (CPU-intensive) set of SUTs to changes in the learning rate and discount factor. Fig. 10 depicts the results of the sensitivity analysis of SaFReL on a heterogeneous set of SUTs.
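The two parameters varied in this analysis enter the learning through the standard tabular Q-learning update. A minimal sketch, with hypothetical state and action names and defaults matching the fixed values used in the sensitivity analysis:

```python
def q_update(q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.5):
    """One tabular Q-learning step: alpha weighs how strongly new
    experience overwrites the old estimate, while gamma discounts
    the value of the best follow-up action."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return q[(s, a)]
```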

Discussion
Efficacy Analysis. Using multiple experiments, we studied the efficacy of SaFReL in its two operational phases, i.e., 1) initial learning and 2) transfer learning, on both homogeneous and heterogeneous sets of SUTs derived from 12 benchmarks defined by [42]. We examined the effects of different action selection strategies on the efficacy of SaFReL in terms of efficiency and adaptivity. The results of the efficacy analysis of the initial learning, i.e., the initial convergence of the SaFReL experience base, presented in Fig. 4 and Table 3, showed that ε-greedy with ε=0.2 led to the fastest initial convergence, after ~7 learning episodes. Note, however, that ε-greedy with decaying ε yielded almost the same average efficiency, though it did not converge as fast as ε=0.2.
During the efficacy analysis of SaFReL in the transfer learning phase, we studied its efficiency and adaptivity while acting on homogeneous and heterogeneous sets of SUTs. The results of the experiments on a set of 50 CPU-intensive programs as our homogeneous set of SUTs (Fig. 5 and Table 4) showed that using ε-greedy with ε=0.2 as the action selection strategy in the transfer learning led to the best average efficiency. It caused SaFReL to rely more on the learnt policy during the transfer learning and thereby improved efficiency, since the performance sensitivities of the SUT instances in the homogeneous set were strongly similar. This similarity makes the strategy of reusing the previously learnt policy successful for operating efficiently on a homogeneous set of SUTs.
Furthermore, we studied the efficacy of SaFReL on a heterogeneous set of 50 SUTs containing different CPU-intensive, memory-intensive, and disk-intensive software programs. The results showed that choosing an action selection strategy without considering the heterogeneity among the SUT instances did not lead to the desired efficiency. Among the typical variants of the ε-greedy strategy, and in contrast to the homogeneous case, ε-greedy with ε=0.2, which relies mostly on the learnt policy, did not work well on a heterogeneous set of SUTs and led to the highest number of extra learning trials. As shown in Fig. 6 and Table 5, a fixed strategy that ignores the heterogeneity between SUTs could not perform efficiently. We therefore augmented the proposed RL-based approach with an adaptive, heterogeneity-aware action selection strategy that measures the similarity between the performance sensitivities of the observed SUTs and adjusts the ε parameter adaptively. Fig. 7 shows the efficiency of SaFReL on a heterogeneous set of SUTs when using the adaptive ε-greedy action selection strategy. As shown in Fig. 7, it reduced the average number of extra learning trials and produced a smoother efficiency trend. It kept the number of learning trials around the average efficiency of the initial convergence and enabled the agent to reuse its learnt policy according to the conditions, i.e., to use the learnt policy when it was useful and to explore more whenever required.
In the last part of the efficacy analysis, we extended our analysis by measuring the adaptivity of SaFReL when acting on a heterogeneous set of SUTs. Fig. 8 illustrates the adaptivity of SaFReL in terms of the required learning trials versus the detected similarity between SUTs. It shows better adaptivity when SaFReL uses the adaptive ε-greedy strategy. In addition to reducing the deviation of the average transfer learning efficiency from the average initial learning efficiency, this strategy kept the required learning trials for all observed SUT instances with similar performance sensitivity around the average efficiency of the initial learning. This demonstrates that, with the adaptive ε-greedy action selection strategy, SaFReL is able to detect where it should use the previously achieved policy and where it should update its experience and explore more.
Sensitivity Analysis of Learning Efficacy. During the sensitivity analysis experiments, we systematically investigated the effects of the learning parameters, i.e., the learning rate (α) and the discount factor (γ), on the efficacy of SaFReL on both homogeneous and heterogeneous sets of SUTs. For the homogeneous set of SUTs, the reduction of learning trials in the transfer learning compared to the initial learning is the determinant of efficiency, since we are interested in exploiting the already achieved policy. The results of our sensitivity analysis showed that setting the learning rate to lower values, such as 0.1, improved efficiency; in our analysis, a learning rate of 0.1 yielded the highest efficiency. Furthermore, regarding the sensitivity of SaFReL to the discount factor on a homogeneous set of SUTs, the experimental results showed that lower values of the discount factor resulted in better efficiency.
In the second stage of the sensitivity analysis, we studied the impact of changing the learning parameters on the efficacy of SaFReL on the heterogeneous set of SUTs. On a heterogeneous set, reducing the number of extra required learning trials and increasing the adaptivity of SaFReL is what matters. The results showed that, again, setting the learning rate to lower values such as 0.1 led to the fewest extra learning trials and hence the maximum adaptivity of SaFReL. Regarding the impact of the discount factor, the experiments showed that a value of 0.5 gave a desirable performance in terms of fewer extra learning trials and better adaptivity when acting on a heterogeneous set of SUTs.
Lessons Learned. The experimental evaluation of the proposed framework for autonomous performance testing showed how machine learning can guide performance testing towards automation and, one step further, towards being carried out in an autonomous manner. Common approaches for generating stress test conditions mostly rely on source code or system models, which raises availability limitations.
Moreover, drawing a precise model of the system that predicts its state under the affecting factors is hard, and such a model is usually not available. This leaves room for machine learning, particularly model-free learning techniques, to play a role. Model-free reinforcement learning is a machine learning technique that enables the learner to explore the environment (here, the behavior of the system) and learn the optimal policy to achieve the target (here, finding the performance breaking point) without having a model of the system. The learner stores the learnt policy and can replay it in situations analogous to previously observed ones. This reduces the effort the learner needs to achieve the target in further situations and consequently improves efficiency. These characteristics make model-free RL algorithms well suited to automating performance testing, particularly stress testing, and reduce the dependency on software models and performance models for applying stress testing to different software systems. A variety of application-, platform-, and workload-related factors affecting performance lead to varying execution states for the software system, and different execution states require different test conditions to find the performance breaking points of the SUT. In this context, a smart agent using model-free RL can learn how to provide the test conditions by adaptively modifying the external factors to find the performance breaking points under different execution conditions.
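The model-free interaction described here can be sketched as a single learning episode: the agent only needs a step function that returns the observed outcome, never a performance model of the SUT. All names and the termination signal below are illustrative assumptions:

```python
import random

def run_episode(env_step, actions, q, epsilon=0.2, alpha=0.1, gamma=0.5,
                max_trials=200, rng=random):
    """Run one episode of epsilon-greedy tabular Q-learning against an
    opaque environment. env_step(state, action) must return
    (next_state, reward, done); done signals that the performance
    breaking point has been reached."""
    state, trials = "initial", 0
    while trials < max_trials:
        trials += 1
        # epsilon-greedy action choice over current Q estimates
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q.get((state, a), 0.0))
        next_state, reward, done = env_step(state, action)
        # Q-learning update: no model of the SUT is needed
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state = next_state
        if done:
            break
    return trials
```

The returned trial count corresponds to the efficiency metric used throughout the efficacy analysis: fewer trials to reach the breaking point means a more efficient agent.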
Regarding the learning objectives, the agent indirectly learns the optimal policy to find the performance breaking point of the SUT efficiently under different and varying execution conditions. A further step is what we achieve through transfer learning: the agent can reuse its learnt policy in its interactions with further SUTs. Properly formulating the RL approach for the problem is of great importance for obtaining desirable learning performance. The models of the state space, the actions, and the reward function should be expressive enough to guide the agent throughout the learning and let it learn the optimal policy. However, RL approaches generally suffer from a long convergence (training) period. Therefore, heuristics and adjustment techniques that speed up the exploration of the problem state space, such as multiple cooperating agents, fuzzy state space modeling, and control mapping, can be helpful, particularly in highly dynamic problems. Nonetheless, this issue was not a concern in our case.
In general, automation, improved efficiency in terms of time and cost, and less dependency on models are the main implications to be expected of applying model-free RL algorithms to stress testing aimed at finding the performance breaking points of software systems. Moreover, regarding applicability, given the efficacy shown for the proposed approach on a homogeneous set of SUTs, stress testing of evolving software systems during the continuous integration/delivery (CI/CD) process and of software variants in software product lines are concrete examples of its application areas. Application-wise, the similarity between such variants is fairly considerable and can truly represent a homogeneous set of SUTs; given the desirable efficacy of this smart approach on homogeneous SUTs, these areas could benefit from it considerably.
Changes in Future Trends. As described previously, workload-related cause-effect relationships are one of the significant aspects considered in performance evaluation. Load testing is mainly applied to software systems based on a client-server architecture. Therefore, with the emergence of the serverless architecture, which incorporates third-party backend services (BaaS) and/or runs the server-side logic in stateless containers fully managed by providers (FaaS), a slight shift in the objectives of load testing is expected. Within the serverless architecture, the backend code runs without the need to manage and provision servers. For FaaS, for example, scaling, including resource provisioning and allocation, is done automatically by the provider whenever needed to preserve the response time requirement (SLA) of the application. In general, given the auto-scaling capability of many cloud platforms and the benefits of the serverless architecture, the objectives of stress testing might be influenced, because scalability at both the platform and application levels is handled automatically by the providers. Nevertheless, stress testing remains crucial for a wide range of systems, such as telecommunication infrastructure and critical applications running on embedded systems.

Related Work
Verifying non-functional properties such as performance against the associated requirements is crucial for many software systems in safety- and mission-critical domains. Performance is a non-functional aspect of software systems that plays a key role in the acceptance of performance-critical software systems. Verifying performance can be carried out at any step of the software development life cycle.
Performance modeling is a common approach for estimating performance metrics in order to verify performance requirements. It generally involves identifying the proper performance indices (metrics) and building a performance model expressing the relevant indices. Different model-driven engineering techniques, such as model verification, model refactoring, performance tuning, and configuring, can then be performed based on the performance model. There are various modeling notations specific to performance modeling, including queueing networks, Markov processes, Petri nets, process algebras, and simulation models [9,10,11,12]. Automating performance model generation has received considerable attention, to replace manual model generation and bring agile performance analysis to the early stages of the software development life cycle. There is an extensive literature in the field of performance modeling [13,14,15,16]. Performance testing is another family of approaches intended to address the objectives of performance analysis. Performance, load, and stress testing are terms often used interchangeably, although there are definitions/interpretations that distinguish between them [17]. Performance testing is crucial for many performance-critical software systems such as core banking systems and industrial control systems. Despite this need, there are reports of insufficiently systematic approaches to performance testing [1,26]. Load testing is generally specified as the assessment of the performance behavior of a software system under the expected field load. Detecting load-related functional problems of the SUT and violations of non-functional requirements, such as performance, reliability, or stability requirements, are the main objectives of load testing [17].
Stress testing is often defined as the robustness assessment of the SUT under extreme conditions such as heavy workload and limited resource availability. Performance testing can generally be considered an umbrella term that includes both load and stress testing. However, some of the literature proposes specific definitions to distinguish between these types of performance-related testing. For example, in [17] performance testing is defined as the measurement and/or evaluation of performance indices of the SUT, such as response time, throughput, and resource utilization. The objectives were also limited to verifying performance requirements and, in the absence of performance criteria, to the principle of "no worse than previous" [43]. The definitions of performance, load, and stress testing overlap in many areas. Nonetheless, the objectives of these performance-related testing types and the relevant work can be summarized as follows:

I. Measuring the performance metrics under different execution conditions, including different workload and resource configurations. This objective might also be satisfied through performance modeling in some cases, and the process can be carried out under both typical and stress conditions [29,44,45,46,47,48,49,50,51,52,53,54].

II. Detecting the functional problems that appear under certain execution conditions, including workload and resource configurations, under both expected load and stress conditions [55,56,57,58,59,60].

III. Detecting violations of non-functional requirements, such as performance, reliability, and robustness requirements, under expected and stress conditions [29,50,61,62,63].

Stress testing scenarios that verify robustness and find the breaking points of the system usually follow objectives II and III. Different approaches have been proposed to design fault-inducing test conditions.
For example, analyzing source code and system behavior are the main general approaches used in the literature for designing stress test conditions based on fault-inducing workloads. Deriving loads using data-flow analysis and symbolic execution are examples of techniques for designing fault-inducing loads based on source code analysis to detect functional problems (such as memory leaks) and performance problems (such as exceeded response times) [28,58]. Linear programming and genetic algorithms are techniques used for designing loads based on system behavior analysis to detect performance problems [29,30,31]. Our proposed framework for autonomous stress testing uses reinforcement learning to learn how to generate stress test conditions for finding the performance breaking points of software systems. It is intended to reduce the dependency on analytical analysis and performance models for generating stress test conditions. Through learning, it strives to improve efficiency by optimizing the convergence rate and reusing the learnt policy on similar cases. In this regard, fuzzy state modeling and the adaptive action selection strategy aid adaptability and enhance efficacy by accounting for uncertainty.

Conclusion
Performance analysis, with the aim of estimating performance metrics and detecting functional problems or non-functional violations, remains a challenge, especially for complex systems. Model-driven techniques are common approaches for conducting performance analysis, particularly for measuring performance metrics. Nonetheless, generating a precise model of the performance behavior of a software system, together with the application-, platform-, and workload-related affecting factors, is cumbersome, particularly for complex systems, and many details, such as implementation and deployment details, are ignored during the modeling. Performance testing is another common family of techniques for performance analysis, specifically for detecting violations of non-functional requirements and functional problems emerging under certain conditions. Source code analysis, evolutionary algorithms based on system models, and use-case-based design are some general approaches for generating stress test conditions. In this paper, we proposed a model-free reinforcement learning-based stress test condition generator to find the performance breaking points of different software systems. In particular, we presented a framework describing the architecture and operational procedure of a smart tester agent able to learn the optimal policy for generating stress test conditions for various software programs. We used Q-learning accompanied by fuzzy state modeling and an adaptive action selection strategy to provide a self-adaptive autonomous tester that can learn, reuse its learnt policy, and adapt its strategy to the observed cases. The proposed smart tester assumes two learning phases, i.e., initial and transfer learning. The optimal policy is learnt during the initial learning.
After the initial learning, the tester reuses the learnt policy, while still keeping the learning running, during its interactions with further SUT instances in the transfer learning. Automation and improved efficiency, i.e., less time and cost required for generating the test conditions, are the main achievements of using learning-based approaches in this field. We evaluated the efficacy of the proposed approach using application benchmarks and a simulation environment for simulating the performance behavior of software programs. The results of the efficacy analysis in our experimental evaluation indicate that software variants in software product lines and evolving software programs in CI/CD would be well-suited application areas for this approach.
Extending the proposed approach to support other external affecting factors, i.e., workload-related factors, and adding the associated actions to provide stress testing for client-server-based software systems will be the next steps of this research. A virtualized testbed infrastructure would also serve as the platform for the evaluation experiments in those next steps.