Learning and statistical model checking of system response times
Abstract
Since computers have become increasingly powerful, users are less willing to accept slow system responses. Hence, performance testing is important for interactive systems. However, it is still challenging to test whether a system provides acceptable performance or can satisfy certain response-time limits, especially for different usage scenarios. On the one hand, there are performance-testing techniques that require numerous costly tests of the system. On the other hand, model-based performance-analysis methods suffer from doubtful model quality. Hence, we propose a combined method to mitigate these issues. We learn response-time distributions from test data in order to augment existing behavioral models with timing aspects. Then, we perform statistical model checking with the resulting model for a performance prediction. Finally, we test the accuracy of our prediction with hypothesis testing of the real system. Our method is implemented with a property-based testing tool with integrated statistical-model-checking algorithms. We demonstrate the feasibility of our techniques in an industrial case study with a web-service application.
Keywords
Statistical model checking · Property-based testing · Model-based testing · FsCheck · User profiles · Response time · Cost learning · Performance testing

1 Introduction
Performance testing is important, especially for critical systems. It is usually done with sophisticated load-testing techniques that are computationally expensive and even infeasible when various user populations should be analyzed. Alternatively, the performance may be analyzed by simulating a model of the system. Simulation allows faster analysis and requires fewer computing resources, but the quality of the model is often questionable. We present a simulation method based on statistical model checking (SMC) that enables a fast probability estimation with a model and also a verification of the resulting probabilities on the real system.
SMC is a simulation method that can answer both quantitative and qualitative questions. The questions are expressed as properties of a stochastic model which are checked by analyzing simulations of this model. Depending on the SMC algorithm, either a fixed number of samples or a stopping criterion is needed.
We implement our method with the help of a property-based test-case generator that is originally intended for functional testing. Property-based testing (PBT) is a random-testing technique that tries to falsify a given property, which describes the expected behavior of a function-under-test. In order to test such a property, a PBT tool generates inputs for the function and checks if the expected behavior is observed. PBT tools were originally designed for testing algebraic properties of functional programs, but nowadays, they also support model-based testing.
(1) First, we apply a PBT tool to run model-based testing (MBT) with a functional model concurrently in several threads in order to obtain log files that include the response times of the tested web-service requests. Since the model serves as an oracle, we also test for conformance violations in this phase. This functional aspect was discussed in earlier work (Aichernig and Schumi 2016a); here, the focus is on timing.
(2) Next, we derive response-time distributions per type of service request via linear regression, which proved a suitable learning method for our logs. Since the response time is influenced by the parallel activity on the server, the distributions are parametrized by the number of active users.
(3) These cost distributions are added to the transitions in the functional model, resulting in so-called cost models. These models have the semantics of stochastic timed automata (STA) (Ballarini et al. 2013). The name cost model shall emphasize that our method may be generalized to other types of cost indicators, e.g., energy consumption. We also combine these models with user profiles, containing probabilities for transitions and input durations, in order to simulate realistic user behavior and the expected response time.
(4) These combined models can be utilized for SMC, in order to evaluate response-time properties, like "What is the probability that the response time of each user within a user population is under a certain threshold?" or "Is this probability above or below a specific limit?". We apply them for a Monte Carlo simulation, in order to estimate the probability of such properties.
(5) Additionally, we can check such properties directly on the SUT, e.g., to verify the results of the model simulation. In principle, it is also possible to skip the model simulation and (statistically) test response-time properties directly on the SUT. However, running a realistic user population on the SUT is time-consuming and might not be feasible due to very long waiting times. A simulation on the model is much faster. Therefore, also properties that require a larger number of samples can be checked, e.g., using Monte Carlo simulation. We run the SUT only with a limited number of samples in order to check whether the simulation results of the model are satisfied by the SUT. For this, we test the SUT with the sequential probability ratio test (Wald 1973), a form of hypothesis testing, as this allows us to stop testing as soon as we have sufficient evidence.
Related work
A number of related approaches in the area of PBT are concerned with testing concurrent software. For example, Claessen et al. (2009) presented a testing method that can find race conditions in Erlang with QuickCheck and a user-level scheduler called PULSE. A similar approach was shown by Norell et al. (2013). They demonstrated an automated way to test blocking operations, i.e., operations that have to wait until a certain condition is met. Another concurrent PBT approach by Hughes et al. (2016) showed how PBT can be applied to test distributed file-synchronisation services, like Dropbox. The closest related work we found in the PBT community is from Arts (2014). He shows a load-testing approach with QuickCheck that can run user scenarios on an SUT in order to determine the maximum supported number of users. In contrast to our approach, Arts does not consider stochastic user profiles and model-based simulation.
There exist various tools for performance testing and load generation (Vinayak Hegde 2014; Rina and Tyagi 2013), which are related to our approach, since they also support the simulation of user populations. For example, Neoload^{1} is a performance-testing and measurement tool for mobile and web applications that can simulate user populations. A similar open-source tool is Apache JMeter (Halili 2008). Initially, it was only built for websites, but recently it has come to support other application areas as well. Another tool called LoadRunner (Jinyuan 2012) from HP supports the simulation of thousands of users, and it works for various software platforms, like .NET or Java.
The most closely related performance-testing approaches are mainly in the area of load or stress testing. For example, Menascé (2002) presented a load-testing approach for web sites that works with user-interaction scripts to simulate the user behavior. Another load-testing method was introduced by Draheim et al. (2006). They showed the simulation of realistic user behavior with stochastic models and workload models in order to estimate the performance of web applications. A related stress-testing approach was presented by Krishnamurthy et al. (2006). The work shows a synthetic workload-generation technique that is based on request logs and should mimic real user behavior. In contrast to our work, classical performance or load testing is mostly performed directly on an SUT. With our approach, we want to simulate user populations on the model level as well.
Related work can also be found in the area of performance engineering or model-based simulation methods (Smith 1990; Woodside et al. 2007; Pooley and King 1999; Book et al. 2005), i.e., various approaches apply a model to predict performance. For example, Becker et al. (2009) presented a prediction method with Palladio component models for the evaluation of component-based software architectures. With their method, they predicted response times of an online music repository for concurrent system usage. Moreover, they compared their prediction against measurements from a real system. Lu et al. (2012) demonstrated a statistical response-time analysis. Their approach takes response-time samples for the construction of a statistical model that is applied to derive upper bounds for response-time estimates. Most of these approaches only apply a model-based analysis and do not present an automated technique for the evaluation of their model on an SUT. In contrast, with our method we can perform a model-based prediction, and we can also check the accuracy of our predictions by directly testing an SUT within the same tool.
There are also some approaches or tools that can do both, a simulation with a model and testing an SUT. Balsamo et al. (2004) gave an overview of various model-based performance-prediction approaches and tools, and they mention tools like SPE∙ED (Smith and Williams 1997), which works with message sequence charts and supports a model-based simulation as well as an evaluation of object-oriented systems. A disadvantage of such approaches is that they still require much manual effort, e.g., performance data is often only defined manually. In contrast, we also include an automated approach for response-time learning with linear regression. Moreover, we can exploit PBT features, because our approach is realised within a PBT tool.
The most closely related tool is UPPAAL SMC (Bulychev et al. 2012). Similar to our approach, it provides SMC of priced timed automata, which can simulate user populations. It also supports testing real implementations, but for this, a test adapter needs to be implemented, which, e.g., handles form-data creation. With our method, we can use PBT features, like generators, in order to automatically generate form data, and we can model in a programming language. This helps testers who are already familiar with this language, as they do not have to learn new notations.
To the best of our knowledge, our work is novel: (1) no other work applies PBT for evaluating stochastic properties about the response time of both real systems and stochastic models; (2) no other work performs cost learning on behavior models using linear regression. Grinchtein (2008) learns time-deterministic event-recording automata via active automata learning. Verwer et al. (2010) passively learn probabilistic real-time automata. In contrast, we learn cost distributions and add them to existing automata models for SMC.
Contribution
This article is an extended version of an ICTSS conference paper (Schumi et al. 2017). Compared to this previous work, we present the following new contributions:
(i) A major new contribution is an extensive description of our cost-model learning approach. The conference paper gave a brief overview only. Here, we focus more on the learning process. We highlight various learning steps, like data cleaning and feature engineering, which are key to obtaining a good model. (ii) Another new contribution is the extension of our method to enable an assessment of the predictive power of our learned model. In our previous work, we only checked if the SUT is at least as good as the model in responding within a certain time limit. Now, we also check if the real probability of the SUT is close to the predicted probability of the model by utilising a two-sided hypothesis test. (iii) Moreover, we present an additional industrial case study. We evaluate our approach by applying it to an updated and extended version of the SUT we have used previously.
In relation to our presented process of Fig. 1, our main contributions in this work are located in the cost-model learning and hypothesis-testing phases. However, we also optimized the presentation of all steps of our approach, and we provide a more detailed evaluation.
Structure
First, Section 2 introduces our SUT and the necessary background regarding SMC, PBT, and their combination. Next, Section 3 explains how we perform cost-model learning. In Section 4, we present our method with an example. In Section 5, we give more details about the process and implementation. Section 6 presents an evaluation based on an industrial web-service application. In Section 7, we discuss limitations of our method, and finally, we draw our conclusions in Section 8.
2 Background
2.1 System under test
Our approach was evaluated on a web-service application provided by our industrial project partner AVL.^{2} The application originates from the automotive domain and is called Testfactory Management Suite (TFMS).^{3} It is a workflow tool that supports the process of instrumenting and testing automotive power trains, a core business of AVL. TFMS captures test-bed data, activities, resources, and workflows. A variety of activities can be realized with the system, like test definition, planning, preparation, execution, data management, and analysis (Aichernig and Schumi 2017b).
The application is intended for various kinds of automotive test beds for car components, like engines, gears, power trains, batteries for electric cars, or entire cars. For instance, to test an engine, it is mounted to a pallet and different test equipment is attached. The selection of the test equipment depends on the specific use case. Typical test equipment for an engine might be a measurement device for the power output or the fuel consumption. After a pallet is configured, it is moved to the test bed, where all devices are connected and a test is performed. TFMS manages all steps and needed devices of such a workflow, which is also called a test order. It allows the scheduling of car components that need to be tested, the selection of required test equipment, the definition of the needed wiring for the equipment, and the planning of the sequence of all tasks at a test bed. Moreover, customer-specific requirements, like additional management steps or custom restrictions, can be freely configured via business rules.
The system consists of multiple modules corresponding to the mentioned clients. Modules can be seen as groups of functionality, and they consist of multiple business-rule models which describe what tasks can be performed by a user and what they look like, e.g., what data can be modified. Only one business-rule model can be active at a time, and it determines which forms can be opened in the current state of the system (Aichernig and Schumi 2017b).
A business-rule model is a state machine defining the behavior of the business objects, so-called TFMS objects. A TFMS object class describes objects of our application domain, like test equipment or test orders. Each object has a state, an identifier, and attribute values/data, and is stored in the database of our SUT. TFMS works task-based. Tasks represent the behavior, i.e., the actions or events a user may trigger, e.g., creating or editing TFMS objects. An example business-rule model and a description of tasks and subtasks are presented in Section 4.1.
TFMS is critical software, because it is essential to efficiently operate test beds. It is deployed to various customers, where it runs under different hardware and network settings. Moreover, it is applied in several application fields and under varying usage conditions, i.e., with several users and different user types. It is important for AVL that the system stays fast enough even for high numbers of concurrent users. Hence, in this work, we investigate the performance of TFMS for various usage scenarios.
2.2 Statistical model checking
Statistical model checking (SMC) is a verification method that evaluates certain properties of a stochastic model. These properties are usually defined with (temporal) logics, and we may ask quantitative and qualitative questions about their satisfaction. For example: what is the probability that the model satisfies a property, or is the probability that the model satisfies a property above or below a certain threshold? In order to answer such questions, a statistical model checker produces samples, i.e., random walks on the stochastic model, and checks whether the property holds for these samples. Various SMC algorithms are applied to compute the total number of samples needed in order to find an answer for a specific question, or to compute a stopping criterion. A stopping criterion determines when we can stop sampling, because we have found an answer with the required certainty. In this work, we focus on the following algorithms, which are commonly used in the SMC literature (Legay et al. 2010; Legay and Sedwards 2014).
Monte Carlo simulation with Chernoff-Hoeffding bound
The n simulations represent Bernoulli random variables X_{1},…,X_{n} with outcome x_{i} = 1 if the property holds for the i-th simulation run and x_{i} = 0 otherwise. Let the estimated probability be \(\bar {\gamma }_{n} = ({\sum }_{i = 1}^{n} x_{i}) / n\); then the probability that the estimation error is below 𝜖 is greater than our required confidence. Formally, we have: \(Pr(|\bar {\gamma }_{n} - \gamma| \le \epsilon ) \ge 1 - \delta \). After the calculation of the number of samples n (1), a simple Monte Carlo simulation is performed (Legay and Sedwards 2014).
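This sampling scheme can be sketched as follows (a minimal sketch using the standard Chernoff-Hoeffding sample-size formula n = ⌈ln(2/δ)/(2𝜖²)⌉; the exponential response-time model and the 500 ms threshold are illustrative assumptions, not taken from this paper):

```python
import math
import random

def chernoff_samples(epsilon, delta):
    # Standard Chernoff-Hoeffding bound: n >= ln(2/delta) / (2 * epsilon^2)
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def monte_carlo(property_holds, epsilon, delta, rng):
    # Estimate gamma = Pr(property) with error <= epsilon at confidence 1 - delta
    n = chernoff_samples(epsilon, delta)
    return sum(1 for _ in range(n) if property_holds(rng)) / n

# Illustrative model: exponentially distributed response time with mean 200 ms;
# property under test: "the response time is below 500 ms".
rng = random.Random(42)
gamma = monte_carlo(lambda r: r.expovariate(1 / 200.0) < 500.0, 0.05, 0.01, rng)
```

With 𝜖 = 0.05 and δ = 0.01, this requires 1060 samples; the true probability of the toy property is 1 − e^{−2.5} ≈ 0.92, and the estimate is guaranteed to lie within 𝜖 of it with probability at least 0.99.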
Sequential probability ratio test
In this work, we form a hypothesis about the expected response time with the Monte Carlo method on the model. Then, we check with the sequential probability ratio test (SPRT) whether this hypothesis holds on the SUT. This is faster than running Monte Carlo directly on the SUT.
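Wald's test can be sketched as follows (a minimal sketch; p0 and p1 delimit a hypothetical indifference region around the predicted probability, and the decision boundaries use Wald's standard approximations):

```python
import math

def sprt(sample, p0, p1, alpha=0.05, beta=0.05, max_samples=100000):
    """Wald's sequential probability ratio test for a Bernoulli parameter p:
    decide H0: p >= p0 against H1: p <= p1, where p0 > p1."""
    accept_h0 = math.log(beta / (1 - alpha))   # lower boundary (log scale)
    accept_h1 = math.log((1 - beta) / alpha)   # upper boundary
    llr = 0.0                                  # cumulative log-likelihood ratio
    for _ in range(max_samples):
        if sample():                           # property held in this test run
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr <= accept_h0:
            return "H0"                        # sufficient evidence for p >= p0
        if llr >= accept_h1:
            return "H1"                        # sufficient evidence for p <= p1
    return "inconclusive"
```

The test stops as soon as the accumulated evidence crosses one of the two boundaries, which is what makes the SPRT cheaper than a fixed-size Monte Carlo run on the SUT.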
2.3 Propertybased testing
Property-based testing (PBT) is a random-testing technique that aims to check the correctness of properties. A property is a high-level specification of the expected behavior of a function-under-test that should always hold.
PBT also supports MBT. Models encoded as extended finite state machines (EFSMs) (Kalaji et al. 2009) can serve as a source for state-machine properties. An EFSM is a 6-tuple (S,s_{0},V,I,O,T): S is a finite set of states, s_{0} ∈ S is the initial state, V is a finite set of variables, I is a finite set of inputs, O is a finite set of outputs, and T is a finite set of transitions. A transition t ∈ T can be described as a 5-tuple (s_{s},i,g,op,s_{t}): s_{s} is the source state, i is an input, g is a guard, op is a sequence of output and assignment operations, and s_{t} is the target state (Kalaji et al. 2009). In order to derive a state-machine property from an EFSM, we have to write a specification comprising the initial state, commands, and a generator for the next transition given the current state of the model. Commands encapsulate (1) preconditions that define the permitted transition sequences, (2) postconditions that specify the expected behavior, and (3) the execution semantics of transitions for the model and the SUT. A state-machine property states that for all permitted transition sequences, the postcondition must hold after the execution of each transition, i.e., command (Hughes 2007; Papadakis and Sagonas 2011). Formally, we define such a property as follows:
A PBT tool generates random sequences of commands in order to test this property. For generating the input data, it is possible to define custom data generators for each type of input. Put simply, a generator Gen[A] is defined for a type A and provides a function sample : A that returns an instance of this type. For complex data, default and custom generators can be freely combined and even nested. This makes PBT an ideal candidate to test web services with complex input forms.
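The idea of generators and state-machine properties can be sketched in a few lines (a hand-rolled sketch with hypothetical names; the paper's implementation uses FsCheck's C# API instead):

```python
import random

def gen_int(lo, hi):
    # a generator for type int: a function that samples an instance
    return lambda rng: rng.randint(lo, hi)

def gen_form(rng):
    # nested generators producing a complex input form
    return {"name": "obj%d" % gen_int(0, 999)(rng), "size": gen_int(1, 4096)(rng)}

def state_machine_property(model, sut, commands, length, rng):
    """Random walk over the model: choose an enabled command, generate input
    data, execute it on both model and SUT, and check the postcondition."""
    for _ in range(length):
        enabled = [c for c in commands if c.pre(model)]
        cmd = rng.choice(enabled)
        data = cmd.gen(rng)
        cmd.run_model(model, data)
        cmd.run_sut(sut, data)
        if not cmd.post(model, sut):    # expected behavior violated
            return False                # property falsified
    return True
```

A PBT tool repeats such runs with many random command sequences and shrinks any falsifying sequence to a minimal counterexample.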
The first PBT implementation was QuickCheck for Haskell (Claessen and Hughes 2000). Numerous reimplementations followed for other programming languages, like Hypothesis^{4} for Python or ScalaCheck (Nilsson 2014). We build upon our previous work (Aichernig and Schumi 2016a) and demonstrate our approach with FsCheck.^{5} It is a .NET port of QuickCheck influenced by ScalaCheck. It supports property definitions in both a functional programming style with F# and an object-oriented style with C#. We work with C#, as it is the programming language of our SUT.
2.4 Integration of SMC into PBT
Our properties perform an SMC algorithm by utilizing the PBT tool as a simulation environment and return either a quantitative or a qualitative result, depending on the algorithm. For example, a state-machine property can be applied for a statistical conformance analysis by comparing an ideal model to a faulty stochastic system. Additionally, it can also simulate a stochastic model.
3 Cost-model learning
The response-time distributions to be added to the transitions in the functional model are a key part of the method presented in this paper. How can we derive such distributions? Implementing a classical rule-based algorithm is not feasible, since appropriate if-then-else rules with the associated conditional expressions and calculation formulas for the distribution parameters are hard to define a priori in our context. However, we recognize that we have all the necessary ingredients for a data-driven learning approach, more precisely, for supervised learning with regression. We have log files with a large number of request examples (instances) for which the response times (labels) are also known. For each request example, the log file specifies the values of a number of attributes (features) related to the request. Our regression task is to learn, from the (labeled) data given by the log files, a function which, given the attribute values of a request instance, returns the parameters (μ, σ) of a normal distribution for the response time of that instance.
As we will see in Section 4.2, it turns out that the response times can be fairly well approximated by a linear combination of the request attributes using the linear regression method. This comes in handy since (i) the statistical properties of the resulting estimators, i.e., the weights of the request attributes, are easier to determine with linear regression than with other learning algorithms, and (ii) we can use these statistical properties to derive the normal-distribution parameters of the response times.
Multiple linear regression
Given a log file with N examples of requests and their response times, y is the N × 1 vector of the response times and X is the N × p design matrix for p request attributes considered to linearly influence the response time, where y_{i} is the response time and X_{i,1}, ..., X_{i,p} are the attributes of the i-th request example in the log file.
We can use y and X with the regression model (4) to estimate the model parameters β that minimize the error term 𝜖. Note that 𝜖 is an N × 1 vector, and there are various ways to define what "minimize 𝜖" means. The simplest and most common method is ordinary least squares (OLS), which minimizes the sum of the squares \({\epsilon _{i}^{2}}\), i = 1,...,N.
The normal distribution \(\mathcal {N}(\mu _{y}, {\sigma _{y}^{2}})\) with parameters given by (7) is exactly what we are looking for. Thus, given a log file of request examples with corresponding response times, we learn the parameters \((\mu _{\beta _{k}}, \sigma _{\beta _{k}})\) of a model (7) which gives the normal distribution \(\mathcal {N}(\mu _{y}, {\sigma _{y}^{2}})\) of the response time y for any new request with known attributes [x_{1},...,x_{p}], to be associated with the behavioral model as needed.
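For intuition, the derivation can be reduced to the single-feature case (a simplified sketch: OLS on one attribute, with the residual variance standing in for \(\sigma _{y}^{2}\); the paper's multiple-regression model (7) generalizes this to p attributes):

```python
import math

def ols_fit(xs, ys):
    # Fit y = b0 + b1 * x by ordinary least squares.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sigma2 = sum(e * e for e in residuals) / (n - 2)  # residual variance
    return b0, b1, sigma2

def predict_distribution(b0, b1, sigma2, x):
    # Normal distribution N(mu_y, sigma_y^2) of the response time for a
    # new request with attribute value x.
    return b0 + b1 * x, math.sqrt(sigma2)
```

Given a fitted model, each new request attribute vector yields a full response-time distribution rather than a point prediction, as required by the SMC approach.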
4 Method
This section shows how we derive cost models from logs and how we can apply these models to simulate stochastic user profiles.
4.1 Log-data generation
Each task consists of subtasks, e.g., for setting attributes or for opening a screen. The subtasks of one task can be seen in the center of Fig. 4. Many subtasks require server interaction. Therefore, they can also be seen as requests.
Based on these functional models, we can perform conventional PBT, which generates random sequences of commands with form data (attributes) in order to check properties that test the functionality of the SUT. We test state-machine properties (3) as explained in Section 2.3, i.e., we perform random walks on the model and check for output and variable equivalence of the model and the SUT in the postconditions. In our case, the output is just the current state after the execution of a command, and since we encode the form data in variables, we test whether the variable values are correctly transferred to the SUT and that there is no problem with the database. While these properties are tested, a log is created that captures the response times (costs) of the individual requests. The properties are checked concurrently on the SUT in order to obtain response times of multiple simultaneous requests, which represents the behavior of multiple active users.
Example log data of the Business Process Template model:

Task         Subtask      #ActiveUsers  Attribute    #Attributes  ObjSize  CumulativeObjSize  Response time [ms]
Create       StartTask    5             –            –            –        7115966            21
Create       SetRefAttr   4             Responsible  –            0        7119938            17
ChangeState  StartTask    2             –            –            0        7119938            22
Create       Commit       5             –            5            3985     7123923            31
ChangeState  Commit       4             –            2            3372     7181842            25
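Such log rows can be produced by wrapping each sub-task request in a timer; a minimal sketch (hypothetical helper, logging only a subset of the columns above):

```python
import csv
import time

def timed_request(log_writer, task, subtask, active_users, send_request):
    """Execute one sub-task request against the SUT and append a log row
    with its wall-clock response time in milliseconds."""
    start = time.perf_counter()
    send_request()                                    # the actual web-service call
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    log_writer.writerow([task, subtask, active_users, round(elapsed_ms)])
```

Running this wrapper concurrently in several threads yields response times under simultaneous load, as described above.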
Note that with PBT we can freely choose the number of test cases and the length of each test case. This allows us to control the size of the generated data, which is helpful for our learning method, since we need to try out different data sizes.
4.2 Learning from the logdata
Learning a model from the log files is a data-driven approach; hence, the quality and the accuracy of the log data are of crucial importance. Unfortunately, it is not possible to obtain log files from the usage of the web-service application (SUT) by real-world users, directly logged by the SUT, for various reasons: (i) it is hard to get permission from SUT customers to use their real-world data from production environments (confidentiality reasons), (ii) the time needed for the log-file generation depends on the usage frequency of the real-world users and is generally long, (iii) logging of critically important attributes, e.g., the database fill level, is not acceptable due to its invasive character (frequent access to database registers through the logging client might negatively impact the SUT performance). However, we overcome this problem by using a user-simulation tool, as shown in Section 4.1.
The advantages of the user-simulation approach are threefold: (i) we can run the user-simulation tool on demand, whenever and with whatever environment setup we need, (ii) we can simulate users with different usage profiles (e.g., some users type faster, hence sending requests to the SUT with a higher frequency), (iii) we are flexible to record any required attribute, through simulation if necessary, without SUT (re)coding, which would be critical due to release-cycle constraints and a possibly invasive character. For instance, the database fill level might be simulated on the tool side by maintaining an internal variable CumulativeObjSize which is incremented, resp. decremented, every time a request is storing to, resp. removing from, the SUT database. The step size used is the object's total size ObjSize.
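Maintaining such a simulated attribute is straightforward; a sketch (hypothetical class, mirroring the increment/decrement scheme just described):

```python
class FillLevelTracker:
    """Client-side approximation of the SUT database fill level
    (CumulativeObjSize), updated on every storing/removing request."""

    def __init__(self, initial=0):
        self.cumulative_obj_size = initial

    def on_store(self, obj_size):
        # a request stored objects of total size obj_size in the database
        self.cumulative_obj_size += obj_size

    def on_remove(self, obj_size):
        # a request removed objects of total size obj_size from the database
        self.cumulative_obj_size -= obj_size
```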
A disadvantage of the user-simulation approach is, however, that we might generate biased log files, e.g., by selecting some parameter setups much more often than others, if we do not carefully define the experimental setup. Moreover, measurement errors might be introduced due to (i) network latency (the simulation tool is not run on the SUT machine in order not to impact the SUT performance), (ii) approximations of simulated attributes, e.g., CumulativeObjSize, or (iii) tool execution-time overhead. However, we keep these errors low, and consider them irrelevant, by (i) running the tool on the local network as close to the SUT as possible, (ii) consistently updating the internal variables for the simulated attributes, and (iii) implementing the simulation tool with real-time requirements in mind, in addition to allocating sufficient hardware resources, respectively.
The learned cost model has to satisfy the following specific requirements:

R.1 Distributions. The predictions for response times are required as distributions \(\mathcal {N}(\mu _{y}, {\sigma _{y}^{2}})\) and not as single values.
R.2 Real-time. Compute "good enough" predictions in real time (ca. 1 ms).
R.3 Portability. The prediction model is generated in the programming language Python but needs to be embedded externally into the overall tool (C# code).
Requirement R.1 is due to the SMC approach followed in this work. R.2 is necessary since the simulation of the SUT is done in virtual time, i.e., a fraction of the actual time, in order to simulate thousands of requests within hours, which would otherwise take days on the SUT. R.3 is a consequence of R.2. More precisely, we can further improve the prediction speed if the algorithm which computes the prediction is natively implemented directly in the module which needs the prediction, and not in an external module which would cost additional (precious) time to invoke. As we will see below, these specific model requirements affect some choices we have to make during the model-learning process.
For learning the cost model described in Section 3, the journey starts with defining the experimental setup. Corresponding log files, one for each simulated user, are then generated as explained in Section 4.1. The log files contain many examples of requests, typically up to two million examples altogether. For each request, the log file specifies a number of related attributes including, importantly, the response time, i.e., the time needed by the SUT to process the request and send back a reply. The journey for learning the cost model continues through several phases, as described below.
Data cleaning and preprocessing

- Sporadic requests with missing values for mandatory attributes are removed.
- Non-mandatory request attributes with missing values are filled with default values. E.g., Attribute is not required to have a value for every request type; thus, we fill missing values of Attribute with the default value "NOTSET" as a further category.
- Request examples with an error message received back from the SUT are removed, since our goal is to test properties of the SUT under normal usage and network conditions.
- The request examples generated during the first five minutes of the logging phase are removed, since during this time various system-initialization tasks (e.g., database user authentication, network-connection setup, etc.) are performed. These initialization tasks affect the response times of the system in an atypical way and should not be considered for learning (see Fig. 5).
- Sporadic request examples with unusually long response times, e.g., due to some temporary network problems, are considered outliers and removed, for instance the data point in Fig. 5 around time 01:20.
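These cleaning steps can be sketched as a filter over the parsed log rows (hypothetical field names; the fixed outlier threshold is an illustrative assumption, whereas the paper identifies outliers from the data):

```python
WARMUP_SECONDS = 5 * 60      # drop the system-initialization phase
OUTLIER_MS = 10_000          # assumed outlier threshold (illustrative)

def clean(rows):
    cleaned = []
    for row in rows:
        if row.get("Task") is None or row.get("Subtask") is None:
            continue                         # mandatory attribute missing
        if row.get("Error"):
            continue                         # SUT replied with an error message
        if row["Timestamp"] < WARMUP_SECONDS:
            continue                         # first five minutes of logging
        if row["ResponseTimeMs"] > OUTLIER_MS:
            continue                         # sporadic outlier
        row = dict(row)
        if row.get("Attribute") is None:
            row["Attribute"] = "NOTSET"      # default category for missing values
        cleaned.append(row)
    return cleaned
```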
Feature selection and engineering
Relevant attributes (features) are selected which are believed to influence the response time of the SUT. Again, descriptive statistics and visualizations of the data from the log files, together with good knowledge about how the SUT works and is built, are key factors for identifying relevant features. The better we understand the data, the better and more accurate the models that we can build (Tang et al. 2014). For instance, Fig. 5 suggests that the response time depends on some variable changing over time, which, excluding the possibility of any memory leak, is very likely the database fill level of the SUT.

Using the concatenation Task_Subtask as a feature instead of two separate features Task and Subtask would improve the model performance by ca. 9%. This seems reasonable since different task types have subtasks with the same name but different effects, depending on the corresponding task type. Using Subtask as a standalone feature would introduce some noise for the learning process. This noise accounts for being less precise in capturing the different effects of a subtask when the subtask name is associated with different task types.
For instance, both tasks 'Create' and 'AdminEdit' have a subtask called 'Commit'. However, the subtask 'Commit' associated with the task 'Create' typically implies more objects being stored in the SUT database, hence it is more “costly” than 'Commit' associated with 'AdminEdit'.

Using as a feature the product of CumulativeObjSize and a Boolean variable (True = 1, False = 0) indicating whether a request requires SUT database access, instead of using CumulativeObjSize alone for all requests, further improves the model performance by ca. 10%. This also seems reasonable: using CumulativeObjSize for all requests during learning, regardless of whether a request requires database access, would clearly introduce noise by giving weight to a property (the database fill level) even when the response time of a request does not depend on it.
Eventually, we define a list of features including both raw features, i.e., request attributes as recorded in the logfiles, and engineered features, i.e., combinations of raw features. The values of the features in the logfiles might be integers, floats, or strings. Note that missing values do not occur at this time in the data as they have been previously resolved during the data cleaning and preprocessing phase. That is, they have been either filled with default values or removed together with the corresponding rows.
A feature whose values are strings or ordinal numbers encoding some categories is called a categorical feature. The possible values of a categorical feature are the categories of that feature. Since the multiple linear regression algorithm that we are going to use next can only handle numerical features, we need to transform categorical features into equivalent numerical features.
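A common way to perform this transformation is one-hot (dummy) encoding, which turns each category into a binary feature. The sketch below uses pandas with hypothetical feature values:

```python
import pandas as pd

# Hypothetical request examples: Task_Subtask is categorical, Users is numeric.
df = pd.DataFrame({
    "Task_Subtask": ["Create_Commit", "AdminEdit_Commit", "Create_Commit"],
    "Users": [5, 25, 45],
})
# One-hot (dummy) encoding turns each category into a binary column; drop_first
# removes one redundant column, since the regression intercept already covers
# the base category.
encoded = pd.get_dummies(df, columns=["Task_Subtask"], drop_first=True)
```

After encoding, all columns are numeric and can be fed directly to a multiple linear regression.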
We selected the multiple linear regression algorithm (see Section 3) to learn the prediction model for response time distributions from the logfiles. This is a good match for the specific model requirements R.1−3 highlighted above. Moreover, the learned model makes good enough predictions for the purpose of the work presented in this paper (see Section 6).
We use the programming language Python 2.7 with the scikit-learn 0.19.1 machine learning package for model prototyping, e.g., testing various combinations of features, estimating the accuracy of an algorithm using k-fold cross-validation, etc. Once we identified the set of features with the highest predictive power, we use the StatsModels 0.8.0 package to generate a deployment model candidate from the entire dataset.
Scikit-learn follows the machine learning tradition, where the main supported task is choosing the best model for prediction. That is, the emphasis in the supporting infrastructure of scikit-learn is on model selection for best predictions of new, previously unseen samples, with cross-validation on test data.
StatsModels follows the statistics tradition, where we want to know how well a model fits the data, which features explain or affect the labeled variable, or what the size of the effect is. That is, the emphasis in the supporting infrastructure of StatsModels is on analyzing the training data (hypothesis tests) and deriving complex statistical properties of the resulting estimators, e.g., standard errors, p values, etc. This is the key distinction between StatsModels and scikit-learn. Thus, while there is a lot of overlap, e.g., StatsModels also does prediction, it is easier to use the cross-validation support of scikit-learn for prediction, whereas it is easier to use the statistics support of StatsModels for generating the parameters \((\mu _{\beta _{k}}, \sigma _{\beta _{k}})\) required by the model (7) from Section 3.
Model predictive power evaluation
The dataset used to train a machine learning algorithm is called the training dataset. It cannot be used to give reliable estimates of the accuracy of the model on new data, which is a problem because the whole point of creating the model is to make predictions on new data. Therefore, we use statistical methods called resampling methods to split the dataset into subsets; some are used to train the model and others are held back and used to estimate the accuracy of the model on unseen data. That is, we split the input dataset into training and test sets; we train the model on the training set and estimate its accuracy on the test set (Hastie et al. 2009).
To reduce the possible effect that the model performs well just by chance with a selected train/test split, we estimate the accuracy of the modeling algorithm using k-fold cross-validation. More precisely, we randomly partition the input dataset into k equal-sized subsets. Of the k subsets, we hold back a single subset as test set and use the remaining k−1 subsets as training set. We repeat the cross-validation process k times, with each of the k subsets used as test set exactly once. If we obtain comparable results each time, we conclude that it is unlikely that they are due to chance. Typical values for k are 5 or 10, but in general k remains an unfixed parameter that depends on the size of the input dataset.
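The k-fold procedure can be sketched as follows; this is an illustrative NumPy implementation with a plain least-squares fit and an R² score per fold, not the scikit-learn pipeline used in the paper:

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Estimate model accuracy with k-fold cross-validation.

    Each of the k subsets is held back as test set exactly once;
    the model is fit on the remaining k-1 subsets.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # random partition of the dataset
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Ordinary least squares fit on the k-1 training folds (with intercept).
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        # Score on the held-back test fold.
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        pred = Xte @ beta
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)  # R^2 on unseen data
    return scores
```

Comparable scores across all k folds indicate that the result is unlikely to be due to a lucky train/test split.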
To ensure better model stability and robustness, it is generally recommended to apply feature scaling, e.g., normalization or standardization, to the data before training an MLR algorithm on it. Recall that normalization rescales the range of features to [0, 1] or [−1, 1], and standardization makes the values of each feature in the data have zero mean and unit variance. However, whether the model benefits from feature scaling strongly depends on the data. Therefore, in order to verify whether standardizing or normalizing the data makes a difference in our case, we perform the model evaluation in turn on the raw, normalized, and standardized data, and compare the results.
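The two scaling transforms mentioned above are one-liners; a minimal NumPy sketch:

```python
import numpy as np

def normalize(x):
    """Rescale a feature to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Rescale a feature to zero mean and unit variance."""
    return (x - x.mean()) / x.std()
```

Note that whichever transform is chosen at training time would also have to be applied, with the same parameters, to every sample scored at runtime.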
We use the LinearRegression class from the scikit-learn package to train and test an MLR model with the k-fold cross-validation technique, e.g., k = 5. We generally obtain R² scores in the range [0.75, 0.95], depending on the complexity of the SUT property to test, which determines the experimental setup generating the logfiles. We obtain similar R² scores on both training and test sets at all iterations of the k-fold cross-validation, which indicates that the model does not suffer from overfitting, i.e., scoring well on the training data but badly on the test data.
Importantly, we do not see any significant difference in the above results, whether we perform the evaluation on the raw, normalized, or standardized data. This is good news, since we can avoid feature scaling without any loss, while at the same time contributing to the fulfillment of the requirements R.2−3, as we will see in the next phase below.
It is worth mentioning that other algorithms (e.g., polynomial regression, random forests) have been tested as well. However, the achieved R² scores were only slightly better, so the increased complexity was not worth it. Further investigations would be necessary, especially with regard to the fulfillment of requirement R.1. This might be subject of future work if, while using this work in practice, it turns out that better models are needed for the statistical model checking part.
Deployment model generation
We use the OLS class from the StatsModels package to generate a deployment model from the entire dataset. We no longer need the cross-validation models or the train/test sets from the previous phase, since they served only to establish confidence that a good model can be obtained from the logfiles. Once we have this confidence, we can use the entire dataset to generate the model.
The parameters \((\mu _{\beta _{k}}, \sigma _{\beta _{k}})\) required by the model (7) belong to the statistics that OLS delivers out of the box along with the model. Thus, the requirement R.1 is fulfilled. In order to fulfill the requirements R.2−3, we abstain from applying feature scaling to the data before training the MLR algorithm to learn the deployment model. The reason is that feature scaling would negatively impact both (i) the model performance when it is used to make predictions at runtime (feature scaling would also have to be applied at runtime to the data for which we want to make predictions), and (ii) the model portability (the same feature transformation used at model training time would need to be performed at runtime as well). Moreover, as shown in the model predictive power evaluation phase, feature scaling does not really make a difference in our case. The simple, linear form of the model (7) contributes to the fulfillment of the requirements R.2−3 as well.
Listing 1 shows an excerpt from a deployment model candidate generated with OLS. In the left column are the intercept and the regressor variables corresponding to [x_{0},...,x_{p}] (including both raw and dummy binary features) from the model (7). The second column shows the estimates of means corresponding to the parameters \(\mu _{\beta _{k}}\). The third column shows the empirical standard errors of the estimates of means, corresponding to the parameters \(\sigma _{\beta _{k}}\). The fourth column contains the t values, i.e., the ratio of estimate and standard error. The p values in the last column describe the statistical significance of the estimates: low p values indicate high significance.
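The quantities in Listing 1 can be reproduced from first principles. The sketch below computes, on toy data, the coefficient estimates (the \(\mu _{\beta _{k}}\)), their standard errors (the \(\sigma _{\beta _{k}}\)), and the t values with plain NumPy; StatsModels' OLS reports the same statistics out of the box:

```python
import numpy as np

def ols_summary(X, y):
    """Coefficient estimates, standard errors, and t values, as reported by OLS."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta = np.linalg.solve(X.T @ X, X.T @ y)    # estimates of means (mu_beta)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]                   # degrees of freedom
    sigma2 = resid @ resid / dof                # residual variance estimate
    # Standard errors of the estimates (sigma_beta).
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, beta / se                  # t value = estimate / std error
```

The p values in the last column of Listing 1 are then obtained from these t values via the t distribution with `dof` degrees of freedom.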
We use the p values to refine the model by excluding features with low significance. For instance, #Attributes in Listing 1 has a high p value, that is, its estimate has low significance. If we generate a new model, this time without the feature #Attributes, we obtain a similar R² score but with a simpler model. We typically iterate several times through the model generation phases by adding/removing features until a good simplicity/R² score trade-off of the model is reached.
The function \(\mathit {sample} : \mathbb {R}_{>0} \times \mathbb {R}_{>0} \rightarrow \mathbb {R}_{>0}\) takes as input the pair (μ_{y},σ_{y}) from the function cost and draws a value from the normal distribution \(\mathcal {N}(\mu _{y}, {\sigma _{y}^{2}})\). It returns this value as the prediction for the response time y of the request r.
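A direct Python rendering of the sample function; the redraw loop that keeps the result in \(\mathbb {R}_{>0}\) is our assumption about how negative draws are handled:

```python
import random

def sample(mu_y, sigma_y):
    """Draw a predicted response time from N(mu_y, sigma_y^2).

    (mu_y, sigma_y) is the pair returned by the cost function for a request.
    """
    y = random.gauss(mu_y, sigma_y)
    # Response times are positive; redraw the rare values outside R_>0
    # (truncation strategy is our assumption, not specified in the paper).
    while y <= 0:
        y = random.gauss(mu_y, sigma_y)
    return y
```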
4.3 Model simulation and verification of the SUT
In order to simulate the interaction of users with the SUT, it is also necessary to model the typical behavior of users. A user profile specifies the typical behavior of a class of users, e.g., the frequency of task executions, the pauses between tasks and the time needed to input data.
This user profile is joined with the cost model in order to obtain a combined model that can be applied to simulate a user. A user population is simulated by executing this model concurrently within one of our SMC properties, which were explained in Section 2.4. The combined model has the semantics of a stochastic timed automaton (Ballarini et al. 2013). Note that for these waiting times we also introduce states in a similar way as for the subtasks, as illustrated in Fig. 4.
In order to estimate the probability of response-time properties, we perform a Monte Carlo simulation with the Chernoff–Hoeffding bound. However, this simulation requires too many samples to be efficiently executed on the SUT, and so we only run it on the model. For example, checking the probability that the response time of all subtasks is under a threshold of 50 ms for each user of a population of 20 users with parameters 𝜖 = 0.05 and δ = 0.01 requires 1060 samples and returns a probability of 0.806, when a test-case length of four tasks is considered.
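The sample count of 1060 follows from the standard Chernoff–Hoeffding bound: to estimate a probability within additive error ε with confidence 1 − δ, it suffices to take N = ⌈ln(2/δ) / (2ε²)⌉ samples. A one-line check reproduces the figure from the text:

```python
import math

def chernoff_samples(eps, delta):
    """Samples needed so that P(|estimate - p| > eps) <= delta
    (Chernoff-Hoeffding bound for a Bernoulli proportion)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))
```

With ε = 0.05 and δ = 0.01 this gives exactly the 1060 samples mentioned above.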
Fortunately, hypothesis testing requires fewer samples and is, therefore, better suited for the evaluation of the SUT. The probability that was computed on the model serves as a hypothesis to check whether the SUT is at least as good. We apply it as alternative hypothesis and select a probability of 0.556 as null hypothesis, which is 0.25 smaller, because we want to be able to reject the hypothesis that the SUT has a smaller probability. Additionally, we want to check that the probability of the SUT is not much higher than the probability computed on the model, in order to assess the prediction power of our model. Hence, we also test a probability of 1, which is about 0.2 larger, as a null hypothesis with a second SPRT and the same alternative hypothesis. By running the SPRT (with 0.01 as type I and II error parameters) for each user of the population, we can check these hypotheses.
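Wald's SPRT for a Bernoulli outcome (did this run satisfy the response-time property?) can be sketched in a few lines. The sketch below is ours, not the FsCheck implementation, instantiated with the first pair of hypotheses from this section (p1 = 0.806, p0 = 0.556) and the 0.01 error parameters:

```python
import math

def sprt(sample_fn, p0, p1, alpha=0.01, beta=0.01, max_samples=10_000):
    """Wald's sequential probability ratio test for a Bernoulli proportion.

    sample_fn() returns 1 if the property held on a run, else 0.
    Returns ("H1" or "H0", number of samples drawn).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 once the LLR exceeds this
    lower = math.log(beta / (1 - alpha))   # accept H0 once the LLR drops below this
    llr = 0.0                              # cumulative log-likelihood ratio
    for n in range(1, max_samples + 1):
        if sample_fn():
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "inconclusive", max_samples
```

The test stops as soon as it has sufficient evidence: with these hypotheses, a run on which the property always holds accepts H1 after only 13 samples.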
The alternative hypotheses were accepted for both SPRTs and for all users, which means that the model’s prediction was accurate. This verification phase on the SUT is indeed efficient: on average, only 17.55 and 11 samples were needed for the first and second SPRT, respectively.
5 Architecture and implementation
In this section, we show the integration of the cost models and user profiles into a collective model, and we illustrate how such models can be simulated with PBT.
We already presented an existing implementation of MBT with FsCheck (Aichernig and Schumi 2016a), which supports automatic form-data generation and EFSMs. Based on this work, we implemented the following extensions in order to support our new method. The first extension is a parser that reads the learned response-time distributions and integrates them into the model. In the previous implementation, we had command instances, which represent the tasks, and generators for different data types (for form data). Now, we introduce new cost generators for sampling costs or response times, which can be applied in the same way as normal generators for form data. During the test-case generation, the generated costs can be evaluated within the commands. This helps to check response-time properties.
Next, the thread is put to sleep for the duration that was generated with the cost function. Then, the number of users is decreased again. Finally, the generated delay is returned so that it can be checked outside the generator. Note that this generator function also applies the generated delay. This is done because the number of active users is needed for the generation of a sample, and in order to know which users are active, this behavior has to be executed directly, so that users are actually active during the generation step. Multiple users are executed concurrently in different threads in an independent way. However, their shared variable ActiveUserNum causes a certain dependency between the user threads, because when one user increases this variable, this affects the response-time distributions of the other users.
The user profiles are also parsed and their user behavior is added to the combined model. The user input durations, which represent the time needed for filling web forms, can be integrated in a similar way as the cost functions by introducing input-duration generators. Their implementation details are omitted, as they work in the same way as cost generators, except that they do not change the number of active users and they use a uniform distribution instead of a normal distribution.
6 Evaluation
We evaluated our method for a web-service application from the automotive domain, which was explained in Section 2.1, and applied it to two major modules of this application, the Test Order Manager and the Test Equipment Manager. Their descriptions are based on previous work (Aichernig and Schumi 2017b), where we performed classical PBT for these modules and also presented the functional models in detail. Now, we present a performance evaluation of this system. We focus on the response times and the number of samples needed, and also present run times of the simulation and testing process.
Settings
The evaluation was performed in a distributed environment at AVL. The TFMS server (version 1.8) was running on a virtual machine with Windows Server 2012, 15 GB RAM, and 7 Intel Xeon E5-2690 v4 2.6 GHz CPUs. The test clients that simulated the users were executed in a separate virtual machine with Windows Server 2008, 6 GB RAM, and 3 Intel Xeon E5-2690 v4 2.6 GHz CPUs. The logs for the cost-model learning were created on these test clients, and they were applied to evaluate our models. For both the test-case generation for the logs and the simulation with SMC, we applied the PBT tool FsCheck version 2.8.2.
6.1 Test Order Manager
As expected, a decrease in the probability that the property holds can be observed when the test-case length or the population size increases. Moreover, the size of the database has an important influence on the response times: they increase when the database size rises. The advantage of the simulation on the model level is that it runs much faster than on the SUT. With a virtual time of 1/10 of the actual time, we can perform simulations within hours that would take days on the SUT.
Table 2: Test Order Manager results of the SUT evaluation with the SPRT

| Threshold [ms] | #Users | CumulativeObjSize | H_1 | 1st H_0 | Result | #Samples | 2nd H_0 | Result | #Samples | Runtime [min:s] |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 5 | 0 | 0.728 | 0.478 | H_1 | 20.6 | 0.978 | 1×H_1, 4×H_0 | 14.6 | 17:56 |
| 50 | 25 | 0 | 0.653 | 0.403 | H_1 | 11.76 | 0.902 | H_0 | 29.48 | 51:30 |
| 50 | 45 | 0 | 0.456 | 0.206 | H_1 | 7.68 | 0.705 | H_0 | 17 | 22:02 |
| 100 | 5 | 0 | 0.997 | 0.747 | H_1 | 16 | – | – | – | 11:22 |
| 100 | 25 | 0 | 0.995 | 0.745 | H_1 | 16 | – | – | – | 11:56 |
| 100 | 45 | 0 | 0.986 | 0.736 | H_1 | 17.84 | – | – | – | 22:08 |
| 100 | 5 | 80,000,000 | 0.428 | 0.178 | H_1 | 27 | 0.678 | H_1 | 37.6 | 28:46 |
| 100 | 25 | 80,000,000 | 0.425 | 0.175 | H_1 | 35.56 | 0.675 | H_1 | 29.16 | 38:54 |
| 100 | 45 | 80,000,000 | 0.419 | 0.169 | 7×H_0, 38×H_1 | 62.82 | 0.669 | H_1 | 18.22 | 55:37 |
| 150 | 5 | 80,000,000 | 1 | 0.75 | 2×H_0, 3×H_1 | 12.8 | – | – | – | 10:25 |
| 150 | 25 | 80,000,000 | 0.999 | 0.749 | 16×H_0, 9×H_1 | 9.64 | – | – | – | 22:40 |
| 150 | 45 | 80,000,000 | 0.972 | 0.722 | H_0 | 5.78 | – | – | – | 6:52 |
| 200 | 5 | 80,000,000 | 1 | 0.75 | H_1 | 16 | – | – | – | 11:32 |
| 200 | 25 | 80,000,000 | 1 | 0.75 | H_1 | 12.72 | – | – | – | 11:56 |
| 200 | 45 | 80,000,000 | 1 | 0.75 | H_1 | 13.82 | – | – | – | 32:46 |
The table shows the hypotheses and evaluation results for different thresholds, different numbers of users, and the two database fill levels (CumulativeObjSize). As explained in Section 4.3, we perform two SPRTs, one to check that the SUT is not much worse than the model, and one to check that the SUT is not much better than the model. The alternative hypothesis H_1 is produced via the model simulation and is the same in both SPRTs, but the null hypotheses are different (smaller or larger). As a result, we report the accepted hypotheses and, when it was not always the same hypothesis, how often each was accepted. Moreover, we show the number of samples that were needed for the SPRT (#Samples) and the run time of this evaluation. We only perform one SPRT if the predicted probability of the model is close to one or zero, because then we are already close enough to the min./max. probability.
Note that in order to obtain an average number of needed samples, we run the SPRT concurrently for each user of the population and calculate the average over these runs. Multiple independent SPRT runs would produce a better average, but the computation time was too high and we only had limited time in the test environment. Compared to the execution on the model, a smaller number of samples is needed, as the SPRT stops as soon as it has sufficient evidence.
We can see that in many cases, the alternative hypotheses were accepted, which means that the predicted probability was close enough to the real probability of the SUT. In some cases the null hypothesis was accepted, which means that our model was too optimistic or pessimistic in these cases. We will discuss this later in Section 7.
Moreover, it is apparent that the smaller number of required samples of the SPRT (at most ca. 62) compared to the Monte Carlo simulation (1060 samples) allowed us to analyze the SUT within a feasibly short time. For example, in the worst case it took only about an hour to apply the SPRT.
6.2 Test Equipment Manager
The Test Equipment Manager is another important module of our SUT. It enables the administration of equipment that is relevant for the test beds, like measurement devices, sensors, actuators, and various input/output modules. All this test equipment can be created, edited, calibrated, and maintained. A hierarchy of test equipment types is used to classify the test equipment. Test configurations, which are compositions of different test equipment, can also be administrated. The connection of devices via channels can be controlled with this module, too.
Table 3: Test Equipment Manager results of the SUT evaluation with the SPRT

| Threshold [ms] | #Users | CumulativeObjSize | H_1 | 1st H_0 | Result | #Samples | 2nd H_0 | Result | #Samples | Runtime [min:s] |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 5 | 0 | 0.973 | 0.723 | H_1 | 19.6 | – | – | – | 10:52 |
| 50 | 25 | 0 | 0.936 | 0.686 | H_1 | 16.2 | – | – | – | 9:46 |
| 50 | 45 | 0 | 0.671 | 0.421 | H_1 | 11.111 | 0.921 | H_0 | 18.578 | 13:11 |
| 100 | 5 | 0 | 1 | 0.75 | H_1 | 16 | – | – | – | 5:46 |
| 100 | 25 | 0 | 0.998 | 0.748 | H_1 | 16 | – | – | – | 6:10 |
| 100 | 45 | 0 | 0.962 | 0.712 | H_1 | 16 | – | – | – | 7:49 |
| 100 | 5 | 30,000,000 | 0.013 | 0.125 | H_0 | 14 | 0.625 | H_1 | 9 | 3:39 |
| 100 | 25 | 30,000,000 | 0.114 | – | – | – | 0.364 | H_1 | 14.4 | 2:59 |
| 100 | 45 | 30,000,000 | 0.014 | – | – | – | 0.264 | H_1 | 16.244 | 3:11 |
| 150 | 5 | 30,000,000 | 0.999 | 0.749 | H_0 | 5 | – | – | – | 3:19 |
| 150 | 25 | 30,000,000 | 0.82 | 0.57 | H_0 | 6 | 0.82 | H_1 | 5 | 3:55 |
| 150 | 45 | 30,000,000 | 0.137 | 0.387 | H_0 | 14.266 | – | – | – | 3:15 |
| 200 | 5 | 30,000,000 | 1 | 0.75 | 3×H_0, 2×H_1 | 12.8 | – | – | – | 4:41 |
| 200 | 25 | 30,000,000 | 0.997 | 0.747 | H_0 | 5.68 | – | – | – | 4:58 |
| 200 | 45 | 30,000,000 | 0.496 | 0.246 | H_0 | 13.555 | 0.746 | H_1 | 7.688 | 6:00 |
6.3 Run times of the method
Our method consists of several phases that have different computation times. Here, we give an overview of the timings of these phases in order to illustrate the overall run time of our method and to demonstrate its effectiveness.
In the first step, we generate log data with model-based testing. This initial testing phase took about an hour for each of the tested modules, i.e., about 63 min for the Test Order Manager and about 65 min for the Test Equipment Manager. The next step was the cost-model learning, which took only about 70 to 100 seconds, including the time for data cleaning and preprocessing.
Average simulation time [min:s] of the model for the Test Order Manager and the Test Equipment Manager for an empty and filled database

| #Users | Test-case length | Test Order Manager (empty DB) | Test Equipment Manager (empty DB) | Test Order Manager (filled DB) | Test Equipment Manager (filled DB) |
|---|---|---|---|---|---|
| 5 | 3 | 9:24 | 6:40 | 9:23 | 6:40 |
| 25 | 3 | 9:31 | 6:51 | 9:41 | 6:51 |
| 45 | 3 | 9:37 | 7:08 | 9:45 | 7:08 |
| 5 | 4 | 12:46 | 8:56 | 12:45 | 8:57 |
| 25 | 4 | 12:52 | 9:09 | 12:58 | 9:09 |
| 45 | 4 | 13:02 | 9:37 | 13:04 | 9:37 |
The last columns of Tables 2 and 3 show the run times of the SPRTs. Note that during the execution of a sample, we stopped as soon as we observed a response time above our threshold, and there is only one run time for both SPRTs, since we check them in one execution. The run times for the Test Order Manager were about 1 h in two cases. All other cases mostly took less than half an hour, and the best cases about 10 min. The run times for the Test Equipment Manager were shorter due to its lower complexity. They were always below 15 min and in the best cases about 3 min.
Executing the Monte Carlo simulation that we applied for the model directly on the SUT would take about one day. By applying the SPRT, we can perform such an evaluation within less than an hour in the worst case.
7 Discussion
 (1)
Measurement errors. Some noise factors, e.g., variable network latency, memory cache misses, blocking effects of the SUT, etc., might have artificially and unevenly increased the actual response times recorded in the logfiles. We could still obtain a reasonably high R² score if, by chance, we were able to identify some linear dependencies in the log data. However, the predictions at runtime are then not as good as indicated during the model predictive power evaluation phase, simply because the same noise factors do not apply at runtime.
Measurement errors are significantly lower with a non-distributed environment setup, where our method generally achieves better results. For this paper, however, we selected the less favorable case of the distributed environment setup.
 (2)
Sampling bias. The simulation for generating the logfiles might unintentionally be designed and set up in such a way that not all relevant scenarios were equally likely to be simulated. That is, the logfiles do not contain equally many examples of all relevant scenarios. Thus, the model “learns” only the dominant dependencies present in the logfiles and fails to make good predictions for samples with dependencies not represented in the logfiles.
Additionally, some false dependency might be derived from the (biased) logfiles which does not hold in general. For instance, if the number of concurrently active users is monotonically increased instead of being randomly selected during the simulation for generating the logfiles, then a misleading positive correlation between the number of active users and the database size arises, which does not hold in general. While carefully analyzing the log data, e.g., by means of correlation matrices and scatter plots, helps to reduce the risk of sampling bias, we generally cannot avoid it completely.
A threat to validity might be that one case study with only a specific system cannot show the applicability or generality of our approach. In order to address this threat, we have also applied our method to another application domain, i.e., a performance comparison of different MQTT brokers (Aichernig and Schumi 2018). However, evaluations in further application areas would still be interesting future work.
An interesting observation, which might be seen also as a weakness of our approach, is that SMC seems to be inefficient when the given threshold of the responsetime property to be tested is far below or far above the actual response time. In these cases, the probability of the responsetime property does not vary in a significant way with the user population size. SMC wastefully computes the probability for various user population sizes, even if a single run with a fixed user population size, say one user, would be sufficient to get a similar result. This phenomenon can be clearly observed in Fig. 10, where the probability curves of different user population sizes are very close to each other for low and high thresholds, whereas they only go apart for thresholds close to the actual response times where the user population size seems to make a difference.
Finally, efforts to improve the prediction model accuracy, e.g., through nonlinear learning methods, might be subject of future work if, while using the presented method in practice, it turns out that better prediction models are needed.
8 Conclusion
We have demonstrated that we can exploit PBT features in order to check response-time properties under different user populations, both on the model level and on an SUT. With SMC, we can evaluate stochastic cost models and check properties like: what is the probability that the response time of a user within a population is below a certain threshold? We also showed that we can test the accuracy of such probability estimations on the SUT without the need for an extra tool. A big advantage of our method is that we can perform simulations that require a high number of samples on the model in a fraction of the time that would be required on the SUT. Moreover, we can check the results of such simulations on the SUT by applying the SPRT, which needs fewer samples. Another benefit lies in the fact that we simulate inside a PBT tool. This facilitates the model and property definition in a high-level programming language, which makes our method more accessible to testers from industry.
We have evaluated our method by applying it to an industrial web-service application from the automotive industry, and the results were promising. First, we presented the learning process for our cost models in detail. Then, we showed that we can apply these cost models to derive probabilities for response-time properties for different population sizes, and that we can evaluate these probabilities on the real system with a smaller number of samples. In principle, our method can be applied outside the web domain, e.g., to evaluate runtime requirements of real-time or embedded systems. However, for other applications and other types of costs, alternative cost-learning techniques (Hastie et al. 2009; West et al. 2006) may be better suited.
In the future, we plan to apply our cost models for stress testing as they help to find subtasks or attributes that are more computationally expensive than others.
Moreover, we intend to apply our method to evaluate different versions of the SUT, i.e., to perform nonfunctional regression testing.
Funding information
Open access funding provided by Graz University of Technology. This work was funded by the Austrian Research Promotion Agency (FFG), project no. 845582 Trust via cost function driven modelbased test case generation for nonfunctional properties of systems of systems (TRUCONF).
References
 Aichernig, B.K., & Schumi, R. (2016a). Propertybased testing with FsCheck by deriving properties from business rule models. In Ninth IEEE international conference on software testing, verification and validation workshops, ICST Workshops 2016, Chicago, IL, USA, April 1115, 2016 (pp. 219–228). IEEE Computer Society.Google Scholar
 Aichernig, B.K., & Schumi, R. (2016b). Towards integrating statistical model checking into propertybased testing. In 2016 ACM/IEEE international conference on formal methods and models for system design, MEMOCODE 2016, Kanpur, India, November 1820, 2016 (pp. 71–76). IEEE.Google Scholar
 Aichernig, B.K., & Schumi, R. (2017a). Statistical model checking meets property-based testing. In 2017 IEEE International conference on software testing, verification and validation, ICST 2017, Tokyo, Japan, March 13–17, 2017 (pp. 390–400). IEEE Computer Society.
 Aichernig, B.K., & Schumi, R. (2017b). Property-based testing of web services by deriving properties from business-rule models. Software & Systems Modeling.
 Aichernig, B.K., & Schumi, R. (2018). How fast is MQTT? Statistical model checking and testing of IoT protocols. In Quantitative evaluation of systems – 15th international conference, QEST 2018, Beijing, China, September 4–7, 2018, proceedings, volume 11024 of lecture notes in computer science (pp. 36–52). Springer.
 Arts, T. (2014). On shrinking randomly generated load tests. In Proceedings of the Thirteenth ACM SIGPLAN workshop on Erlang, Gothenburg, Sweden, September 5, 2014 (pp. 25–31). ACM.
 Ballarini, P., Bertrand, N., Horváth, A., Paolieri, M., Vicario, E. (2013). Transient analysis of networks of stochastic timed automata using stochastic state classes. In Quantitative evaluation of systems – 10th international conference, QEST 2013, Buenos Aires, Argentina, August 27–30, 2013, proceedings, volume 8054 of lecture notes in computer science (pp. 355–371). Springer.
 Balsamo, S., Marco, A.D., Inverardi, P., Simeoni, M. (2004). Model-based performance prediction in software development: a survey. IEEE Transactions on Software Engineering, 30(5), 295–310.
 Becker, S., Koziolek, H., Reussner, R.H. (2009). The Palladio component model for model-driven performance prediction. Journal of Systems and Software, 82(1), 3–22.
 Book, M., Gruhn, V., Hülder, M., Köhler, A., Kriegel, A. (2005). Cost and response time simulation for web-based applications on mobile channels. In Proceedings fifth international conference on quality software (QSIC 2005), 19–20 September 2005, Melbourne, Australia (pp. 83–90). IEEE Computer Society.
 Bulychev, P.E., David, A., Larsen, K.G., Mikucionis, M., Poulsen, D.B., Legay, A., Wang, Z. (2012). UPPAAL SMC: statistical model checking for priced timed automata. In Proceedings 10th workshop on quantitative aspects of programming languages and systems, QAPL 2012, Tallinn, Estonia, 31 March and 1 April 2012, volume 85 of EPTCS (pp. 1–16). Open Publishing Association.
 Claessen, K., & Hughes, J. (2000). QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the Fifth ACM SIGPLAN international conference on functional programming (ICFP'00), Montreal, Canada, September 18–21, 2000 (pp. 268–279). ACM.
 Claessen, K., Palka, M.H., Smallbone, N., Hughes, J., Svensson, H., Arts, T., Wiger, U.T. (2009). Finding race conditions in Erlang with QuickCheck and PULSE. In Proceedings of the 14th ACM SIGPLAN international conference on functional programming, ICFP 2009, Edinburgh, Scotland, UK, August 31 – September 2, 2009 (pp. 149–160). ACM.
 Draheim, D., Grundy, J.C., Hosking, J.G., Lutteroth, C., Weber, G. (2006). Realistic load testing of web applications. In Proceedings of the 10th European conference on software maintenance and reengineering (CSMR 2006), Bari, Italy, 22–24 March 2006 (pp. 57–70). IEEE.
 Govindarajulu, Z. (2004). Sequential statistics. World Scientific.
 Grinchtein, O. (2008). Learning of timed systems. PhD thesis, Uppsala University, Sweden.
 Halili, E.H. (2008). Apache JMeter: a practical beginner's guide to automated testing and performance measurement for your websites. Packt Publishing Ltd.
 Hastie, T., Tibshirani, R., Friedman, J.H. (2009). The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer.
 Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
 Hughes, J. (2007). QuickCheck testing for fun and profit. In Practical aspects of declarative languages, 9th international symposium, PADL 2007, Nice, France, January 14–15, 2007, volume 4354 of lecture notes in computer science (pp. 1–32). Springer.
 Hughes, J., Pierce, B.C., Arts, T., Norell, U. (2016). Mysteries of Dropbox: property-based testing of a distributed synchronization service. In 2016 IEEE International conference on software testing, verification and validation, ICST 2016, Chicago, IL, USA, April 11–15, 2016 (pp. 135–145). IEEE Computer Society.
 Jaccard, J., Turrisi, R., Jaccard, J. (2003). Interaction effects in multiple regression. SAGE.
 Jinyuan, C. (2012). The application of LoadRunner in software performance test. Computer Development & Applications, 5, 014.
 Kalaji, A.S., Hierons, R.M., Swift, S. (2009). Generating feasible transition paths for testing from an extended finite state machine (EFSM). In Second international conference on software testing verification and validation, ICST 2009, Denver, Colorado, USA, April 1–4, 2009 (pp. 230–239). IEEE Computer Society.
 Krishnamurthy, D., Rolia, J.A., Majumdar, S. (2006). A synthetic workload generation technique for stress testing session-based systems. IEEE Transactions on Software Engineering, 32(11), 868–882.
 Legay, A., & Sedwards, S. (2014). On statistical model checking with PLASMA. In 2014 Theoretical aspects of software engineering conference, TASE 2014, Changsha, China, September 1–3, 2014 (pp. 139–145). IEEE Computer Society.
 Legay, A., Delahaye, B., Bensalem, S. (2010). Statistical model checking: an overview. In Runtime verification – first international conference, RV 2010, St. Julian's, Malta, November 1–4, 2010, proceedings, volume 6418 of lecture notes in computer science (pp. 122–135). Springer.
 Lu, Y., Nolte, T., Bate, I., Cucu-Grosjean, L. (2012). A statistical response-time analysis of real-time embedded systems. In Proceedings of the 33rd IEEE real-time systems symposium, RTSS 2012, San Juan, PR, USA, December 4–7, 2012 (pp. 351–362). IEEE Computer Society.
 Menascé, D.A. (2002). Load testing of web sites. IEEE Internet Computing, 6(4), 70–74.
 Nilsson, R. (2014). ScalaCheck: the definitive guide. IT Pro, Artima Incorporated.
 Norell, U., Svensson, H., Arts, T. (2013). Testing blocking operations with QuickCheck's component library. In Proceedings of the Twelfth ACM SIGPLAN Erlang Workshop, Boston, Massachusetts, USA, September 28, 2013 (pp. 87–92). ACM.
 Papadakis, M., & Sagonas, K. (2011). A PropEr integration of types and function specifications with property-based testing. In Proceedings of the 10th ACM SIGPLAN workshop on Erlang, Erlang'11, Tokyo, Japan, September 23, 2011 (pp. 39–50). ACM.
 Pooley, R., & King, P. (1999). The unified modelling language and performance engineering. IEE Proceedings – Software, 146(1), 2–10.
 Rencher, A., & Christensen, W. (2012). Methods of multivariate analysis. Wiley series in probability and statistics, 3rd edn. Wiley.
 Rina, & Tyagi, S. (2013). A comparative study of performance testing tools. International Journal of Advanced Research in Computer Science and Software Engineering, 3(5), 1300–1307.
 Schumi, R., Lang, P., Aichernig, B.K., Krenn, W., Schlick, R. (2017). Checking response-time properties of web-service applications under stochastic user profiles. In Testing software and systems – 29th IFIP WG 6.1 International Conference, ICTSS 2017, St. Petersburg, Russia, October 9–11, 2017, proceedings, volume 10533 of lecture notes in computer science (pp. 293–310). Springer.
 Smith, C.U. (1990). Software performance engineering tutorial. In 16th International Computer Measurement Group Conference, December 10–14, 1990, Orlando, FL, USA, proceedings (pp. 1311–1318). Computer Measurement Group.
 Smith, C.U., & Williams, L.G. (1997). Performance engineering evaluation of object-oriented systems with SPE·ED. In Computer performance evaluation: modelling techniques and tools, 9th International Conference, St. Malo, France, June 3–6, 1997, proceedings, volume 1245 of lecture notes in computer science (pp. 135–154). Springer.
 Tang, J., Alelyani, S., Liu, H. (2014). Feature selection for classification: a review. In Data classification: algorithms and applications (pp. 37–64). CRC Press.
 Verwer, S., de Weerdt, M., Witteveen, C. (2010). A likelihood-ratio test for identifying probabilistic deterministic real-time automata from positive data. In Grammatical inference: theoretical results and applications, 10th international colloquium, ICGI 2010, Valencia, Spain, September 13–16, 2010, proceedings, volume 6339 of lecture notes in computer science (pp. 203–216). Springer.
 Vinayak Hegde, P.M.S. (2014). Web performance testing: methodologies, tools and challenges. International Journal of Scientific Engineering and Research (IJSER), 2.
 Wald, A. (1973). Sequential analysis. Courier Corporation.
 West, B.T., Welch, K.B., Galecki, A.T. (2006). Linear mixed models. CRC Press.
 Woodside, C.M., Franks, G., Petriu, D.C. (2007). The future of software performance engineering. In International Conference on Software Engineering, ICSE 2007, Workshop on the Future of Software Engineering, FOSE 2007, May 23–25, 2007, Minneapolis, MN, USA (pp. 171–187). IEEE Computer Society.
 Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 20, 557–585.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.