1 Introduction

Software testing is an important phase of software development, requiring considerable resources such as time and equipment. This is especially true for large companies that employ many software developers and rely on constant regression testing. Thus, there is a need to automate and optimize the testing process in order to reduce the resources it requires.

Software testing systems are sometimes implemented using cloud computing. Such systems are often referred to as Testing-as-a-Service (TaaS) and are an extension of the Software-as-a-Service (SaaS) cloud computing model. TaaS systems can be very complex: one example of a complete architecture of a TaaS system can be found in the paper by Yu et al. [11]. Software testing over the cloud is a relatively new concept that has gained popularity over the last decade. This has resulted in a number of papers, with topics ranging from discussions of the challenges of TaaS [2] to practical considerations [7]. The idea of TaaS systems is still expanding, and new possibilities are being discovered and researched. For example, Tsai et al. suggested that, thanks to the multi-machine cloud environment, TaaS systems can be effectively employed for combinatorial testing, and presented a TaaS design based on Test Algebra and Adaptive Reasoning [10]. For a more thorough treatment of software testing over the cloud in general, please consult the paper by Inçki et al. [3].

One of the components of a TaaS system is the distribution of the workload (software tests) over the available cloud resources, which is understood as a form of cloud scheduling (or load balancing). Workload distribution in cloud environments is an older and more mature field of research with many approaches, including metaheuristic methods like Ant Colony Optimization [6] or fuzzy sets [9]. For further reading on various approaches to scheduling and load balancing in cloud environments, please refer to [1, 4].

In this paper we consider the TaaS system described in [5]. This system uses a proprietary off-premises Amazon-compatible cloud service based on the Eucalyptus software and operated by an external vendor. The cloud is enhanced by an on-premises software testing toolset. Because of the complexity of the system, we focus on the problem of distributing the workload among the cloud machines so as to minimize the time required for testing (flowtime). The first aim of this paper is to develop a mathematical model for describing and predicting the activity of the system. Once such a model has been constructed, it can be used to generate problem instances of arbitrary length. A sufficient number of such instances then allows a reliable comparison of the effectiveness of constructive algorithms, which is the second aim of this paper.

The remainder of this paper is structured as follows. Section 2 contains the mathematical formulation of the problem as well as a number of parametric and empirical distributions for modeling system parameters. Section 3 briefly describes a problem instance generator based on the developed models. Section 4 describes the constructive algorithms considered in this paper. Section 5 contains the description of the computer experiment and the results of the comparison of the aforementioned constructive algorithms. Finally, Sect. 6 contains conclusions.

2 Mathematical Models

In the first part of this section we present the mathematical model of the considered TaaS system in general. Such a description can be written from several points of view, including (1) queueing theory, (2) load balancing and (3) scheduling. Due to certain properties of the system (like job weights, setup times and different numbers of operations per job) we have decided to present the considered problem as a variation of the online scheduling problem. In the second and third parts of the section we present statistical and empirical models built using data accumulated from the operation of the real-life system.

2.1 Problem Formulation

There is a set \(\mathcal {J} = \{ 1 , 2 , \dots {} , n \}\) of n jobs (which represent test suites – groups of software test cases to be tested). Each job is made up of operations (which represent individual test cases). Let \(\mathcal {O}_j = \{ 1 , 2 , \dots {} , o_j \}\) be the set of operations of job j; the size of this set is \(o_j\). Next, let \(p^i_j\), \(t^i_j\) and \(r^i_j\) be the processing time, type and arrival (ready) time of the i-th operation of the j-th job, respectively.

If two operations are from the same job, they have the same arrival time. Moreover, if two operations have the same type, they have the same processing time (this represents the same piece of code being scheduled for testing twice). In other words, if a and b are operations from jobs j and k respectively, then:

$$\begin{aligned} j = k \implies r^a_j = r^b_k , \end{aligned}$$
(1)
$$\begin{aligned} t^a_j = t^b_k \implies p^a_j = p^b_k . \end{aligned}$$
(2)

Next, let \(\mathcal {M} = \{ 1 , 2 , \dots {} , m \}\) be a set of m identical machines. Each operation has to be assigned to one machine. Any machine can process any operation. However, one operation can be processed by at most one machine at a time and one machine can process at most one operation at a time. The processing of operations cannot be interrupted (no preemption). Moreover, the processing of each operation has to be preceded by a setup time, which represents the time needed to transfer the necessary files to the given machine. The setup time is 30 s (measured to be enough to transfer the needed files to the target machine) if the machine has not processed any operations yet or if the previously processed operation belonged to a different job. Otherwise the setup time is 0 s (i.e. all files required for the next operation are already present on the target machine).
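As an illustration, the setup-time rule can be sketched as follows (a minimal sketch; the function and argument names are ours, not taken from the system):

```python
def setup_time(machine_last_job, next_job, setup=30):
    """Return the setup time (in seconds) before processing an operation
    of `next_job` on a machine whose previously processed operation
    belonged to `machine_last_job` (None if the machine has not
    processed anything yet)."""
    # A fresh machine, or a switch to a different job, requires the
    # 30 s file transfer; otherwise the files are already in place.
    if machine_last_job is None or machine_last_job != next_job:
        return setup
    return 0
```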

Let \(s^i_j\) be the starting time of processing of the i-th operation of job j. Then the completion time \(c^i_j\) of that operation is given as:

$$\begin{aligned} c^i_j = s^i_j + p^i_j . \end{aligned}$$
(3)

The goal is to choose such starting times for all operations as to minimize the average flowtime of all jobs \(\bar{f}\), given as:

$$\begin{aligned} \bar{f} = \frac{1}{n} \sum _{j = 1}^{n} \; \max _{i \in \mathcal {O}_j} \left( c^i_j - r^i_j \right) . \end{aligned}$$
(4)
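The objective in Eq. (4) can be computed directly from a schedule; below is a minimal sketch (the data structure used here is a hypothetical choice of ours):

```python
def average_flowtime(jobs):
    """Average flowtime (Eq. 4) of a schedule.

    `jobs` maps each job to a list of (completion_time, arrival_time)
    pairs, one per operation; since all operations of a job share the
    same arrival time (Eq. 1), the flowtime of a job is its maximum
    completion time minus that arrival time."""
    total = 0.0
    for ops in jobs.values():
        total += max(c for c, _ in ops) - ops[0][1]
    return total / len(jobs)
```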

It is important to notice that the considered problem is online. This means that information about jobs and operations becomes available only when they arrive. In particular, the number of jobs is generally unknown (in fact, jobs are expected to continue arriving indefinitely). Moreover, the processing time \(p^i_j\) is only known if another operation of the same type has already finished processing. Otherwise, \(p^i_j\) is unknown even after the operation arrives. Let us also notice that operations do not have to be assigned to a machine immediately after their arrival and can be kept in a queue. This, together with the half-known processing times, distinguishes this problem from simple load balancing.

2.2 Job Arrival Time Estimation

For the purpose of modeling the job arrival time distribution, data from 5 months of Monday–Friday operation of a real-life TaaS system was gathered. This data contained \(177\,167\) jobs. The aggregated histogram (each point represents the sum over a 600 s interval) of in-day arrival times of jobs is shown in Fig. 1. Two distinct bell-like peaks are clearly visible. Our hypothesis is that this is caused by two large distinct teams working in different timezones. There is also a third team, but it is so small that its contribution is negligible. It is also important to note that, due to the specific properties of the system and its automation, it is very difficult to determine which job was run by which team. This forces us to treat the distribution of arrival times as a mixture (multimodal) distribution, instead of estimating each peak separately.

Fig. 1. Histogram of in-week job arrival time

We considered parametric models from 5 distribution families:

  1.

    Bimodal normal distribution (5 parameters):

    $$\begin{aligned} N(x\,|\,\mu _1,\sigma _1,\mu _2,\sigma _2,\alpha ) = \alpha \mathcal {N} (x\,|\,\mu _1,\sigma ^2_1) + (1 - \alpha ) \mathcal {N} (x\,|\,\mu _2,\sigma ^2_2), \end{aligned}$$
    (5)

    where \(\mathcal {N} (x\,|\,\mu ,\sigma ^2)\) is the normal distribution with location parameter \(\mu \) and scale parameter \(\sigma \), evaluated at point x.

  2.

    Bimodal Weibull distribution (7 parameters):

    $$\begin{aligned}&W(x\,|\,k_1,\lambda _1,o_1,k_2,\lambda _2,o_2,\alpha ) \nonumber \\&\qquad \qquad \qquad \qquad = \alpha \mathcal {W} (x-o_1\,|\,k_1,\lambda _1) + (1 - \alpha ) \mathcal {W} (x-o_2\,|\,k_2,\lambda _2), \end{aligned}$$
    (6)

    where \(\mathcal {W} (x\,|\,k,\lambda )\) is the Weibull distribution with shape parameter k and scale parameter \(\lambda \), evaluated at point x. The offset parameters \(o_1\) and \(o_2\) allow the Weibull distributions to start later than at \(x=0\).

  3.

    Uni-shaped bimodal Weibull distribution (6 parameters):

    $$\begin{aligned}&W_S(x\,|\,k,\lambda _1,o_1,\lambda _2,o_2,\alpha ) \nonumber \\&\qquad \qquad \qquad \qquad = \alpha \mathcal {W} (x-o_1\,|\,k,\lambda _1) + (1 - \alpha ) \mathcal {W} (x-o_2\,|\,k,\lambda _2). \end{aligned}$$
    (7)

    This is similar to the previous model, except that \(k_1=k_2=k\). This model assumes that the underlying process is the same for both teams, hence the same shape for both modes.

  4.

    Bimodal Gamma distribution (7 parameters):

    $$\begin{aligned}&\varGamma (x\,|\,k_1,\theta _1,o_1,k_2,\theta _2,o_2,\alpha ) \nonumber \\&\qquad \qquad \qquad \qquad = \alpha \mathcal {\varGamma } (x-o_1\,|\,k_1,\theta _1) + (1 - \alpha ) \mathcal {\varGamma } (x-o_2\,|\,k_2,\theta _2), \end{aligned}$$
    (8)

    where \(\mathcal {\varGamma } (x\,|\,k,\theta )\) is the Gamma distribution with shape parameter k and scale parameter \(\theta \), evaluated at point x. The offset parameters serve the same purpose as in the Weibull model.

  5.

    Uni-shaped bimodal Gamma distribution (6 parameters):

    $$\begin{aligned}&\varGamma _S(x\,|\,k,\theta _1,o_1,\theta _2,o_2,\alpha ) \nonumber \\&\qquad \qquad \qquad \qquad = \alpha \mathcal {\varGamma } (x-o_1\,|\,k,\theta _1) + (1 - \alpha ) \mathcal {\varGamma } (x-o_2\,|\,k,\theta _2). \end{aligned}$$
    (9)

    This is similar to the previous model, except that \(k_1=k_2=k\).

The parameters were estimated using a simple Monte Carlo method. The starting parameter ranges were set empirically, after which the method worked in stages. At each stage, parameters were sampled and the resulting distributions were evaluated; each stage used tens to hundreds of thousands of samples. After each stage, the range of each parameter was reduced, based on the best parameters found in that stage. The goodness of fit of the models was evaluated using the Kolmogorov-Smirnov (K-S) statistic. The results are shown in Table 1.
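The stage-wise Monte Carlo estimation described above can be sketched as follows, here for the \(W_S\) model of Eq. (7). This is an illustration only: the function names, the default number of stages and samples, and the exact range-shrinking rule are our assumptions, not the precise procedure used in the paper.

```python
import math
import random

def ws_cdf(x, k, l1, o1, l2, o2, alpha):
    """CDF of the uni-shaped bimodal Weibull mixture W_S (Eq. 7)."""
    def w(y, lam):
        return 1.0 - math.exp(-(y / lam) ** k) if y > 0 else 0.0
    return alpha * w(x - o1, l1) + (1.0 - alpha) * w(x - o2, l2)

def ks_statistic(data, cdf):
    """Kolmogorov-Smirnov statistic of `data` against a model `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

def monte_carlo_fit(data, ranges, stages=3, samples=100, shrink=0.5):
    """Stage-wise Monte Carlo search: sample parameter vectors uniformly
    from `ranges`, keep the best K-S fit, then narrow each range around
    the best value found before the next stage."""
    best, best_ks = None, float("inf")
    for _ in range(stages):
        for _ in range(samples):
            p = [random.uniform(lo, hi) for lo, hi in ranges]
            ks = ks_statistic(data, lambda x: ws_cdf(x, *p))
            if ks < best_ks:
                best, best_ks = p, ks
        # shrink every range around the best parameters found so far,
        # staying inside the original bounds
        half = [shrink * (hi - lo) / 2 for lo, hi in ranges]
        ranges = [(max(lo, b - h), min(hi, b + h))
                  for b, h, (lo, hi) in zip(best, half, ranges)]
    return best, best_ks
```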

Table 1. Results of parameter estimation

We see that both Weibull models yielded the best results (K-S statistic below 0.01), while the Gamma distributions had the worst results. Let us also note that the K-S statistic is used strictly for comparing the models, as all of the proposed models were rejected by the Kolmogorov-Smirnov test: for \(177\,167\) observations (jobs), the critical value was 0.003226327, which is 3 times smaller than the value obtained for the \(W_S\) model. We suspect that this is caused by insufficient data – while there are \(177\,167\) data points, there are also \(86\,400\) possible values for each observation, yielding slightly over 2 observations per value on average.

Another hypothesis was that a better fit could be achieved by modeling the lunch breaks of employees – we believed those lunch breaks were responsible for the visible drop in activity at the top of the bigger peak in Fig. 1. However, attempts to fit the data to a 4-modal distribution (2 modes for the big peak and 2 for the small one) did not yield better results, suggesting that the underlying process is more complex (i.e. involves additional, non-lunch breaks). Unfortunately, the available company data does not contain information that could help model the breaks (especially for the second team).

Fig. 2. Histogram of operations per job

Non-parametric statistical models were also used, and a good fit (passing the Kolmogorov-Smirnov test) was achieved with a 23-modal normal distribution (68 parameters in total). However, the number of modes differed between the big and small peaks. Moreover, the resulting model was sensitive to any changes caused by adding more observations. This, coupled with the large number of parameters, made us disregard that model.

To summarize, we chose \(W_S\) (the 6-parameter uni-shaped bimodal Weibull distribution) as the model of system activity (job arrival times) based on the currently available TaaS system data.

2.3 Empirical Distributions

Aside from arrival times, other system properties have to be modeled. These include: (1) operation processing time, (2) operation type and (3) number of operations per job. However, those properties are less regular and thus more difficult to model. For example, the histogram of the number of operations per job (clipped for readability) is shown in Fig. 2. Because of this, at present those properties are modeled through empirical distributions computed directly from accumulated system data. This time the data set contained \(182\,592\) jobs and \(21\,987\,210\) operations.
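Sampling from such an empirical distribution amounts to drawing observed values with their observed frequencies; a minimal sketch (the helper name is ours):

```python
import random
from collections import Counter

def empirical_sampler(observations):
    """Build a sampler for the empirical distribution of `observations`
    (e.g. the observed numbers of operations per job)."""
    counts = Counter(observations)
    values = list(counts)
    weights = [counts[v] for v in values]
    # each call draws one value with probability proportional to
    # how often it occurred in the data
    return lambda: random.choices(values, weights=weights)[0]
```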

3 Instance Generator

In this short section we will describe the generator of problem instances developed from the models shown in Sect. 2 and the benefits it grants over using data gathered directly from the system.

The generator accepts two input parameters: (1) the number of jobs to generate, N, and (2) the number of jobs per day, D. If \(N \equiv k \pmod D\), then the generated instance will have \(\lfloor \frac{N}{D} \rfloor \) days of D jobs and one day with k jobs. For \(k = 0\) this simplifies to \(\frac{N}{D}\) days, each with D jobs. On full days (i.e. days with the number of jobs equal to D) the arrival time of each job is drawn from the \(W_S\) distribution presented earlier with no restrictions. Thus, each job has a 0.9208 or 0.0792 chance of being drawn from the big or small peak, respectively. The resulting times (in-day seconds) are modified by the offsets and then ensured to fit into the [1; 86400] interval. In the case of a non-full day (number of jobs \(k<D\)) the same procedure applies, but each arrival time has to fit into the \([1; \lfloor \frac{86400k}{D}\rfloor ]\) interval. This mechanism ensures that the generation of non-full days has the expected result. For example, with \(N=1000\) and \(D=2000\) all resulting jobs will have arrival times of 43200 or less, simulating the first 12 h of a daily workload. Finally, for each job, its number of operations is drawn from the empirical distribution constructed from the collected system data. Similarly, the processing time and type of each operation are drawn from the appropriate empirical distributions.
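A simplified version of this generation procedure can be sketched as follows. The distribution parameters below are illustrative placeholders (the fitted values from Table 1 are not reproduced here), and clipping into the interval is our interpretation of "ensured to fit":

```python
import math
import random

# Illustrative W_S parameters only -- NOT the fitted values from Table 1.
K, L1, O1, L2, O2, ALPHA = 2.0, 14000.0, 25000.0, 9000.0, 55000.0, 0.9208

def draw_arrival(limit):
    """Draw one in-day arrival time from the W_S mixture and clip it
    into the interval [1, limit]."""
    if random.random() < ALPHA:              # big peak
        lam, off = L1, O1
    else:                                    # small peak
        lam, off = L2, O2
    # inverse-transform sampling of a Weibull variate, plus the offset
    x = lam * (-math.log(1.0 - random.random())) ** (1.0 / K) + off
    return min(max(int(x), 1), limit)

def generate_instance(n_jobs, jobs_per_day):
    """Generate sorted arrival times for `n_jobs` jobs at a density of
    `jobs_per_day`; a trailing partial day gets a proportionally
    shorter time window."""
    arrivals, day = [], 0
    while n_jobs > 0:
        k = min(n_jobs, jobs_per_day)
        limit = 86400 * k // jobs_per_day    # 86400 on full days
        times = sorted(draw_arrival(limit) for _ in range(k))
        arrivals.extend(day * 86400 + t for t in times)
        n_jobs -= k
        day += 1
    return arrivals
```

The standard library's `random.weibullvariate(scale, shape)` could replace the inverse-transform line; it is written out here to make the sampling explicit.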

Ideally, the number of jobs for each day would be drawn from another distribution, but such a distribution is difficult to estimate from the system data. This is because the company is expanding and the number of jobs processed each day has been steadily increasing over time. Thus, further research or system analysis is required to estimate the distribution of the number of jobs processed daily. It should be noted, however, that the current approach still allows one to generate problem instances matching any period in the company's history, provided that a correct value of the parameter D is chosen.

4 Algorithms Description

In this section we briefly describe the algorithms we considered for scheduling software tests on the cloud. Because of the system size (approximately \(240\,000\) operations scheduled daily) and the short time available for scheduling (over 5 operations arriving each second during peak activity), applying more advanced metaheuristic algorithms is difficult. Because of that, simple priority-rule-based constructive algorithms are an attractive alternative due to their low time complexity. Such algorithms were previously tested for a multi-criteria version of a TaaS system, though the tested problem instances were very limited in size [8].

All implemented algorithms are based on the concept of a queue of operations – each algorithm has to order this queue to decide which operations should be scheduled on the cloud first. Below is a brief description of the implemented algorithms:

  1.

    First In First Out (FIFO) – no reordering happens: jobs are scheduled in the order of their arrival and operations within a job are scheduled in the order they appear in the job. Thus, this algorithm is the quickest, with a computational complexity of O(1).

  2.

    Shortest Operation First (SOF) – operations in the waiting queue are sorted in increasing order of their processing times, so shorter operations are scheduled first. The job to which an operation belongs does not matter.

  3.

    Longest Operation First (LOF) – similar to SOF, but the sorting order is reversed, so longer operations are scheduled first.

  4.

    Smallest Job Longest Operation First (SJLOF) – operations in the waiting queue are sorted, first according to the size of the job they belong to (with smaller jobs going first) and then according to the processing times of operations within the same job (with longer operations going first). The size of a job is understood as the total processing time of all unstarted operations of that job.

  5.

    Largest Job Shortest Operation First (LJSOF) – similar to SJLOF, but sorting rules for jobs and operations are reversed.

  6.

    Max – this algorithm first selects one of the jobs from the waiting queue (i.e. one of the jobs that has unstarted operations remaining). The jobs are selected cyclically, in a deterministic round-robin order. The unstarted operation of the selected job with the longest processing time is then scheduled. This algorithm is based on the one currently employed in the real-life TaaS system considered in this paper.

  7.

    Min – similar to Max, but the shortest unstarted operation of the selected job is scheduled each time a job is selected.

  8.

    Random – operations in the queue are ordered randomly. This is the only non-deterministic algorithm.

It should be noted that some of the algorithms have counterparts not listed above. These include the SJSOF and LJLOF algorithms. Moreover, Earliest Longest Operation First (ELOF) and Earliest Shortest Operation First (ESOF) are two algorithms similar to FIFO. Those four algorithms performed similarly to their counterparts listed above, thus the results presented in Sect. 5 only include the better or faster of each pair (meaning the FIFO, SJLOF and LJSOF algorithms).
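The queue-ordering rules above reduce to sort keys; for example, the SJLOF rule can be sketched as follows (the data layout is our assumption):

```python
def sjlof_order(queue, remaining_size):
    """Order the waiting queue according to the SJLOF rule.

    `queue` is a list of (job, processing_time) pairs and
    `remaining_size[job]` is the total processing time of the job's
    unstarted operations.  Smaller jobs go first; within a job,
    longer operations go first (hence the negated processing time)."""
    return sorted(queue, key=lambda op: (remaining_size[op[0]], -op[1]))
```

SOF, LOF and LJSOF follow the same pattern with different sort keys.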

5 Computer Experiment

A computer experiment was performed to test the effectiveness of the algorithms described in Sect. 4. Such tests are usually difficult for two reasons. First, the considered instances are large. With \(2\,000\) jobs per day, during rush hours (9 a.m. to 3 p.m.) 2.5 jobs arrive each minute on average, so a 10-min instance would contain around \(3\,000\) operations. Computing optimal values for such instances is already difficult. Thus, it makes more sense to compare the algorithms to each other by normalizing their results to the best among them for a given instance.

The second problem is the number of possible problem instances and their variety – algorithms may perform very differently on highly unusual instances. However, in the case of the considered TaaS system, we know that the problem instances follow certain probability distributions and deviations are unlikely to happen. Even if data deviating from those distributions were to appear, it would affect the system only temporarily, letting it return to normal during low-activity hours. As a result, the algorithm comparison for the considered system can be done more reliably, using instances created by the instance generator.

For testing purposes, a number of problem instances were generated using the generator described in Sect. 3. The considered problem sizes were 333 (simulating the early morning, 4 h), \(1\,000\) (half a day), \(2\,000\) (a full day) and \(4\,000\) (two full days) jobs. For each problem size 50 instances were generated, for 200 instances in total. For each instance four different numbers of machines were considered: 100, 200, 400 and 800. This resulted in 800 actual problem instances. For each such instance 8 algorithms were run (for the Random algorithm the presented results are an average of 10 runs), resulting in 8 numbers. Let \(R_A(i,n,m)\) be the result of running algorithm A on the i-th instance with n jobs and m machines. Also let \(R_{\mathrm {best}}(i,n,m)\) be the result of the best (i.e. lowest average job flowtime) algorithm for the i-th instance with n jobs and m machines. Then:

$$\begin{aligned} R'_A(i,n,m) = \frac{R_A(i,n,m)}{R_{\mathrm {best}}(i,n,m)}, \end{aligned}$$
(10)

is the normalized value of \(R_A(i,n,m)\). The results—averages of the values of \(R'\) over various instance categories—are shown in Table 2. The most important values are highlighted in bold.
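The normalization of Eq. (10) for a single instance can be sketched as:

```python
def normalize(results):
    """Normalize algorithm results for one instance (Eq. 10).

    `results` maps algorithm name -> average job flowtime; the best
    (lowest) result maps to exactly 1.0."""
    best = min(results.values())
    return {alg: r / best for alg, r in results.items()}
```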

Fig. 3. Boxplots for all instances with a given number of machines

Table 2. Overall results of the algorithm comparison. (n, m) means the average was taken over all instances with n jobs and m machines. A star means that instances with all possible values were used
Fig. 4. Boxplots for all instances with a given number of jobs

The first thing to note is that there are only two values of 1.0 in the table. This means that, in general, no algorithm is superior in all cases – almost always there is at least one instance on which a given algorithm fails to outmatch all others. When all instances are considered (first row, marked as \((*,*)\)), the SJLOF algorithm performs best, being over \(25\%\) better than the second best algorithm overall – MIN. When the number of machines is increased, all algorithms perform better and the differences between them decrease. At 400 and 800 machines the MAX algorithm outperforms SJLOF. However, the difference is very small (around 0.6–0.8%) and by this point all algorithms have become similar; for 400 and 800 machines the FIFO and RANDOM algorithms are also only 0.8% away from the best algorithm on average. Let us note that this is the only situation where the average of any algorithm outperformed SJLOF. For lower numbers of machines SJLOF starts to greatly outperform the other tested algorithms. When the number of jobs is considered, the SJLOF algorithm outperforms the other algorithms, being 16% to 37% better than the second best MIN algorithm, and that advantage only increases with the number of jobs. We conclude that, as far as average results are concerned, SJLOF performs best and is the only algorithm that stays near the top in all situations.

The averaged results are not fully conclusive, thus we also present boxplots in Figs. 3 and 4. Each box represents the interquartile range, with the line inside the box representing the median. The whiskers span from the 5th to the 95th percentile; values outside the whiskers are indicated with dots. When the number of machines is high enough (Fig. 3a), all algorithms are rarely more than 3 or 4% away from the best one (MAX in this case). Still, the MAX and SJLOF algorithms clearly outperform the rest. More interesting is the case when the number of machines is small (Fig. 3b). We see that all algorithms besides SJLOF and MIN can perform around 5 to even 50 times worse than the best algorithm in some cases. Moreover, those algorithms very rarely manage to get within a factor of 3 of the best algorithm. In short, the SJLOF algorithm decidedly outperforms the other algorithms when the number of machines is low.

When the number of jobs is low or the system activity is low (Fig. 4a), the SJLOF algorithm is still the best, while the other algorithms perform worse (from 1 to over 20 times worse in some cases). However, the medians of all algorithms are similar to the performance of SJLOF. When the number of jobs increases (Fig. 4b), this effect is amplified – the other algorithms now perform even 50 times worse at times. Moreover, the medians of the other algorithms are now around 2 to 8 times higher than that of the SJLOF algorithm. These results further confirm that SJLOF is the best algorithm in almost all situations. Its closest competitors – the MIN and MAX algorithms – are less reliable.

6 Conclusions

In this paper we presented a model for the problem of scheduling software tests on a Testing-as-a-Service cloud system. The problem was modeled as a variant of the online scheduling problem with the goal of minimizing the flowtime of test suites. A uni-shaped bimodal Weibull distribution was used to model the arrival times of test suites in the system, and empirical distributions were used to model other parameters of the system (like test processing times). An instance generator that can be used to easily generate problem instances with a desired size and suites-per-day density was also described.

A computer experiment on 800 instances with various numbers of jobs and machines was used to test the effectiveness of a number of constructive algorithms. The presented results indicate that the SJLOF algorithm is the most effective in all situations except when the number of machines is high. However, for a high number of machines all algorithms perform similarly and the SJLOF algorithm remains competitive. Thus, the SJLOF algorithm could be used to reduce the cloud resources required for software testing (thus saving money or freeing the resources for other tasks), while minimizing the increase in the flowtime of test suites.