Abstract
Hyper-heuristics constitute a methodology for the adaptive hybridization of meta-heuristic algorithms to derive a general algorithm for solving optimization problems. This work focuses on the selection type of hyper-heuristic, called the exponential Monte Carlo with counter (EMCQ). Current implementations rely on memoryless selection, which can be counterproductive as the selected search operator may not (historically) be the best performing operator for the current search instance. Addressing this issue, we propose to integrate memory into EMCQ for combinatorial t-wise test suite generation using reinforcement learning based on the Q-learning mechanism, called Q-EMCQ. The limited application of combinatorial test generation to industrial programs can hinder the adoption of techniques such as Q-EMCQ. Thus, there is a need to evaluate this kind of approach against relevant industrial software, with the purpose of showing the degree of interaction required to cover the code as well as to find faults. We applied Q-EMCQ to 37 real-world industrial programs written in the Function Block Diagram (FBD) language, which is used for developing a train control management system at Bombardier Transportation Sweden AB. The results show that Q-EMCQ is an efficient technique for test case generation. Additionally, unlike t-wise test suite generation, which deals with a minimization problem, we have also subjected Q-EMCQ to a maximization problem involving general module clustering to demonstrate the effectiveness of our approach. The results show that Q-EMCQ is also capable of outperforming the original EMCQ as well as several recent meta/hyper-heuristics, including the modified choice function, Tabu high-level hyper-heuristic, teaching learning-based optimization, the sine cosine algorithm, and symbiotic organisms search, in clustering quality within comparable execution time.
Introduction
Despite their considerable success, meta-heuristic algorithms must typically be adapted to solve specific problems based on some domain knowledge. Some examples of recent meta-heuristic algorithms include the sooty tern optimization algorithm (STOA) (Dhiman and Kaur 2019), the farmland fertility algorithm (FF) (Shayanfar and Gharehchopogh 2018), the owl search algorithm (OSA) (Jain et al. 2018), human mental search (HMS) (Mousavirad and Ebrahimpour-Komleh 2017), and find-fix-finish-exploit-analyze (F3EA) (Kashan et al. 2019). Often, these algorithms require significant expertise to implement and tune; hence, their standard versions are not sufficiently generic to adapt to changing search spaces, even for different instances of the same problem. Apart from this need to adapt, the existing research on meta-heuristic algorithms has also not sufficiently explored the adoption of more than one meta-heuristic to perform the search (termed hybridization). Specifically, the exploration and exploitation of the existing algorithms are limited to the (local and global) search operators derived from a single meta-heuristic algorithm. In this case, choosing a proper combination of search operators can be the key to achieving good performance, as hybridization can capitalize on the strengths and address the deficiencies of each algorithm collectively and synergistically.
Hyper-heuristics have recently received considerable attention as a means of addressing some of the above issues (Tsai et al. 2014; Sabar and Kendall 2015). Specifically, a hyper-heuristic is an approach that uses (meta-)heuristics to choose (meta-)heuristics to solve the optimization problem at hand (Burke et al. 2003). Unlike traditional meta-heuristics, which operate directly on the solution space, hyper-heuristics offer flexible integration and adaptive manipulation of complete (low-level) meta-heuristics, or merely the partial adoption of a particular meta-heuristic's search operator, through non-domain feedback. In this manner, a hyper-heuristic can evolve its heuristic selection and acceptance mechanism while searching for a good-quality solution.
This work focuses on a specific type of hyper-heuristic algorithm, called the exponential Monte Carlo with counter (EMCQ) (Sabar and Kendall 2015; Kendall et al. 2014). EMCQ adopts a simulated annealing-like (Kirkpatrick et al. 1983) reward and punishment mechanism to adaptively choose the search operator during runtime from a set of available operators. To be specific, EMCQ rewards a good performing search operator by allowing its reselection in the next iteration. Based on a decreasing probability, EMCQ also rewards (and penalizes) a poor performing search operator in order to escape from local optima. In the current implementation, when a poor search operator is penalized, it is put in the Tabu list, and EMCQ chooses a new search operator randomly from the available search operators. Such memoryless selection can be counterproductive, as the selected search operator may not (historically) be the best performing operator for the current search instance. For this reason, we propose to integrate memory into EMCQ using reinforcement learning based on the Q-learning mechanism, called Q-EMCQ.
We have adopted Q-EMCQ for combinatorial interaction t-wise test generation (where t indicates the interaction strength). While there is already significant work on adopting hyper-heuristics as a suitable method for t-wise test suite generation (see, e.g., Zamli et al. 2016, 2017), the main focus has been on the generation of minimal test suites. It is worth mentioning here that our main focus is not to introduce new bounds for the t-wise generated test suites. Rather, we dedicate our efforts to assessing the effectiveness and efficiency of the generated t-wise test suites against real-world programs used in industrial practice. Our goal is to push toward the industrial adoption of t-wise testing, which is lacking in numerous studies on the subject. We nevertheless compare the performance of Q-EMCQ against well-known benchmarks using several strategies, to establish the viability of Q-EMCQ for further empirical evaluation using industrial programs. In the empirical evaluation part of this paper, we rigorously evaluate the effectiveness and efficiency of Q-EMCQ for different degrees of interaction strength using real-world industrial control software used for developing the train control management system at Bombardier Transportation Sweden AB. To demonstrate the generality of Q-EMCQ, we have also subjected it to a maximization problem involving general module clustering. Q-EMCQ gives the best overall performance on clustering quality within comparable execution time as compared to competing hyper-heuristics (MCF and Tabu HHH) and meta-heuristics (EMCQ, TLBO, SCA, and SOS). Summing up, this paper makes the following contributions:

1. A novel Q-EMCQ hyper-heuristic technique that embeds the Q-learning mechanism into EMCQ, providing a memory of the performance of each search operator for selection. The implementation of Q-EMCQ establishes a unified strategy for the integration and hybridization of the Monte Carlo-based exponential Metropolis probability function for meta-heuristic selection and acceptance with four low-level search operators, consisting of the cuckoo’s Lévy flight perturbation operator (Yang and Deb 2009), the flower algorithm’s local pollination and global pollination operators (Yang 2012), as well as Jaya’s search operator (Rao 2016).

2. An industrial case study evaluating t-wise test suite generation in terms of cost (i.e., using a comparison of the number of test cases) and effectiveness (i.e., using mutation analysis).

3. A performance assessment of Q-EMCQ against contemporary meta-/hyper-heuristics for a maximization problem involving the general module clustering problem.
Theoretical Background and an Illustrative Example
A covering array (CA) is a mathematical object representing the actual set of test cases based on a t-wise coverage criterion (where t represents the desired interaction strength). CA \((N; t, k, v)\), also expressed as CA \((N; t, v^k)\), is a combinatorial structure constructed as an array of N rows and k columns on v values such that every \(N \times t\) sub-array contains all ordered subsets of size t from the v values at least once. A mixed covering array, MCA \((N; t, k, (v_1, v_2,\ldots, v_k))\) or MCA \((N; t, k, v^k)\), may be adopted when the number of component values varies.
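As an illustration of this coverage requirement, the following sketch (our own code, not from the paper; all names are ours) checks whether a candidate array covers every t-wise value combination of a given set of parameters:

```python
from itertools import combinations, product

def covers_t_wise(array, levels, t):
    """Check whether `array` (a list of test cases) covers every t-wise
    interaction of the parameters described by `levels` (number of values
    per column). Illustrative helper, not the paper's implementation."""
    for cols in combinations(range(len(levels)), t):
        # All value tuples this column combination must exhibit at least once.
        required = set(product(*(range(levels[c]) for c in cols)))
        seen = {tuple(row[c] for c in cols) for row in array}
        if required - seen:
            return False
    return True

# The exhaustive 2^3 array trivially covers all 3-wise interactions
# of three binary parameters; dropping any row breaks the coverage.
full = [list(bits) for bits in product([0, 1], repeat=3)]
print(covers_t_wise(full, [2, 2, 2], 3))       # True
print(covers_t_wise(full[:-1], [2, 2, 2], 3))  # False
```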
To illustrate the use of CA for twise testing, consider a hypothetical example of an integrated manufacturing system in Fig. 1. There are four basic elements/parameters of the system, i.e., Camera, Robotic Interface, Sensor, and Network Cables. The camera parameter takes three possible values (i.e., Camera = \(\lbrace \)High Resolution, Web Cam, and CCTV\(\rbrace \)), whereas the rest of the parameters take two possible values (i.e., Robotic Interface = \(\lbrace \)USB, HDMI\(\rbrace \), Sensor = \(\lbrace \)Thermometer, Heat Sensor\(\rbrace \), and Network Cables = \(\lbrace \)UTP, Fiber Optics\(\rbrace \)).
As an example, the mixed CA representation for MCA \((N; 3, 3^1 2^3)\) is shown in Fig. 2 with twelve test cases. In this case, there is a 50% reduction in test cases from the 24 exhaustive possibilities (i.e., \(3 \times 2 \times 2 \times 2 = 24\)).
Related Work
In this section, we present previous work on combinatorial t-wise test generation and on the evaluation of such techniques in terms of efficiency and effectiveness.
Combinatorial t-wise test suite generators
CA construction is an NP-complete problem (Lei and Tai 1998). CA construction applies directly to t-wise test case reduction; thus, considerable research has been carried out to develop effective strategies for obtaining (near) optimal solutions. The existing works on CA generation can be classified into two main approaches: mathematical and greedy computational approaches. The mathematical approach often exploits the properties of orthogonal arrays (OA) to construct efficient CAs (Mandl 1985). An example of a strategy originating from the extension of the orthogonal array concept is recursive CA (Colbourn et al. 2006). The main limitation of the OA solutions is that they restrict the selection of values and are confined to low interaction strengths (i.e., \(t < 3\)), limiting their applicability to small-scale system configurations. Greedy computational approaches exploit computing power to generate the required CA, such that each solution results from the greedy selection of the required interactions. The greedy computational approaches can be categorized further into one-parameter-at-a-time (OPAT) and one-test-at-a-time (OTAT) methods (Nie and Leung 2011). The in-parameter-order (IPO) strategy (Lei and Tai 1998) is perhaps the pioneering strategy adopting the OPAT approach (hence termed IPO-like). The IPO strategy was later generalized into a number of variants: IPOG (Lei et al. 2007), IPOG-D (Lei et al. 2008), IPO-F (Forbes et al. 2008), and IPO-s (Calvagna and Gargantini 2009). AETG (Cohen et al. 1997) is the first CA construction strategy that adopts the OTAT method (hence termed AETG-like (Williams and Probert 1996)). Many variants of AETG emerged later, including mAETG (Cohen 2004) and mAETG\_SAT (Cohen et al. 2007).
One can find two recent trends in research on combinatorial interaction testing: the handling of constraints (Ahmed et al. 2017) and the application of meta-heuristic algorithms. Many current studies focus on the use of meta-heuristic algorithms as part of the greedy computational approach for CA construction (Mahmoud and Ahmed 2015; Wu et al. 2015; Ahmed et al. 2012). Meta-heuristic-based strategies, which complement both the OPAT and OTAT methods, are often superior in terms of obtaining an optimal CA size, but trade-offs regarding computational costs may exist. Meta-heuristic-based strategies often start with a population of random solutions. One or more search operators are iteratively applied to the population to improve the overall fitness (i.e., regarding greedily covering the interaction combinations). Although variations are numerous, the main difference between meta-heuristic strategies lies in the defined search operators. Meta-heuristics such as the genetic algorithm (GA) (Shiba et al. 2004), ant colony optimization (ACO) (Chen et al. 2009), simulated annealing (SA) (Cohen et al. 2007), particle swarm optimization (e.g., PSTG (Ahmed et al. 2012) and DPSO (Wu et al. 2015)), and the cuckoo search algorithm (CS) (Ahmed et al. 2015) have been used effectively for CA construction.
In line with the development of meta-heuristic algorithms, there is substantial room to advance the field of search-based software engineering (SBSE) through the hybridization of two or more algorithms. Each algorithm usually has its advantages and disadvantages. With hybridization, each algorithm can exploit the strengths and cover the weaknesses of the collaborating algorithms (i.e., either partly or in full). Many recent scientific results indicate that hybridization improves the performance of meta-heuristic algorithms (Sabar and Kendall 2015).
Owing to their ability to accommodate two or more search operators from different meta-heuristics (partly or in full) through one defined parent heuristic (Burke et al. 2013), hyper-heuristics can be seen as an elegant way to support hybridization. To be specific, the selection of a particular search operator at any particular instance can be adaptively decided (by the parent heuristic) based on the feedback from its previous performance (i.e., learning).
In general, hyper-heuristics can be categorized as either selective or generative (Burke et al. 2010). Ideally, a selective hyper-heuristic selects the appropriate heuristics from a pool of possible heuristics, whereas a generative hyper-heuristic generates new heuristics from existing ones. Selective and generative hyper-heuristics can be further categorized as either constructive or perturbative. A constructive hyper-heuristic gradually builds a particular solution from scratch, while a perturbative hyper-heuristic iteratively improves an existing solution by relying on its perturbative mechanisms.
In a hyper-heuristic, there is a need to maintain a “domain barrier” that controls and filters out domain-specific information from the hyper-heuristic itself (Burke et al. 2013). In other words, the domain barrier ensures the generality of the hyper-heuristic approach.
Concerning related work on CA construction, Zamli et al. (2016) implemented a Tabu search hyper-heuristic (Tabu HHH), utilizing a selection mechanism based on Tabu search and three measures (quality, diversity, and intensity) to assist the heuristic selection process. Although showing promising results, Tabu HHH adopted full meta-heuristic algorithms (i.e., teaching learning-based optimization (TLBO) (Rao et al. 2011), particle swarm optimization (PSO) (Kennedy and Eberhart 1995), and the cuckoo search algorithm (CS) (Yang and Deb 2009)) as its search operators. Using the three measures in HHH, Zamli et al. (2017) later introduced a Mamdani fuzzy-based hyper-heuristic that can accommodate partial truth, hence allowing a smoother transition between the search operators. In other work, Jia et al. (2015) implemented a simulated annealing-based hyper-heuristic called HHSA to select from variants of six operators (i.e., single/multiple/smart mutation and simple/smart add and delete row). HHSA demonstrates good performance regarding test suite size and exhibits elements of learning in the selection of the search operator.
Complementing HHSA, we propose Q-EMCQ as another SA-based variant. Unlike HHSA, we integrate the Q-learning mechanism to provide a memory of the performance of each search operator for selection. The Q-learning mechanism complements the Monte Carlo-based exponential Metropolis probability function by keeping track of the best performing operators for selection when the current fitness is poor. Also, unlike HHSA, which deals only with CA (with constraints) construction, our work also focuses on MCA.
Case studies on combinatorial t-wise interaction test generation
The number of successful applications of combinatorial interaction testing in the literature is expanding. Several studies (Kuhn and Okum 2006; Richard Kuhn et al. 2004; Bell and Vouk 2005; Wallace and Richard Kuhn 2001; Charbachi et al. 2017; Bergström and Enoiu 2017; Sampath and Bryce 2012) focus on the fault and failure detection capabilities of these techniques for different industrial systems. However, there is still a lack of industrial application of combinatorial interaction testing strategies.
Some case studies concerning combinatorial testing have focused on comparing different strengths of combinatorial criteria (Grindal et al. 2006) with random tests (Ghandehari et al. 2014; Schroeder et al. 2004) and on the coverage achieved by such test cases. For example, Cohen et al. (1996) found that pairwise generated tests can achieve 90% code coverage using the AETG tool. Other studies (Cohen et al. 1994; Dalal et al. 1998; Sampath and Bryce 2012) have reported on the use of combinatorial testing on real-world systems and how it can help in the detection of faults when compared to other test design techniques.
Few papers examine the effectiveness (i.e., the ability of test cases to detect faults) of combinatorial tests of different t-wise strengths and how these strategies compare with each other. There is some empirical evidence suggesting that, across a variety of domains, all failures could be triggered by a maximum of four-way interactions (Kuhn and Okum 2006; Richard Kuhn et al. 2004; Bell and Vouk 2005; Wallace and Richard Kuhn 2001). In one such case, single parameters caused 67% of failures, two-way combinations 93%, and three-way combinations 98%. The detection rate in other studies is similar, reaching 100% fault detection with the use of four-way interactions. These results encouraged our interest in conducting a larger case study on how Q-EMCQ and different interaction strengths perform in terms of test efficiency and effectiveness for industrial software systems, and in studying the degree of interaction involved in detecting faults for such programs.
Overview of the proposed strategy
The high-level view of the Q-EMCQ strategy is illustrated in Fig. 3. The main components of Q-EMCQ consist of the algorithm (along with its selection and acceptance mechanism) and the defined search operators. Referring to Fig. 3, Q-EMCQ chooses the search operator much like a multiplexer, via a search operator connector, based on the memory of each operator's previous performance (i.e., penalties and rewards). However, it should be noted that the Q-learning mechanism is only invoked when there was no improvement in the prior iteration. The detailed working of Q-EMCQ is described in the next subsections.
Q-learning Monte Carlo hyper-heuristic strategy
The exponential Monte Carlo with counter (EMCQ) algorithm from Ayob and Kendall (2003) and Kendall et al. (2014) has been adopted in this work as the basis of the Q-EMCQ selection and acceptance mechanism. The EMCQ algorithm accepts a poor solution with a certain probability (similar to simulated annealing (Kirkpatrick et al. 1983)); the probability density is defined as \(\varPsi = e^{-\delta T/q}\) (Eq. 1), where \(\delta \) is the difference in fitness value between the current solution (\(S_{i}\)) and the previous solution (\(S_{0}\)) (i.e., \(\delta =f(S_{i})-f(S_{0})\)), T is the iteration counter, and q is a control parameter counting consecutive non-improving iterations.
Similar to simulated annealing, the probability density \(\varPsi \) decreases toward zero as T increases. However, unlike simulated annealing, EMCQ does not use a specific cooling schedule; hence, no schedule-specific parameters need to be tuned. Another notable feature is that EMCQ allows dynamic manipulation of its q parameter to increase or decrease the probability of accepting poor moves. q is always incremented upon a poor move and reset to 1 upon a good move, to enhance the diversification of the solution.
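The acceptance rule described above can be sketched as follows (a minimal illustration of ours, assuming minimization and the acceptance probability \(e^{-\delta T/q}\) implied by the description; function and parameter names are ours):

```python
import math
import random

def emcq_accept(delta, T, q, rng=random.random):
    """EMCQ-style acceptance for minimization: always accept an improving
    move (delta <= 0); accept a worsening move with probability
    psi = exp(-delta * T / q), which shrinks as T grows and grows with q."""
    if delta <= 0:
        return True
    return rng() < math.exp(-delta * T / q)

# An improving move is always accepted; a strongly worsening move late in
# the search (large T, q reset to 1) is essentially never accepted.
print(emcq_accept(-0.5, T=10, q=1))                     # True
print(emcq_accept(5.0, T=1000, q=1, rng=lambda: 0.99))  # False
```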
Although adopting the same cooling schedule as EMCQ, Q-EMCQ has a different reward and punishment mechanism. For EMCQ, the reward is based solely on the previous performance (although sometimes a poor performing operator may also be rewarded with some probability). Unlike EMCQ, when a poor search operator is penalized, Q-EMCQ chooses the historically best performing operator for the next search instance instead of choosing randomly from the available search operators.
Q-learning is a technique for Markov decision processes that relies on current and forward-looking Q-values. It provides a reward and punishment mechanism (Christopher 1992) that dynamically keeps track of the best performing operator via online reinforcement learning. To be specific, Q-learning learns the optimal selection policy through its interaction with the environment. Q-learning works by estimating the best state-action pair through the manipulation of a memory-based Q(s, a) table. The Q(s, a) table uses a state-action pair to index a Q-value (i.e., a cumulative reward), and is updated dynamically based on the reward and punishment (r) from a particular state-action pair.
Let \(S=[s_{1},s_{2},\ldots ,s_{n}]\) be a set of states, \(A=[a_{1},a_{2},\ldots ,a_{n}]\) be a set of actions, \(\alpha _{t}\) be the learning rate within [0, 1], \(\gamma \) be the discount factor within [0, 1], and \(r_{t}\) be the immediate reward/punishment acquired from executing action a. Then \(Q(s_{t},a_{t})\), the cumulative reward at time t, can be computed as \(Q_{t+1}(s_{t},a_{t}) = Q_{t}(s_{t},a_{t}) + \alpha _{t}\,[r_{t} + \gamma \max _{a} Q_{t}(s_{t+1},a) - Q_{t}(s_{t},a_{t})]\) (Eq. 2).
The settings for \(\alpha _{t}\), \(\gamma \), and \(r_{t}\) need further clarification. When \(\alpha _{t}\) is close to 1, higher priority is given to the newly gained information in the Q-table updates. On the contrary, a small value of \(\alpha _{t}\) gives higher priority to the existing information. To facilitate exploration of the search space (to maximize learning from the environment), the value of \(\alpha _{t}\) can be set to a high value during early iterations and adaptively reduced toward the end of the iteration (to exploit the existing best-known Q-value).
The parameter \(\gamma \) works as a scaling factor for rewarding or punishing the Q-value based on the current action. When \(\gamma \) is close to 0, the Q-value is based on the current reward/punishment only. When \(\gamma \) is close to 1, the Q-value is based on both the current and the previous reward/punishment. It is suggested to set \(\gamma = 0.8\) (Samma et al. 2016).
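Under these settings, one Q-table update can be sketched as below (our own illustrative code; the operator names stand in for the four low-level operators, and the update follows the standard Q-learning rule with \(\alpha_t\), \(\gamma\), and \(r_t\) as described above):

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Standard Q-learning update:
    Q[s][a] += alpha * (r + gamma * max_a' Q[s_next][a'] - Q[s][a])."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

ops = ["levy", "local_pollination", "global_pollination", "jaya"]
Q = {s: {a: 1.0 for a in ops} for s in ops}
Q["levy"]["local_pollination"] = 1.25

# Punishing the pair (levy -> local_pollination) with r = -1,
# alpha = 0.7, gamma = 0.1, and max next Q-value 1.0:
new_q = q_update(Q, "levy", "local_pollination", -1.0, "local_pollination", 0.7, 0.1)
print(round(new_q, 3))  # -0.255
```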
The parameter \(r_t\) serves as the actual reward or punishment value. In our current work, \(r_t\) is set to 1.00 as a reward for an improving move and \(-1.00\) as a punishment for a non-improving move (as in the worked examples of Fig. 4).
Based on the discussion above, Algorithm 1 highlights the pseudocode of Q-EMCQ.
Q-EMCQ involves three main steps, denoted as Steps A, B, and C. Step A deals with the initialization of variables. Line 1 initializes the population of required t-wise interactions, \(I=\{I_{1},I_{2},\ldots ,I_{M}\}\). The value of M depends on the given input interaction strength (t), the parameters (k), and their corresponding values (v). M is the number of interaction tuples that must be covered in the constructed CA, and can be obtained mathematically as the sum, over all t-way parameter combinations, of the products of the respective parameters' value counts. For example, for \(CA(9;2,3^{4})\), M takes the value of \(3\times 3+3\times 3+3\times 3+3\times 3+3\times 3+3\times 3=54\). If \(MCA(9;2,3^{2}2^{2})\) is considered, then M takes the value of \(3\times 3+3\times 2+3\times 2+3\times 2+3\times 2+2\times 2=37\). Line 2 defines the maximum iteration \(\varTheta _{max}\) and the population size N. Line 3 randomly initializes the initial population of solutions \(X=\{X_{1},X_{2},\ldots ,X_{N}\}\). Line 4 defines the pool of search operators. Lines 6–14 explore the search space for one complete episode cycle to initialize the Q-table.
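The computation of M described above can be sketched as follows (illustrative code, ours):

```python
from itertools import combinations
from math import prod

def required_interactions(levels, t):
    """Number M of t-wise value combinations a covering array must hit:
    the sum over every t-subset of parameters of the product of their
    value counts."""
    return sum(prod(c) for c in combinations(levels, t))

print(required_interactions([3, 3, 3, 3], 2))  # 54, as for CA(9; 2, 3^4)
print(required_interactions([3, 3, 2, 2], 2))  # 37, as for MCA(9; 2, 3^2 2^2)
```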
Step B deals with the Q-EMCQ selection and acceptance mechanism. The main loop starts in line 15 with \(\varTheta _{\hbox {max}}\) as the maximum number of iterations. The selected search operator is executed in line 17. The Q-table is updated according to the quality/performance of the current state-action pairs (lines 18–24). As in EMCQ, the Monte Carlo Metropolis probability controls the selection of search operators when the quality of the solution improves (lines 25–30). This probability decreases with the iteration counter (T). However, it may also increase, as q can be reset to 1 (in the case of reselection of any particular search operator (lines 29 and 34)). When the quality does not improve, Q-learning gets a chance to explore the search space in one complete episode cycle (line 33) to complete the Q-table entries. As an illustration, Fig. 4 depicts a snapshot of one entire Q-table cycle for Q-EMCQ along with a numerical example.
Referring to episode 1 in Fig. 4, assume that the initial settings are as follows: the current state \(s_{t}\) = Lévy flight perturbation operator, the next action \(a_{t}\) = local pollination operator, the current value stored in the Q-table for the current state \(Q_{t}(s_{t},a_{t})=1.25\) (i.e., the grayed cell), the punishment \(r_{t}=-1.00\), the discount factor \(\gamma =0.10\), and the current learning factor \(\alpha _{t}=0.70\). Then, the new value \(Q_{t+1}(s_{t},a_{t})\) in the Q-table is updated based on Eq. 2.
Concerning episode 2 in Fig. 4, the current settings are as follows: the current state \(s_{t}\) = local pollination operator, the next action \(a_{t}\) = global pollination operator, the current value stored in the Q-table for the current state \(Q_{t}(s_{t},a_{t})=1.00\) (i.e., the grayed cell), the punishment \(r_{t}=-1.00\), the discount factor \(\gamma =0.10\), and the current learning factor \(\alpha _{t}=0.70\). Then, the new value \(Q_{t+1}(s_{t},a_{t})\) in the Q-table is updated based on Eq. 2.
Considering episode 3 in Fig. 4, the current settings are as follows: the current state \(s_{t}\) = global pollination operator, the next action \(a_{t}\) = Jaya operator, the current value stored in the Q-table for the current state \(Q_{t}(s_{t},a_{t})=1.00\) (i.e., the grayed cell), the reward \(r_{t}=1.00\), the discount factor \(\gamma =0.10\), and the current learning factor \(\alpha _{t}=0.70\). Then, the new value \(Q_{t+1}(s_{t},a_{t})\) in the Q-table is updated based on Eq. 2.
The complete exploration cycle for updating Q-values ends in episode 4, as the next action \(a_{t}=s_{t+1}=\) Lévy flight perturbation operator. It must be noted that throughout the Q-table updates, the Q-EMCQ search process is also working in the background (i.e., for each update, \(X_{best}\) is kept and the population X is updated accordingly).
A complete cycle update is not always necessary, especially during convergence. Lines 38–39 depict the selection of the search operator as the next action (\(a_{t}\)) (i.e., among the Lévy flight perturbation operator, the local pollination operator, the global pollination operator, and the Jaya operator) based on the maximum reward defined in the state-action pair memory within the Q-table (unlike EMCQ, where the selection process is random).
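This maximum-reward selection amounts to an argmax over the Q-table row of the current state, e.g. (illustrative sketch of ours, not the paper's code):

```python
def select_next_operator(Q, state):
    """Greedy selection: pick the action with the highest learned Q-value
    for `state`, mirroring the maximum-reward selection described above
    (ties broken by insertion order)."""
    return max(Q[state], key=Q[state].get)

Q = {"levy": {"levy": 0.2, "local_pollination": 1.1,
              "global_pollination": 0.7, "jaya": 0.9}}
print(select_next_operator(Q, "levy"))  # local_pollination
```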
Complementing the earlier steps, Step C deals with termination and closure. Upon the completion of the main \(\varTheta _{\hbox {max}}\) loop, the best solution \(S_{best}\) is added to the final CA (line 39). If uncovered t-wise interactions exist, Step B is repeated until termination (line 41).
Cuckoo’s Lévy flight perturbation operator
The cuckoo’s Lévy flight perturbation operator is derived from the cuckoo search algorithm (CS) (Yang and Deb 2009). The complete description of the perturbation operator is summarized in Algorithm 2.
The cuckoo’s Lévy flight perturbation operator acts as a local search operator that manipulates the Lévy flight motion. For our Lévy flight implementation, we adopt the well-known Mantegna’s algorithm (Yang and Deb 2009). Within this algorithm, a Lévy flight step length can be defined as \(s = u/|v|^{1/\beta }\), where u and v are drawn from normal Gaussian distributions, i.e., \(u \sim N(0,\sigma _{u}^{2})\) and \(v \sim N(0,\sigma _{v}^{2})\). For the estimation of v, we use \(\sigma _{v}=1\). For the estimation of u, we evaluate the gamma function (\(\varGamma \)) with \(\beta =1.5\) (Yang 2008) and obtain \(\sigma _{u}\) using \(\sigma _{u} = [\varGamma (1+\beta )\sin (\pi \beta /2)/(\varGamma ((1+\beta )/2)\,\beta \,2^{(\beta -1)/2})]^{1/\beta }\).
In our case, the gamma function (\(\varGamma \)) implementation is adopted from Press et al. (1992). The Lévy flight motion is essentially a random walk that takes a sequence of jumps selected from a heavy-tailed probability distribution (Yang and Deb 2009). As a result, the motion produces a series of “aggressive” small and large jumps (either positive or negative), thus ensuring largely diverse values. In our implementation, the Lévy flight motion perturbs a single value of the current population of solutions, thus rendering it a local search operator.
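A minimal sketch of Mantegna's step-length computation as described above (our own code, assuming \(\beta = 1.5\) and \(\sigma_v = 1\)):

```python
import math
import random

def levy_step(beta=1.5, rng=random):
    """Mantegna's algorithm: s = u / |v|**(1/beta), with u ~ N(0, sigma_u^2)
    and v ~ N(0, 1); sigma_u follows from the gamma-function expression."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u = rng.gauss(0, sigma_u)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

random.seed(7)
steps = [levy_step() for _ in range(5)]
print(steps)  # a mix of small steps and occasionally large jumps
```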
As for the working of the operator, the initial \(X_{best}\) is set to \(X_{0}\) in line 1. The loop starts in line 2. One value from a particular individual \(X_{i}\) is selected randomly (column-wise) and perturbed using \(\alpha \) with entry-wise multiplication (\(\oplus \)) and the Lévy flight motion (L), as indicated in line 4. If the newly perturbed \(X_{i}\) has a better fitness value, then the incumbent is replaced and the value of \(X_{best}\) is also updated accordingly (lines 5–11). Otherwise, \(X_{i}\) is not updated, but \(X_{best}\) is still updated based on its fitness relative to \(X_{i}\).
Flower’s local pollination operator
As the name suggests, the flower’s local pollination operator is derived from the flower algorithm (Yang 2012). The complete description of the operator is summarized in Algorithm 3.
In line 1, \(X_{best}\) is initially set to \(X_{0}\). In line 2, two distinct peer candidates \(X_{p}\) and \(X_{q}\) are randomly selected from the current population X. The loop starts in line 3. Each \(X_{i}\) is iteratively updated based on the transformation equation defined in lines 4–5. If the newly updated \(X_{i}\) has a better fitness value, then the current \(X_{i}\) is replaced accordingly (lines 6–7). The value of \(X_{best}\) is also updated if \(X_{i}\) has a better fitness value than \(X_{best}\) (lines 8–10). When the newly updated \(X_{i}\) has a poorer fitness value, no update is made to \(X_{i}\), but \(X_{best}\) is still updated based on its fitness against \(X_{i}\) (lines 11–12).
Flower’s global pollination operator
Flower’s global pollination operator (Yang 2012) is summarized in Algorithm 4 and complements the local pollination operator described earlier.
Similar to the cuckoo’s Lévy flight perturbation operator described earlier, the global pollination operator also exploits the Lévy flight motion to generate a new solution. Unlike the former operator, the transformation equation for the flower’s global pollination operator uses the Lévy flight to update all the (column-wise) values of \(X_{i}\) instead of perturbing only one value, thereby making it a global search operator.
Considering the flow of the global pollination operator, \(X_{best}\) is initially set to \(X_{0}\) in line 1. The loop starts in line 2. The value of \(X_{i}\) is iteratively updated by the transformation equation that exploits the Lévy flight motion (lines 4–5). If the newly updated \(X_i\) has a better fitness value, then the current \(X_i\) is replaced accordingly (lines 6–7). The value of \(X_{best}\) is also updated if it has a better fitness value than that of \(X_{i}\) (lines 8–10). If the newly updated \(X_{i}\) has a poorer fitness value, no update is made to \(X_{i}\), but \(X_{best}\) is still updated based on its fitness against \(X_{i}\) (lines 11–12).
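The two pollination rules described above can be sketched as follows (illustrative code of ours; `levy` stands for any Lévy step generator such as Mantegna's, and the update rules are our reading of Algorithms 3 and 4):

```python
import random

def local_pollination(x, x_p, x_q, rng=random):
    """Local rule: x' = x + eps * (x_p - x_q), with eps ~ U(0, 1) and
    x_p, x_q two distinct peers from the current population."""
    eps = rng.random()
    return [xi + eps * (pi - qi) for xi, pi, qi in zip(x, x_p, x_q)]

def global_pollination(x, x_best, levy):
    """Global rule: x' = x + L * (x_best - x), with L a Levy-distributed
    step, pulling every component toward the best solution."""
    L = levy()
    return [xi + L * (bi - xi) for xi, bi in zip(x, x_best)]

# With a fixed "Levy" step of 0.5, the solution moves halfway to the best.
print(global_pollination([0.0, 0.0], [1.0, 1.0], levy=lambda: 0.5))  # [0.5, 0.5]
```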
Jaya search operator
The Jaya search operator is derived from the Jaya algorithm (Rao 2016). The complete description of the Jaya operator is summarized in Algorithm 5.
Unlike the search operators described earlier (which keep track of only \(X_{best}\)), the Jaya search operator keeps track of both \(X_{best}\) and \(X_{poor}\). As seen in line 6, the Jaya search operator exploits both \(X_{best}\) and \(X_{poor}\) as part of its transformation equation. Although biased toward global search in our Q-EMCQ application, the transformation equation can also address local search. In the case when \(\varDelta X=X_{best}-X_{poor}\) is sufficiently small, the transformation equation's offset (i.e., the term \(\mho (X_{best}-X_{i}) - \zeta (X_{poor}-X_{i})\)) will be insignificant relative to the current location of \(X_{i}\), allowing steady intensification.
As far as the flow of the Jaya operator is concerned, lines 1–2 set up the initial values for \(X_{best}=X_{0}\) and \(X_{poor}=X_{best}\). The loop starts from line 3. Two random values \(\mho \) and \(\zeta \) are generated to compensate and scale down the delta differences between \(X_{i}\) with \(X_{best}\) and \(X_{poor}\) in the transformation equation (in lines 4–5). If the newly updated \(X_{i}\) has a better fitness value, then the current \(X_{i}\) is replaced accordingly (in lines 7–8). Similarly, the value of \(X_{best}\) is also updated if it has a better fitness value than that of \(X_{i}\) (in lines 9–11). In the case in which the newly updated \(X_{i}\) has poorer fitness value, no update is made to \(X_{i}\). If the fitness of the current \(X_{i}\) is better than that of \(X_{best}\), \(X_{best}\) is assigned to \(X_{i}\) (in lines 12–13). Similarly, if the fitness of the current \(X_{i}\) is poorer than that of \(X_{poor}\), \(X_{poor}\) is assigned to \(X_{i}\) (in lines 14–15).
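The Jaya flow above can be sketched as follows. This is a hypothetical Python sketch rather than the paper’s implementation; the uniform random scalars play the roles of \(\mho \) and \(\zeta \), the transformation follows the form shown in line 6 of Algorithm 5, and `fitness` is assumed to return larger values for better solutions.

```python
import random

def jaya_update(x, x_best, x_poor):
    # Line 6 of Algorithm 5: move toward X_best and away from X_poor,
    # scaled by two uniform random factors (the paper's mho and zeta).
    r1, r2 = random.random(), random.random()
    return [xi + r1 * (bi - xi) - r2 * (pi - xi)
            for xi, bi, pi in zip(x, x_best, x_poor)]

def jaya_search(population, fitness, iterations=50):
    # Greedy acceptance while tracking both the best and poorest solutions.
    best = max(population, key=fitness)
    poor = min(population, key=fitness)
    for idx_loop in range(iterations):
        for idx, x in enumerate(population):
            cand = jaya_update(x, best, poor)
            if fitness(cand) > fitness(x):                # accept improvement
                population[idx] = cand
            if fitness(population[idx]) > fitness(best):  # update X_best
                best = population[idx]
            if fitness(population[idx]) < fitness(poor):  # update X_poor
                poor = population[idx]
    return best
```

Since acceptance is greedy, the best solution found can only improve over the iterations.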
Empirical study design
We have put our strategy under extensive evaluation. The goals of the evaluation experiments are fivefold: (1) to investigate how QEMCQ fares against its own predecessor EMCQ, (2) to benchmark QEMCQ against well-known strategies for t-wise test suite generation, (3) to assess the effectiveness of QEMCQ using t-wise criteria in terms of achieving branch coverage as well as revealing mutation-injected faults on real-world industrial applications, (4) to assess the efficiency of QEMCQ by comparing the test generation cost with manual testing, and (5) to compare the performance of QEMCQ with contemporary metaheuristics and hyperheuristics.
In line with the goals above, we focus on answering the following research questions:

RQ1: In what ways does the use of QEMCQ improve upon EMCQ?

RQ2: How good is the efficiency of QEMCQ in terms of test suite minimization when compared to the existing strategies?

RQ3: How good are combinatorial tests created using QEMCQ for 2-wise, 3-wise, and 4-wise at covering the code?

RQ4: How effective are the combinatorial tests created using QEMCQ for 2-wise, 3-wise, and 4-wise at detecting injected faults?

RQ5: How does QEMCQ with 2-wise, 3-wise, and 4-wise compare with manual testing in terms of cost?

RQ6: Apart from the minimization problem (i.e., t-wise test generation), is QEMCQ sufficiently general to solve a (maximization) optimization problem (i.e., module clustering)?
Experimental Benchmark setup
We adopt an environment consisting of a machine running Windows 10, with a 2.9 GHz Intel Core i5 CPU, 16 GB 1867 MHz DDR3 RAM, and 512 GB flash storage. We set the population size to \(N = 20\) with a maximum iteration value of \(\theta _{\max } = 2500\). While this choice of population size and maximum iterations could result in more than 50,000 fitness function evaluations, we limit the maximum number of fitness function evaluations to 1500 (i.e., QEMCQ stops when the fitness function evaluation count reaches 1500). This ensures a consistent number of fitness function evaluations throughout the experiments (as each iteration can potentially trigger more than one fitness function evaluation). For statistical significance, we executed QEMCQ 20 times for each configuration and reported the best results over these runs.
Experimental Benchmark Procedures
For RQ1, we arbitrarily select 6 combinations of covering arrays \(\hbox {CA}\,(N;2,4^{2}2^{3})\), \(\hbox {CA}\,(N;3,5^{2}4^{2}3^{2})\), \(\hbox {CA}\,(N;4,5^{1}3^{2}2^{3})\), \(\hbox {MCA}\,(N;2,5^{1}3^{3}2^{2})\), \(\hbox {MCA}\,(N;3,6^{1}5^{1}4^{3}3^{3}2^{3})\) and \(\hbox {MCA}\,(N;4,7^{1}6^{1}5^{1}4^{3}3^{3}2^{3})\). Here, the selected covering arrays span both uniform and non-uniform numbers of parameters. To ensure a fair comparison, we reimplement EMCQ using the same data structure and programming language (Java) as QEMCQ before adopting it for covering array generation. Our EMCQ reimplementation also rides on the same low-level operators (i.e., the cuckoo’s Lévy flight perturbation operator, the flower algorithm’s local and global pollination operators, and the Jaya search operator). For this reason, we can fairly compare both test sizes and execution times.
For RQ2, we adopted the benchmark experiments mainly from Wu et al. (2015). In particular, we adopt two main experiments involving \(\hbox {CA}\,(N;t,v^{7})\) with variable values \(2\le v\le 5\) and t varied up to 4, as well as \(\hbox {CA}\,(N;t,3^{k})\) with a variable number of parameters \(3\le k\le 12\) and t varied up to 4. We have also compared our strategy against the published results of strategies that are not freely available to download. Some of those strategies depend mainly on metaheuristic algorithms, specifically HSS, PSTG, DPSO, ACO, and SA. The others depend on exact computational algorithms, specifically PICT, TVG, IPOG, and ITCH. We present all our results in tables where each cell represents the smallest size (marked in bold) generated by the corresponding strategy. In the case of QEMCQ, we also report the average sizes to give a better indication of its efficiency. We opt for comparing generated sizes and not execution times because the strategies of interest are not available to us; even if they were, their programming languages and data structure implementations differ from ours, rendering an execution time comparison unfair. The size comparison, in contrast, is absolute and independent of the implementation language and data structure.
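To make the covering array notation concrete: \(\hbox {CA}\,(N;t,v^{k})\) denotes N tests over k parameters, each with v values, such that every t-way value combination appears in at least one test. A small illustrative checker (our own sketch, not part of any compared strategy) for verifying that a candidate suite covers all t-wise tuples might look like:

```python
from itertools import combinations, product

def required_tuples(domains, t):
    # All t-way parameter/value combinations, e.g. domains=[2, 2, 2] for 2^3.
    req = set()
    for cols in combinations(range(len(domains)), t):
        for vals in product(*(range(domains[c]) for c in cols)):
            req.add((cols, vals))
    return req

def covers_twise(suite, domains, t):
    # A suite is a valid covering array body if it hits every required tuple.
    seen = {(cols, tuple(row[c] for c in cols))
            for row in suite
            for cols in combinations(range(len(domains)), t)}
    return required_tuples(domains, t) <= seen
```

The size N reported in the tables is then the smallest suite length for which such a check passes.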
For answering RQ3–RQ5, we have selected a train control management system that has been in development for a couple of years. The system is a distributed control software with multiple types of software and hardware components for operation-critical and safety-related supervisory behavior of the train. The programs run on programmable logic controllers (PLCs), which are commonly used as real-time controllers in industrial domains (e.g., manufacturing and avionics); 37 industrial programs were provided, to which we applied the QEMCQ approach for minimizing the t-wise test suite.
Concerning RQ6, we have selected three freely available public domain class diagrams involving the Credit Card Payment System (CCPS) Cheong et al. (2012), Unified Inventory University (UIU) Sobh et al. (2010), and Food Book (FB)^{Footnote 1} as our module case studies. Here, we have adopted the QEMCQ approach for maximizing the number of clusters so that we can obtain the best modularization quality (i.e., the best clusters) for all three systems’ class diagrams.
For comparison purposes, we have adopted two groups of comparison. In the first group, we adopt EMCQ as well as the modified choice function (Pour Shahrzad et al. 2018) and Tabu search HHH Zamli et al. (2016) implementations. It should be noted that all the hyperheuristics ride on the same operators (i.e., Lévy flight, local pollination, global pollination, and Jaya). In the second group, we adopt the TLBO Praditwong et al. (2011), SCA Mirjalili (2016), and SOS Cheng and Prayogo (2014) implementations. Here, we are able to fairly compare the modularization quality as well as execution time, as the data structure, implementation language, running system environment, and maximum number of fitness function evaluations are the same. It should be noted that these algorithms (i.e., TLBO, SCA, SOS) do not have any parameter controls apart from population size and maximum iteration; hence, their adoption does not require any parameter calibration.
Case study objects
As highlighted earlier, we adopt two case study objects involving the train control management system as well as the module clustering of class diagrams.
Train control management system
We have conducted our experiment on programs from a train control management system running on PLCs that has been developed for a couple of years. A program running on a PLC executes in a loop in which every cycle contains the reading of input values, the execution of the program without interruptions, and the update of the output variables. As shown in Fig. 5, predefined logical and/or stateful blocks (e.g., bistable latch SR, OR, XOR, AND, greater-than GT, and timer TON) and connections between blocks represent the behavior of a PLC program written in the Function Block Diagram (FBD) programming language (John and Tiegelkamp 2010). These blocks are either supplied by a hardware manufacturer or developed as custom functions. PLCs contain particular types of blocks, such as timers (e.g., TON), that provide the same functions as timing relays and are used to activate or deactivate a device after a preset interval of time. There are two different timer blocks: (1) the on-delay timer (TON) and (2) the off-delay timer (TOF). A timer block keeps track of the number of times its input is either true or false and outputs different signals. In practice, many other timing configurations can be derived from these basic timers. An FBD program is translated into compliant executable PLC code. For more details on the FBD programming language and PLCs, we refer the reader to the work of John and Tiegelkamp (2010).
We experimented with 37 industrial FBD programs to which we applied the QEMCQ approach. On average, each program contains ten input parameters and 1209 lines of code.
To answer our research questions, we generated test cases using QEMCQ for 2-wise, 3-wise, and 4-wise and executed each program on these test cases to collect branch coverage and fault detection scores for each test suite, as well as the number of test cases created. A test suite created for a PLC program contains a set of test cases containing inputs, expected and actual outputs, together with timing constraints.
Test Case Generation and Manual Testing We used test suites automatically generated using QEMCQ. To do this, we asked an engineer from Bombardier Transportation Sweden AB, responsible for developing and testing the PLC programs used in this study, to identify the parameter value ranges and constraints for each input variable. We used the collected input parameter ranges for generating combinatorial test cases using QEMCQ. These ranges and constraints were also used for creating manual test suites. We collected the number of test cases for each manual test suite created by engineers for each of the programs used in this case study. In testing these PLC programs, the testing processes are performed according to safety standards and certifications, including rigorous specification-based testing based on functional requirements expressed in natural language. As the programs considered in this study are manually tested and are part of a delivered project, we expect the number of test cases created manually by experienced industrial engineers to be a realistic proxy measure of the level of efficiency needed to test these PLC programs thoroughly.
Measuring Branch Coverage Code coverage criteria are used in practice to assess the extent to which the PLC program has been covered by test cases (Ammann and Offutt 2008). Many criteria have been proposed in the literature, but in this study, we focus only on the branch coverage criterion. For the PLC programs used in this study, the engineers developing the software indicated that their certification process involves achieving high branch coverage. A branch coverage score was obtained for each test suite. A test suite satisfies branch coverage if running the test cases causes each branch in the program to take the value true at least once and the value false at least once.
Measuring Fault Detection Fault detection was measured using mutation analysis, by generating faulty versions of the PLC programs. Mutation analysis is used in our case study by creating faulty implementations of a program in an automated manner to examine the fault detection ability of a test suite (DeMillo et al. 1978). A mutated program is a new version of the original PLC program created by making a small change to the original. For example, in a PLC program, a mutated program is created by replacing an operator with another, negating an input variable, or changing the value of a constant to another interesting value. If the execution of a test suite on the mutated program gives a different observable behavior than the original PLC program, the test suite kills that mutant. We calculated the mutation score using an output-only oracle against all the created mutated programs. For all programs, we assessed the mutation detection capability of each test suite by calculating the ratio of mutated programs killed to the total number of mutated programs. Researchers (Just et al. (2014); Andrews et al. (2005)) have investigated the relation between real fault detection and mutant detection, and there is strong empirical evidence suggesting that if a test suite can detect or kill most mutants, it is also good at detecting naturally occurring faults, thus providing evidence that the mutation score is a fairly good proxy measure for fault detection.
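Under the output-only oracle described above, the mutation score reduces to a simple ratio of killed mutants to generated mutants. A minimal sketch, using a hypothetical encoding in which each program’s behavior over the test suite is represented by its list of outputs:

```python
def mutation_score(original_outputs, mutants_outputs):
    # A mutant is killed when its observable outputs over the test suite
    # differ from the original program's outputs in at least one test.
    killed = sum(1 for outs in mutants_outputs if outs != original_outputs)
    return killed / len(mutants_outputs)
```

A mutant whose outputs match the original on every test survives and lowers the score.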
In the creation of mutants, we rely on previous studies that looked at using mutation analysis for PLC software (Shin et al. 2012; Enoiu et al. 2017). We used the mutation operators proposed in Enoiu et al. (2017) for this study. The following mutation operators were used:

Logic Block Replacement Operator (LRO) Replacing a logical block with another block from the same category (e.g., replacing an AND block with an XOR block in Fig. 5).

Comparison Block Replacement Operator (CRO) Replacing a comparison block with another block from the same category (e.g., replacing a greaterthan (GT) block with a greaterorequal (GE) block in Fig. 5).

Arithmetic Block Replacement Operator (ARO) Replacing an arithmetic block with another block from the same functional category (e.g., replacing a maximum (MAX) block with an addition (ADD) block).

Negation Insertion Operator (NIO) Negating an input or output connection between blocks (e.g., a variable var becomes NOT(var)).

Value Replacement Operator (VRO) Replacing a value of a constant variable connected to a block (e.g., replacing a constant value (\(\hbox {var}=5\)) with its boundary values (e.g., \(\hbox {var}=6\), \(\hbox {var}=4\))).

Timer Block Replacement Operator (TRO) Replacing a timer block with another block from the same timer category (e.g., replacing a timer-off (TOF) block with a timer-on (TON) block in Fig. 5).
To generate mutants, each of the mutation operators was systematically applied to each program wherever possible. In total, for all of the selected programs, 1368 mutants (faulty programs based on ARO, LRO, CRO, NIO, VRO, and TRO operators) were generated by automatically introducing a single fault into the program.
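To illustrate how an operator is applied systematically with a single fault per mutant, here is a toy sketch of LRO over a hypothetical list-of-blocks program encoding (the real tooling works on FBD programs; this encoding and the helper name are ours):

```python
import copy

LOGIC_BLOCKS = {"AND", "OR", "XOR"}  # one functional category for LRO

def lro_mutants(program):
    # program: list of (block_type, input_names) pairs (hypothetical encoding).
    # Each mutant replaces exactly one logic block with another block from
    # the same category, so every mutant contains a single injected fault.
    mutants = []
    for pos, (block, inputs) in enumerate(program):
        if block in LOGIC_BLOCKS:
            for replacement in sorted(LOGIC_BLOCKS - {block}):
                mutant = copy.deepcopy(program)
                mutant[pos] = (replacement, inputs)
                mutants.append(mutant)
    return mutants
```

Applying all six operator families in this systematic, one-change-at-a-time fashion is what yields the 1368 single-fault mutants reported above.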
Measuring Cost Leung and White (1991) proposed the use of a cost model for comparing testing techniques by using direct and indirect testing costs. A direct cost includes the engineer’s time for performing all activities related to testing, but also machine resources such as the test environment and testing tools. On the other hand, an indirect cost includes test process management and tool development. To accurately measure the cost effort, one would need to measure the direct and indirect costs of performing all testing activities. However, since the case study is performed post-mortem on a system that is already in use and for which development has finished, this type of cost measurement was not feasible. Instead, we collected the number of test cases generated by QEMCQ as a proxy measure for the cost of testing. We are interested in investigating the cost of using the QEMCQ approach in the same context as manual testing. In this case study, we consider that costs are related to the number of test cases: the higher the number of test cases, the higher the respective test suite cost. We assume this relationship to be linear. For example, a complex program will require more effort to understand, and also more tests, than a simple program. Thus, the cost measure is related to the same factor, namely the complexity of the software, which influences the number of test cases. The cost measurement results are therefore directly tied to the number of test cases, assuming the same effort per created test case. Other testing costs are not considered, such as setting up the testing environment and tools, management overhead, and the cost of developing new tests. In this work, we restrict our analysis to the number of test cases created in the context of our industrial case study.
Module clustering of class diagrams
The details of the three class diagrams involved are:

Credit Card Payment System (CCPS) Cheong et al. (2012) consists of 14 classes interlinked with 20 two-way associations and 1 aggregation relationship (refer to Fig. 9a).

Unified Inventory University (UIU) Sobh et al. (2010) consists of 19 classes interlinked with 28 aggregations, 1 two-way association, and 1 dependency relationship (refer to Fig. 10a).

Food Book (FB)^{Footnote 2} consists of 31 interlinked classes with 25 two-way associations, 7 generalizations, and 6 aggregations clustered into 3 packages (refer to Fig. 11a).
The module clustering problem involves partitioning a set of modules into clusters based on the concepts of coupling (i.e., measuring the dependency between modules) and cohesion (i.e., measuring the internal strength of a module cluster). The higher the coupling, the less readable the code will be, whereas the higher the cohesion, the better the code organization will be. To allow its quantification, Praditwong et al. (2011) define modularization quality (MQ) as the sum of the ratios of intra-edges and inter-edges in each cluster, called the modularization factor (\(\hbox {MF}_{k}\)) for cluster k, based on the use of a module dependency graph such as the class diagram. Mathematically, \(\hbox {MF}_{k}\) can be formally expressed as in Eq. 11:

\[ \hbox {MF}_{k} = {\left\{ \begin{array}{ll} 0 &{} \text {if } i=0\\ \frac{i}{i+\frac{1}{2}j} &{} \text {if } i>0 \end{array}\right. } \]

where i is the weight of intra-edges and j is that of inter-edges. The term \(\frac{1}{2}j\) splits the penalty of an inter-edge across the two clusters connected by that edge. The MQ can then be calculated as the sum of \(\hbox {MF}_k\) as follows:

\[ \hbox {MQ} = \sum _{k=1}^{n} \hbox {MF}_{k} \]

where n is the number of clusters. It should be noted that maximizing MQ does not necessarily mean maximizing the number of clusters.
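Eq. 11 and the MQ sum translate directly into code. A minimal sketch, with each cluster summarized by its intra-edge weight i and inter-edge weight j:

```python
def modularization_factor(i, j):
    # MF_k = i / (i + j/2), defined as 0 for a cluster with no intra-edges;
    # j/2 splits each inter-edge penalty across the two clusters it joins.
    return 0.0 if i == 0 else i / (i + j / 2)

def modularization_quality(clusters):
    # MQ is the sum of MF_k over all n clusters; clusters is a list of
    # (intra_weight, inter_weight) pairs.
    return sum(modularization_factor(i, j) for i, j in clusters)
```

QEMCQ then searches over candidate partitions for the one maximizing this MQ value.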
Case study results
The case study results can be divided into two parts: for answering RQ1–RQ5 and for answering RQ6.
Answering RQ1–RQ5
This section provides an analysis of the data collected in this case study, including the efficiency of QEMCQ and the effectiveness of using combinatorial interaction testing of different strengths for industrial control software. For each program and each generation technique considered in this study, we collected the produced test suites (i.e., 2-wise stands for QEMCQ-generated test suites using pairwise combinations, 3-wise is short for test suites generated using QEMCQ and 3-wise interactions, and 4-wise stands for test suites generated using QEMCQ and 4-wise interactions). The overall results of this study are summarized in the form of boxplots in Fig. 7. Statistical analysis was performed using the R software (R-Project 2005).
As our observations are drawn from an unknown distribution, we evaluate whether there is any statistical difference between 2-wise, 3-wise, and 4-wise without making any assumptions about the distribution of the collected data. We use the Wilcoxon–Mann–Whitney U-test (Howell 2012), a non-parametric hypothesis test for determining whether two populations of data samples are drawn at random from identical populations. This statistical test was used in this case study to check whether there is any statistical difference in each measurement metric. In addition, the Vargha–Delaney test (Vargha and Delaney 2000) was used to calculate the standardized effect size, a non-parametric magnitude test that shows significance by comparing two populations of data samples and returning the probability that a random sample from one population will be larger than a randomly selected sample from the other. According to Vargha and Delaney (2000), statistical significance is determined when the obtained effect size is above 0.71 or below 0.29.
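The Vargha–Delaney statistic has a direct pairwise formulation: count wins (and half-credit ties) over all cross-population pairs. A small illustrative sketch:

```python
def vargha_delaney_a(xs, ys):
    # Probability that a random sample from xs is larger than one from ys,
    # with ties counted as half; 0.5 means no effect, while values above
    # 0.71 or below 0.29 are conventionally taken as a large effect.
    wins = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (wins + 0.5 * ties) / (len(xs) * len(ys))
```

Identical samples yield exactly 0.5, i.e., no measurable effect.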
For each measure, we calculated the effect size between 2-wise, 3-wise, and 4-wise, and we report in Table 5 the p values of these Wilcoxon–Mann–Whitney U-tests, with statistically significant effect sizes shown in bold.
RQ1: In what ways does the use of QEMCQ improve upon EMCQ?
Table 1 highlights the results for both QEMCQ and EMCQ results involving the 3 combinations of mixed covering arrays MCA \((N; 2, 5^1 3^3 2^2)\), MCA \((N; 3, 5^2 4^2 3^2)\), and MCA \((N; 4, 5^1 3^2 2^3)\).
Referring to Table 1, we observe that QEMCQ outperforms EMCQ as far as the average test suite size is concerned in all three MCAs. As for the time performance, EMCQ is better than QEMCQ, notably because it has no overhead for maintaining the Q-learning table.
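The Q-learning bookkeeping that EMCQ avoids can be sketched in a few lines. This is a generic, hypothetical illustration of single-state Q-learning for operator selection, not the exact QEMCQ reward and selection scheme; `q` maps each search operator name to its learned value, and the epsilon-greedy policy and learning rates are our illustrative assumptions.

```python
import random

def select_operator(q, epsilon=0.1):
    # Exploit the operator with the highest Q-value; explore with prob. epsilon.
    if random.random() < epsilon:
        return random.choice(list(q))
    return max(q, key=q.get)

def update_q(q, op, reward, alpha=0.1, gamma=0.9):
    # Standard single-state Q-learning update after observing the reward
    # (e.g., whether the operator improved the current solution).
    q[op] += alpha * (reward + gamma * max(q.values()) - q[op])
```

The table lookup and update are cheap but non-zero, which is consistent with the small execution-time overhead observed for QEMCQ.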
To investigate the performance of QEMCQ and EMCQ further, we plot the convergence profiles for the 20 runs for the three covering arrays, as depicted in Fig. 6a to c. At a glance, visual inspection indicates no difference as far as average convergence is concerned. Nonetheless, when we zoom in on the figures (on the right of Fig. 6a to c), we notice that QEMCQ has better average convergence than EMCQ.
RQ2: How good is the efficiency of QEMCQ in terms of test suite minimization when compared to the existing strategies?
Tables 2 and 3 highlight the results of two main experiments involving CA \((N; t, v^7)\) with variable values \(2 \le v \le 5\), t varied up to 4 as well as CA \((N; t, 3^k)\) with variable number of parameters \(3 \le k \le 12\), t varied up to 4. In general, the authors of the strategies used in our experimental comparisons only provide the best solution quality, in terms of the size N, achieved by them. Thus, these strategies cannot be statistically compared with QEMCQ.
As seen in Tables 2 and 3, the solution quality attained by QEMCQ is very competitive with respect to that produced by the stateoftheart strategies. In fact, QEMCQ is able to match or improve on 7 out of 16 entries in Table 2 (i.e., 43.75%) and 20 out of 27 entries in Table 3 (i.e., 74.07%), respectively. The closest competitor is that of DPSO which scores 6 out of 16 entries in Table 2 (i.e., 37.50%) and 19 out of 27 entries in Table 3 (i.e., 70.37%). Regarding the computational effort, as the strategies used in our comparisons adopt different running environments, data structures, and implementation languages, these algorithms cannot be directly compared with ours.
RQ3: How good are combinatorial tests created using QEMCQ for 2-wise, 3-wise, and 4-wise at covering the code?
In Table 4, we present the mutation scores, code coverage results, and the number of test cases in each collected test suite (i.e., 2-wise, 3-wise, and 4-wise generated tests). This table lists the minimum, maximum, median, mean, and standard deviation values. To give an example, 2-wise test suites achieved an average mutation score of 52%, while 4-wise test suites achieved an average mutation score of 60%. This shows a considerable improvement in the fault-finding capability of 4-wise test suites over their 2-wise counterparts. For branch coverage, combinatorial test suites are not able to reach, or come close to achieving, 100% code coverage on most of the programs considered in this case study.
As seen in Fig. 7b, for the majority of programs considered, combinatorial test suites achieve at least 50% branch coverage. 2-wise test suites achieve lower branch coverage scores (on average 84%) than 3-wise test suites (on average 86%). The coverage achieved by combinatorial test suites using 4-wise ranges between 50% and 100%, with a median branch coverage value of 90%.
As seen in Fig. 7b, the use of combinatorial testing achieves between 84% and 88% branch coverage on average. Results for all programs (in Table 5) show that the differences in code coverage achieved by 2-wise versus 3-wise and 4-wise test suites are not statistically significant (with an effect size of 0.4). Even though the generated test suites are created with the purpose of covering up to 4-wise input combinations, these test suites still miss some of the branches in the code. The results partly match our expectations: test suites automatically generated using combinatorial goals up to 4-wise achieve high branch coverage. Nevertheless, we observe that there is a need to consider other test design aspects and higher t-wise strengths to achieve over 90% branch coverage. This underscores the need to study further how combinatorial testing can be improved in practice and what aspects can be taken into account to achieve better code coverage. The programs considered in this study are used in real-time systems to provide operational control in trains. The runtime behavior of such systems depends not only on the choice of parameters but also on providing the right choice of values at the right time points. By considering such information, combinatorial tests might be more effective at covering the code. This needs to be studied further by considering the extent to which t-wise testing can be used in combination with other types of information.
RQ4: How effective are tests generated using QEMCQ for 2-wise, 3-wise, and 4-wise at detecting injected faults?
To answer RQ4 regarding effectiveness in terms of fault detection, we focused on analyzing the test suite quality of combinatorial testing. For all programs, as shown in Fig. 7a, the fault detection scores of pairwise-generated test suites show an average mutation score of 52%, which is not significantly worse than that of 3-wise (57% on average) and 4-wise (60%) test suites, with no statistically significant differences (effect size of 0.4 in Table 5). Hence, the interaction strength (up to 4-wise) of automatically generated combinatorial tests is not a good indicator of test effectiveness in terms of mutation score. However, one hypothesis emerges from this result: if 4-wise test suites do not achieve a high mutation score, there is a need to generate higher-strength test suites as well as to find ways to improve fault detection by using other test design techniques.
This is, to some extent, a surprising result. Our expectation was that combinatorial testing of higher strength than 2-wise would yield high scores (over 90%) in terms of fault detection; intuitively, 4-wise tests for FBD programs would be quite good at detecting faults. However, the results of our study are not consistent with the results of other studies (Kuhn et al. 2010; Richard Kuhn et al. 2004; Kuhn and Reilly 2002) reporting the degree of interaction occurring in naturally occurring faults. Our results indicate that combinatorial test cases with interactions up to 4-wise are not good indicators of test effectiveness in terms of fault detection. In addition, our results do not show any statistically significant difference in mutation score between the t-wise strengths considered in this study.
RQ5: How does QEMCQ for 2-wise, 3-wise, and 4-wise compare with manual testing in terms of cost?
As a baseline for comparing the cost of testing, we used test cases created by industrial engineers at Bombardier Transportation for all 37 programs included in this case study. These programs are part of a project already delivered to customers and thoroughly tested. Each test suite contains a set of test cases containing inputs, expected and actual outputs, and time information expressing timing constraints. As, in this case study, we consider the number of test cases related to the cost of creating, executing, and checking the result of each test case, we use the number of test cases in a manually created test suite as a realistic measure of the cost encountered in industrial practice for the programs considered. We assume that the higher the number of test cases, the higher the respective cost associated with each test suite. This section aims to answer RQ5 regarding the relative cost of performing testing, in terms of the number of test cases generated using QEMCQ, in comparison with manually handcrafted tests. As seen in Table 4, the number of test cases for 2-wise and 3-wise is consistently significantly lower than for 4-wise created tests. As seen in Table 5, the cost of performing testing using QEMCQ for 4-wise is consistently significantly higher (in terms of the number of test cases) than for manually created test suites; 4-wise and 3-wise generated test suites are longer (88 and 33 more test cases on average, respectively) than manual test suites. There is enough evidence to claim that the results between 4-wise and manual test suites are statistically significant, with a p value below the traditional statistical significance limit of 0.05 and a standardized effect size of 0.157. The effect is weaker for the result between 3-wise and manual test suites, with a p value of 0.05 and an effect size of 0.376.
As seen in Fig. 8, the use of 2-wise consistently results in shorter test suites for all programs than 3-wise and 4-wise. 2-wise test suites appear comparable with manual test suites in terms of the number of test cases. Examining Table 5, we see the same pattern in the statistical analysis: standardized effect sizes higher than 0.1, with p values higher than the traditional statistical significance limit of 0.05. The effect is strongest between 2-wise and 4-wise, with a standardized effect size of 0.08. It seems that 4-wise will create many more tests than 2-wise, which in practice can affect the cost of performing testing.
Answering RQ6
As highlighted earlier, the experiment for RQ6 investigates the performance of QEMCQ against some selected meta/hyperheuristics.
RQ6: Apart from the minimization problem (i.e., t-wise test generation), is QEMCQ sufficiently general to solve a (maximization) optimization problem (i.e., module clustering)?
As a general observation from the results in Table 6, we note that the hyperheuristics generally outperform the metaheuristics. This could be due to the fact that hyperheuristics can adaptively choose the right operator based on the needs of the current search. However, in terms of execution times, the metaheuristics appear to be slightly faster than their hyperheuristic counterparts owing to the direct link from the problem domain to the actual search operators.
Turning to the specific comparison within the hyperheuristic group in Table 6 and Figs. 9b, 10b and 11b, QEMCQ and MCF outperform all other hyperheuristics as far as the best MQ is concerned (2.226, 2.899, and 4.465 for the Credit Card Payment System, Unified Inventory University, and Food Book, respectively). In terms of the average MQ, QEMCQ performs better than MCF. Putting QEMCQ and MCF aside, Tabu HHH outperforms EMCQ in both average and best MQ. On a positive note, EMCQ outperforms all other hyperheuristics as far as execution times are concerned.
Considering the comparison with the metaheuristics, QEMCQ still manages to outperform all algorithms. In the case of the Credit Card Payment System, TLBO manages to match the best MQ of QEMCQ, although with a poorer average MQ. This is expected, as the Credit Card Payment System consists of only 14 classes compared to 19 and 31 classes in the Unified Inventory University and Food Book, respectively. In terms of execution time, SCA has the best time performance overall for Unified Inventory University (37.782 s) and Food Book (56.798 s), while TLBO gives the best performance for the Credit Card Payment System (33.531 s). Here, SOS gives the poorest execution time.
Discussion
Reflecting on the work undertaken, certain observations can be elaborated as lessons learned. In particular, we can group our observations into two parts: The first part relates to the design of QEMCQ and its operators, whereas the second part relates to its performance in the industrial case study.
Concerning the first part, we foresee QEMCQ as a general hybrid metaheuristic. Conventional hybrid metaheuristics are often tightly coupled (whereby two or more operators are interleaved) and too specific to a particular problem. In addition, the selection of a particular operator during the search does not consider the previous performance of that operator. Contrary to conventional hybrid metaheuristics, apart from being adaptive, the QEMCQ design is highly flexible. Two aspects of QEMCQ can be treated as “pluggable” components. First, the current Monte Carlo heuristic selection and acceptance mechanism can be replaced with other selection and acceptance mechanisms. Second, the individual search operators can also be replaced with other operators (taking into consideration whether they are for local or global search). For instance, the cuckoo’s perturbation operator can easily be substituted by the simulated annealing’s neighborhood search operator.
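To illustrate the pluggable-operator idea, the following minimal Python sketch (with invented operator names and toy bodies, not the actual QEMCQ implementation) treats each search operator as an interchangeable function behind a common interface:

```python
# Sketch: search operators as interchangeable ("pluggable") components. The
# operator bodies are toy stand-ins, not the actual QEMCQ implementation.
import random

def cuckoo_perturbation(solution):
    """Global-search move: randomly perturb one element (stand-in for a Levy flight)."""
    s = solution[:]
    i = random.randrange(len(s))
    s[i] += random.choice([-1, 1])
    return s

def sa_neighborhood_search(solution):
    """Local-search move: swap two adjacent elements (a simulated-annealing-style neighbour)."""
    s = solution[:]
    i = random.randrange(len(s) - 1)
    s[i], s[i + 1] = s[i + 1], s[i]
    return s

# A shared signature makes operators swappable without touching the search loop.
OPERATORS = {"global": cuckoo_perturbation, "local": sa_neighborhood_search}

random.seed(0)
candidate = OPERATORS["local"]([3, 1, 2])  # a swap preserves the multiset of values
```

Because both operators share one signature, substituting the simulated-annealing-style neighbour move for the cuckoo perturbation requires no change to the surrounding search loop.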
Unlike pure metaheuristic approaches, QEMCQ does not require any problem-specific tuning apart from calibrating the maximum iteration and the population size. Notably, cuckoo search as a standalone algorithm requires the calibration of three control parameters: the maximum iteration, the population size, and the probability (\(p_a\)) of replacing poor eggs. Similarly, the flower algorithm as a standalone algorithm requires the calibration of three control parameters: the maximum iteration, the population size, and the probability (p) of local or global pollination. Unlike the cuckoo and flower algorithms, the Jaya algorithm requires no parameters other than the maximum iteration and the population size. Adopted as individual search operators, the cuckoo’s probability (\(p_a\)) and the flower’s probability (p) are abandoned entirely within the design of QEMCQ.
Similar to its predecessor EMCQ, the selection of the search operators at any instance of the searching process is adaptively performed based on the Monte Carlo heuristic selection and acceptance mechanism. However, unlike EMCQ, QEMCQ also keeps a memory of the best performing operators via the Q-learning table. The effect of maintaining this memory can be seen in the average convergence. In the early iterations, QEMCQ behaves like EMCQ as far as average convergence is concerned. However, toward the end of the run, while EMCQ relies solely on the random selection of operators, QEMCQ uses historical performance to perform the selection. For this reason, QEMCQ achieves better average convergence than EMCQ.
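The interplay between early random selection and later history-driven selection can be sketched as a stateless (single-state) Q-learning loop. This is an assumed simplification of QEMCQ’s mechanism; the operator names, mock rewards, and schedule below are illustrative, not the paper’s actual values:

```python
# Sketch: stateless (single-state) Q-learning for operator selection. An
# assumed simplification of QEMCQ's mechanism; the operator names, mock
# rewards, and schedule are illustrative, not the paper's actual values.
ALPHA = 0.1                                    # learning rate
q_table = {"cuckoo": 0.0, "flower": 0.0, "jaya": 0.0}
mock_reward = {"cuckoo": 0.2, "flower": 0.5, "jaya": 0.9}  # stand-in fitness gains

for step in range(300):
    if step < 30:
        # Early iterations: explore, trying each operator in turn
        op = ["cuckoo", "flower", "jaya"][step % 3]
    else:
        # Later iterations: exploit the memory of historical performance
        op = max(q_table, key=q_table.get)
    reward = mock_reward[op]
    # Q-learning update (Watkins and Dayan 1992), single-state form
    q_table[op] += ALPHA * (reward - q_table[op])

best = max(q_table, key=q_table.get)  # "jaya" — the highest-reward operator wins
```

After the warm-up, the table steers selection toward the operator with the best track record, which is precisely the memory that plain EMCQ lacks.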
As far as the comparative benchmark experiments with other strategies are concerned, we note that QEMCQ and DPSO give the best results overall (see Tables 2 and 3). On a negative note, the approach taken by DPSO is rather problem-specific. In contrast, our experiments with a maximization problem (i.e., module clustering) indicate that the QEMCQ approach is sufficiently general (refer to Table 6), albeit with a small time penalty for maintaining the Q-learning mechanism. DPSO introduces two new control parameters as probabilities (pro1 and pro2), in addition to the existing social parameters (\(c_1\) and \(c_2\)) and inertia weight (w), to balance exploration and exploitation in the context of t-wise test generation. Adapting DPSO to other optimization problems can therefore be difficult, owing to the need to calibrate and tune all these control parameters.
On the other end of the spectrum, PICT and IPOG appear to perform the poorest (with no results matching any of the best sizes). A more subtle observation is that metaheuristic- and hyperheuristic-based strategies appear to outperform general computational-based strategies.
As part of our study, we used the number of test cases to estimate cost in terms of test creation, execution, and result checking. While the cost of creating and executing combinatorial tests can be low compared to manual testing, the cost of evaluating the test results is usually human-intensive. Our study suggests that combinatorial test suites for 4-wise coverage contain 100 test cases on average. Generating optimized or shorter test suites could therefore reduce the cost of performing combinatorial testing. We note here that the cost of testing is heavily influenced by the human cost of checking the test results. In this paper, we do not take into account the time needed to check the result of each test case; in practice, this is rarely negligible. A test strategy that requires every input parameter of the program to be used in a certain combination could contain test cases that are not specified in the requirements, which might increase the cost of checking the test results. A more accurate cost model would be needed to obtain more confidence in these results.
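The cost structure described above can be written as a simple linear model. The unit costs below are hypothetical placeholders, chosen only to show how human result checking dominates the total:

```python
# Hypothetical linear cost model for a combinatorial test suite. The unit
# costs are illustrative placeholders, not measured values from the study.
def suite_cost(n_tests, c_create=1.0, c_execute=0.1, c_check=10.0):
    """Total cost of a suite; c_check dominates because checking is human-intensive."""
    return n_tests * (c_create + c_execute + c_check)

# With the ~100 test cases reported on average for 4-wise suites:
cost_4wise = suite_cost(100)              # 1110.0 cost units
checking_share = 100 * 10.0 / cost_4wise  # ~0.90: result checking dominates
```

Under such a model, halving the suite size halves every cost component, which is why optimized (shorter) suites matter even when generation itself is cheap.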
The results of this paper show that 2- to 4-wise combinations of values are not able to detect more than 60% of the injected faults (52% on average for 2-wise, 57% for 3-wise, and 60% for 4-wise) and are not able to cover more than 88% of the code (84% on average for 2-wise, 86% for 3-wise, and 88% for 4-wise). Surprisingly, these results are not consistent with the results of other studies (Kuhn et al. 2010; Richard Kuhn et al. 2004; Kuhn and Reilly 2002) reporting the degree of interaction involved in real faults occurring in industrial systems. While not conclusive, the results of this study are interesting because they suggest that the degree of interaction involved in faults might not be as low as previously thought. As a direct consequence, testing all 4-wise combinations might not provide reasonable assurance in terms of fault detection. There is a need to study the use of higher-strength algorithms and to tailor them to the programs considered in this study, which are used in real-time software systems that provide control capabilities in trains. The behavior of such a program depends not only on the choice of parameters but also on providing the right choice of continuous values. By considering the state of the system and the timing information, combinatorial tests might be more effective at detecting faults. Bergström and Enoiu (2017) indicated that using timing information in combinatorial testing for the base-choice criterion results in higher code coverage and fault detection. This needs to be further studied by considering the extent to which t-wise testing can be combined with the real-time behavior of the input parameters.
Limitations
Our results regarding effectiveness are not based on naturally occurring faults. In our study, we automatically seeded mutants to measure the fault detection capability of the written tests. While it is possible that faults occurring naturally in industry would yield different results, there is some evidence (Just et al. 2014) to support the use of injected faults as substitutes for real faults. Another possible risk of evaluating test suites based on mutation analysis is the equivalent mutant problem, in which some mutants cannot show any externally visible deviation. The mutation score in this study was calculated as the ratio of killed mutants to the total number of mutants (including equivalent mutants, as we do not know which mutants are equivalent). This fact introduces a threat to the validity of the measurement. In addition, the results are based on a case study in one company using 37 PLC programs. Even if this number can be considered rather small, we argue that having access to real industrial programs created by engineers working in the safety-critical domain makes the study representative. More studies are needed to generalize these results to other systems and domains.
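For concreteness, the mutation score used above is simply the kill ratio over all generated mutants, equivalent ones included; the counts in the example are illustrative, not data from the study:

```python
# Mutation score as used in this study: killed mutants over all generated
# mutants, with equivalent mutants left in the denominator because they
# cannot be identified automatically. The counts below are illustrative.
def mutation_score(killed, total_mutants):
    return killed / total_mutants

score = mutation_score(52, 100)           # reported score: 0.52
# If, say, 5 surviving mutants were (unknowingly) equivalent, the true
# detectable-mutant ratio would be slightly higher:
true_score = mutation_score(52, 100 - 5)  # ~0.547
```

This is exactly why undetected equivalent mutants bias the reported score downward: they inflate the denominator without ever being killable.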
Finally, our general clustering problem has also dealt only with small-scale instances (the largest class diagram has only 31 classes). As the number of classes grows, the enumeration of possible solutions grows factorially. With such growth, there could be a potential clustering mismatch. In this case, maximizing MQ can be seen as two conflicting sides of the same coin. On one side, there is a need to obtain the largest MQ for better modularization. On the other side, automatically maximizing MQ for a large set of classes may be counterproductive (in terms of disrupting the overall architectural package structure of the classes). In fact, some individual clusters may not be intuitive to programmers at all. For these reasons, there is a need to balance obtaining a good enough MQ (which may not be the best one) with simultaneously obtaining a meaningful set of clusters.
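As a point of reference, one common formulation of MQ in search-based module clustering (cf. Praditwong et al. 2011) sums a per-cluster factor that trades intra-cluster edges against inter-cluster edges; the dependency graph below is a purely illustrative toy example:

```python
# One common formulation of modularization quality (MQ) from search-based
# module clustering (cf. Praditwong et al. 2011): each cluster contributes a
# cluster factor 2*intra / (2*intra + inter), and MQ sums the factors.
def mq(edges, clustering):
    """edges: iterable of (u, v) dependencies; clustering: dict node -> cluster id."""
    clusters = set(clustering.values())
    intra = {c: 0 for c in clusters}
    inter = {c: 0 for c in clusters}
    for u, v in edges:
        cu, cv = clustering[u], clustering[v]
        if cu == cv:
            intra[cu] += 1
        else:
            inter[cu] += 1   # an inter-cluster edge counts against both clusters
            inter[cv] += 1
    return sum(2 * intra[c] / (2 * intra[c] + inter[c])
               for c in clusters if intra[c] or inter[c])

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
two_clusters = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
one_cluster = {n: 0 for n in "ABCDE"}
# Splitting {A,B,C} from {D,E} scores higher than one monolithic cluster,
# yet the MQ-maximal partition need not match the architecture a
# programmer would find intuitive.
```

The last point is the tension discussed above: the numeric optimum and the architecturally meaningful clustering are not guaranteed to coincide.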
Conclusions
We present QEMCQ, a Q-learning-based hyperheuristic exponential Monte Carlo with counter strategy for combinatorial interaction test generation, and report the evaluation results obtained from a case study performed at Bombardier Transportation, a large-scale company developing industrial control software. The 37 programs considered in this study have been in development and are used in different train products around the world. The evaluation shows that the QEMCQ test generation method is efficient in terms of generation time and test suite size. Our results suggest that combinatorial interaction test generation can achieve high branch coverage. However, the generated test suites do not show high levels of fault detection in terms of mutation score and are more costly (i.e., in terms of the number of created test cases) than manual test suites created by experienced industrial engineers. The obtained results are useful for practitioners, tool developers, and researchers alike. Finally, to complement our current work, we have also demonstrated the generality of QEMCQ by addressing a maximization problem (i.e., the clustering of class diagrams). For future work, we plan to explore the adoption of QEMCQ for large embedded software, both for t-wise test generation and for modularization.
References
Ahmed BS, Zamli KZ, Lim CP (2012) Application of particle swarm optimization to uniform and variable strength covering array construction. Appl Soft Comput 12(4):1330–1347
Ahmed BS, Abdulsamad TS, Potrus MY (2015) Achievement of minimized combinatorial test suite for configurationaware software functional testing using the cuckoo search algorithm. Inf Softw Technol 66(C):13–29
Ahmed BS, Zamli KZ, Afzal W, Bures M (2017) Constrained interaction testing: a systematic literature study. IEEE Access 5:25706–25730
Ammann P, Offutt J (2008) Introduction to software testing. Cambridge University Press, Cambridge
Andrews JH, Briand LC, Labiche Y (2005) Is mutation an appropriate tool for testing experiments? In: Proceedings of the 27th international conference on Software engineering, ACM, pp 402–411
Ayob M, Kendall G (2003) A Monte Carlo hyperheuristic to optimise component placement sequencing for multi head placement machine. In: INTECH’03, Thailand, pp 132–141
Bell KZ, Vouk MA (2005) On effectiveness of pairwise methodology for testing networkcentric software. In: Enabling technologies for the new knowledge society: ITI 3rd international conference on information and communications technology, IEEE, pp 221–235
Bergström H, Enoiu EP (2017) Using timed basechoice coverage criterion for testing industrial control software. In: International conference on software testing, verification and validation workshops (ICSTW), pp 216–219
Burke E, Kendall G, Newall J, Hart E, Ross P, Schulenburg S (2003) Hyperheuristics: an emerging direction in modern search technology. Springer, Boston, pp 457–474
Burke EK, Hyde M, Kendall G, Ochoa G, Özcan E, Woodward JR (2010) A classification of hyperheuristic approaches. Springer, Boston, pp 449–468
Burke EK, Gendreau M, Hyde M, Kendall G, Ochoa G, Özcan E, Qu R (2013) Hyperheuristics: a survey of the state of the art. J Oper Res Soc 64(12):1695–1724
Calvagna A, Gargantini A (2009) IPO-s: incremental generation of combinatorial interaction test data based on symmetries of covering arrays. In: 2009 International conference on software testing, verification, and validation workshops, pp 10–18
Charbachi P, Eklund L, Enoiu E (2017) Can pairwise testing perform comparably to manually handcrafted testing carried out by industrial engineers? In: International conference on software quality, reliability and security companion (QRSC), pp 92–99
Cheng MY, Prayogo D (2014) Symbiotic organisms search: a new metaheuristic optimization algorithm. Comput Struct 139:98–112
Chen X, Gu Q, Li A, Chen D (2009) Variable strength interaction testing with an ant colony system approach. In: Proceedings of the 2009 16th Asiapacific software engineering conference. APSEC ’09, IEEE computer society, Washington, pp 160–167
Cheong CP, Fong S, Lei P, Chatwin C, Young R (2012) Designing an efficient and secure credit cardbased payment system with web services based on ansi x9.59–2006. J Inf Process Syst 8(3):495–520
Watkins CJCH, Dayan P (1992) Technical note: Q-learning. Mach Learn 8(3):279–292
Cohen DM, Dalal SR, Kajla A, Patton GC (1994) The automatic efficient test generator (AETG) system. In: International symposium on software reliability engineering, IEEE, pp 303–309
Cohen MB (2004) Designing test suites for software interaction testing. Technical report, The University of Auckland, Ph.D. Thesis
Cohen MB, Dwyer MB, Shi J (2007) Interaction testing of highlyconfigurable systems in the presence of constraints. In: Proceedings of the 2007 international symposium on software testing and analysis. ISSTA ’07, ACM, New York, pp 129–139
Cohen DM, Dalal SR, Parelius J, Patton GC (1996) The combinatorial design approach to automatic test generation. IEEE Softw 13(5):83
Cohen DM, Dalal SR, Fredman ML, Patton GC (1997) The AETG system: an approach to testing based on combinatorial design. IEEE Trans Softw Eng 23(7):437–444
Colbourn CJ, Martirosyan SS, Mullen GL, Shasha D, Sherwood GB, Yucas JL (2006) Products of mixed covering arrays of strength two. J Comb Des 14(2):124–138
Dalal SR, Jain A, Karunanithi N, Leaton JM, Lott CM (1998) Modelbased testing of a highly programmable system. In: International symposium on software reliability engineering, IEEE, pp 174–179
DeMillo RA, Lipton RJ, Sayward FG (1978) Hints on test data selection: help for the practicing programmer. Computer 11(4):34–41
Dhiman G, Kaur A (2019) Stoa: a bioinspired based optimization algorithm for industrial engineering problems. Eng Appl Artif Intell 82:148–174
Enoiu E, Sundmark D, Čaušević A, Pettersson P (2017) A comparative study of manual and automated testing for industrial control software. In: International conference on software testing, verification and validation (ICST), IEEE, pp 412–417
Forbes M, Lawrence J, Lei Y, Kacker RN, Kuhn DR (2008) Refining the inparameterorder strategy for constructing covering arrays. J Res Natl Inst Stand Technol 113(5):287–297
Ghandehari LS, Czerwonka J, Lei Y, Shafiee S, Kacker R, Kuhn R (2014) An empirical comparison of combinatorial and random testing. In: International conference on software testing, verification and validation workshops (ICSTW), IEEE, pp 68–77
Grindal M, Lindström B, Offutt J, Andler SF (2006) An evaluation of combination strategies for test case selection. Empir Softw Eng 11(4):583–611
Howell D (2012) Statistical methods for psychology. Cengage Learning, Boston
Jain M, Maurya S, Rani A, Singh V (2018) Owl search algorithm: a novel natureinspired heuristic paradigm for global optimization. J Intell Fuzzy Syst 34:1573–1582
Jia Y, Cohen MB, Harman M, Petke J (2015) Learning combinatorial interaction test generation strategies using hyperheuristic search. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1, pp 540–550
John KH, Tiegelkamp M (2010) IEC 61131–3: programming industrial automation systems: concepts and programming languages, requirements for programming systems, decisionmaking aids. Springer, Berlin
Just R, Jalali D, Inozemtseva L, Ernst MD, Holmes R, Fraser G (2014) Are mutants a valid substitute for real faults in software testing? In: International symposium on foundations of software engineering, ACM
Kashan AH, TavakkoliMoghaddam R, Gen M (2019) Findfixfinishexploitanalyze (f3ea) metaheuristic algorithm: an effective algorithm with new evolutionary operators for global optimization. Comput Ind Eng 128:192–218
Kendall G, Sabar NR, Ayob M (2014) An exponential Monte Carlo local search algorithm for the berth allocation problem. In: 10th International conference of the practice and theory of automated timetabling, pp 544–548
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks, vol 4, pp 1942–1948
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
Kuhn DR, Kacker RN, Lei Y (2010) Practical combinatorial testing. NIST Special Publication (SP) 800-142
Kuhn DR, Okum V (2006) Pseudoexhaustive testing for software. In: 30th Annual IEEE/NASA software engineering workshop, SEW’06, IEEE, pp 153–158
Kuhn DR, Reilly MJ (2002) An investigation of the applicability of design of experiments to software testing. In: Proceedings of the 27th annual NASA Goddard/IEEE on software engineering workshop, IEEE, pp 91–95
Lei Y, Kacker R, Kuhn DR, Okun V, Lawrence J (2008) IPOG/IPOG-D: efficient test generation for multi-way combinatorial testing. Softw Test Verif Reliab 18(3):125–148
Lei Y, Kacker R, Kuhn DR, Okun V, Lawrence J (2007) IPOG: a general strategy for t-way software testing. In: Proceedings of the 14th annual IEEE international conference and workshops on the engineering of computer-based systems. ECBS ’07, IEEE computer society, Washington, pp 549–556
Lei Y, Tai KC (1998) Inparameterorder: a test generation strategy for pairwise testing. In: Proceedings of third IEEE international highassurance systems engineering symposium (Cat. No. 98EX231), pp 254–261
Leung HKN, White L (1991) A cost model to compare regression test strategies. In: Proceedings of the conference on Software maintenance, IEEE, pp 201–208
Mahmoud T, Ahmed BS (2015) An efficient strategy for covering array construction with fuzzy logicbased adaptive swarm optimization for software testing use. Expert Syst Appl 42(22):8753–8765
Mandl R (1985) Orthogonal latin squares: an application of experiment design to compiler testing. Commun ACM 28(10):1054–1058
Mirjalili S (2016) Sca: a sine cosine algorithm for solving optimization problems. Knowl Based Syst 96:120–133
Mousavirad SJ, EbrahimpourKomleh H (2017) Human mental search: a new populationbased metaheuristic optimization algorithm. Appl Intell 47(3):850–887
Nie C, Leung H (2011) A survey of combinatorial testing. ACM Comput Surv 43(2):11:1–11:29
Pour SM, Drake JH, Burke EK (2018) A choice function hyperheuristic framework for the allocation of maintenance tasks in Danish railways. Comput Oper Res 93:15–26
Praditwong K, Harman M, Yao X (2011) Software module clustering as a multiobjective search problem. IEEE Trans Softw Eng 37(2):264–282
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, New York
Rao R (2016) Jaya: a simple and new optimization algorithm for solving constrained and unconstrained optimization problems. Int J Ind Eng Comput 7(1):19–34
Rao RV, Savsani VJ, Vakharia DP (2011) Teachinglearningbased optimization: a novel method for constrained mechanical design optimization problems. Comput Aided Des 43(3):303–315
Richard Kuhn D, Wallace DR, Gallo AM (2004) Software fault interactions and implications for software testing. IEEE Trans Softw Eng 30(6):418–421
R Project (2005) R: a language and environment for statistical computing. The R Foundation for Statistical Computing, http://www.R-project.org
Sabar NR, Kendall G (2015) Population-based Monte Carlo tree search hyperheuristic for combinatorial optimization problems. Inf Sci 314(Supplement C):225–239
Samma H, Lim CP, Saleh JM (2016) A new reinforcement learningbased memetic particle swarm optimizer. Appl Softw Comput 43(C):276–297
Sampath S, Bryce RC (2012) Improving the effectiveness of test suite reduction for usersessionbased testing of web applications. Inf Softw Technol 54(7):724–738
Schroeder PJ, Bolaki P, Gopu V (2004) Comparing the fault detection effectiveness of nway and random test suites. In: International symposium on empirical software engineering, IEEE, pp 49–59
Shayanfar H, Gharehchopogh FS (2018) Farmland fertility: a new metaheuristic algorithm for solving continuous optimization problems. Appl Soft Comput 71:728–746
Shiba T, Tsuchiya T, Kikuno T (2004) Using artificial life techniques to generate test cases for combinatorial testing. In: Proceedings of the 28th annual international computer software and applications conference, COMPSAC ’04, IEEE computer society, Washington, vol 01, pp 72–77
Shin D, Jee E, Bae DH (2012) Empirical evaluation on FBD modelbased test coverage criteria using mutation analysis. In: Model driven engineering languages and systems. Springer
Sobh K, Oliveira D, Liu B, Mayantz M, Zhang YM, Alhazmi A, de Bled R, AlSharawi A (2010) Software design document, testing, deployment and configuration management, and user manual of the UUIS—a team 4 COMP5541W10 project approach. CoRR, abs/1005.0169
Tsai CW, Huang WC, Chiang MH, Chiang MC, Yang CS (2014) A hyperheuristic scheduling algorithm for cloud. IEEE Trans Cloud Comput 2(2):236–250
Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132
Wallace DR, Richard Kuhn D (2001) Failure modes in medical device software: an analysis of 15 years of recall data. Int J Reliab Qual Saf Eng 8(04):351–371
Williams AW, Probert RL (1996) A practical strategy for testing pairwise coverage of network interfaces. In: Proceedings of seventh international symposium on software reliability engineering, pp 246–254
Wu H, Nie C, Kuo FC, Leung H, Colbourn CJ (2015) A discrete particle swarm optimization for covering array generation. IEEE Trans Evolut Comput 19(4):575–591
Yang XS, Deb S (2009) Cuckoo search via Lévy flights. In: 2009 World congress on nature and biologically inspired computing (NaBIC), pp 210–214
Yang XS (2008) Natureinspired metaheuristic algorithms. Luniver Press, Frome
Yang XS (2012) Flower pollination algorithm for global optimization. Springer, Berlin, pp 240–249
Zamli KZ, Alkazemi BY, Kendall G (2016) A tabu search hyperheuristic strategy for t-way test suite generation. Appl Soft Comput 44(C):57–74
Zamli KZ, Din F, Kendall G, Ahmed BS (2017) An experimental study of hyperheuristic selection and acceptance mechanism for combinatorial t-way test suite generation. Inf Sci 399(C):121–153
Acknowledgements
Open access funding provided by Karlstad University. The work reported in this paper is funded by a Fundamental Research Grant from the Ministry of Higher Education Malaysia titled “An Artificial Neural Network-Sine Cosine Algorithm-based Hybrid Prediction Model for the Production of Cellulose Nanocrystals from Oil Palm Empty Fruit Bunch” (RDU1918014). Wasif Afzal is supported by the Knowledge Foundation through grants 20160139 (TestMine) and 20130085 (TOCSYC) and by the European Union’s Horizon 2020 research and innovation programme, grant agreement No 871319. Eduard Enoiu is funded by the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No 737494 and by the Swedish innovation agency Vinnova (MegaM@Rt2).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Communicated by V. Loia.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Ahmed, B.S., Enoiu, E., Afzal, W. et al. An evaluation of Monte Carlo-based hyper-heuristic for interaction testing of industrial embedded software applications. Soft Comput 24, 13929–13954 (2020). https://doi.org/10.1007/s00500-020-04769-z
Keywords
 Search-based software engineering (SBSE)
 Fault finding
 System reliability
 Software testing
 Hyperheuristics