Statistical model checking for variability-intensive systems: applications to bug detection and minimization

We propose a new Statistical Model Checking (SMC) method to identify bugs in variability-intensive systems (VIS). The state space of such systems is exponential in the number of features, which makes the verification problem harder than for classical systems. To reduce verification time, we propose to combine SMC with featured transition systems (FTS), a model that represents jointly the state spaces of all variants. Our new methods allow the sampling of executions from one or more (potentially all) variants. We investigate their utility in two complementary use cases. The first case considers the problem of finding all variants that violate a given property expressed in Linear-Time Logic (LTL) within a given simulation budget. To achieve this, we perform random walks in the featured transition system, seeking accepting lassos. We show that our method allows us to find bugs much faster (up to 16 times faster according to our experiments) than exhaustive methods. As with any simulation-based approach, however, there is a risk of Type-1 error. We provide a lower bound and an upper bound for the number of simulations to perform to achieve a desired level of confidence. Our empirical study involving 59 properties over three case studies reveals that our method manages to discover all variants violating 41 of the properties. This indicates that SMC can act as a coarse-grained analysis method to quickly identify the set of buggy variants. The second use case complements the first one. In case the coarse-grained analysis reveals that no variant can guarantee to satisfy an intended property in all its executions, one should identify the variant that minimizes the probability of violating this property. Thus, we propose a fine-grained SMC method that quickly identifies promising variants and accurately estimates their violation probability. We evaluate different selection strategies and reveal that a genetic algorithm combined with elitist selection yields the best results.


Introduction
We consider the problem of bug detection in Variability-Intensive Systems (VIS). This category of systems encompasses any system that can be derived into multiple variants (differing, e.g., in the functionalities they provide), including software product lines [CN01] and configurable systems [SW98]. Compared to traditional ("single") systems, the complexity of bug detection in VIS is increased: bugs can appear only in some variants, which requires analysing the peculiarities of each variant.
Among the many techniques developed for bug detection, one finds testing and model checking. Testing [BJK+05] executes particular test inputs on the system and checks whether they trigger a bug. Although testing remains widely used in industry, the rise of concurrency and inherent system complexity has made system-level test case generation a hard problem. Also, testing is often limited to bounded reachability properties and cannot assess liveness properties.
Model checking [BK08] is a formal verification technique which checks that all behaviours of the system satisfy specified requirements. These behaviours are typically modelled as an automaton, where each node represents a state of the system (e.g. a valuation of the variables of a program and a location in this program's execution flow) and where each transition between two states expresses that the program can move from one state to the other by executing a single action (e.g. executing the next program statement). Requirements are often expressed in temporal logics, e.g. the Linear Temporal Logic (LTL) [Pnu77].
Such logics capture both safety and liveness properties of system behaviours. As an example, consider the LTL formula □(command_sleep ⇒ ♦system_sleep). Here, command_sleep and system_sleep are logic atoms and represent, respectively, a state where the sleep command is input and a state where the system enters sleep mode. The symbols □ and ♦ mean always and eventually, respectively. Thus, the whole formula expresses that "it is always the case that when the sleep command is input, the system eventually enters sleep mode".
Contrary to testing, model checking is exhaustive: if a bug exists, then the checking algorithm outputs a counterexample, i.e. an execution trace of the system that violates the verified property. Exhaustiveness makes model checking an appealing solution to obtain strong guarantees that the system works as intended. It can also nicely complement testing (whose main advantage remains that it applies directly to the running system), e.g. by reasoning over liveness properties or by serving as an oracle in test generation processes [ABM98]. Those benefits, however, come at the cost of scalability issues, the most prominent being the state explosion problem. This term refers to the phenomenon where the state space to visit is so huge that an exhaustive search is intractable. As an illustration of this, let us remark that the LTL model checking problem is PSPACE-complete [VW86].
Model checking complexity is further exacerbated when it comes to VIS. Indeed, in this case, the model checking problem requires verifying whether all the variants satisfy the requirements [CCS+13]. This means that, if the VIS comprises n variation points (n features in a software product line or n Boolean options in a configurable system), the number of different variants to represent and to check can reach 2^n. This exponential factor adds to the inherent complexity of model checking. Thus, checking each variant (or models thereof) separately, an approach known as enumerative or product-based [TAK+14], is often intractable. To alleviate this, variability-aware models and companion algorithms were proposed to represent and check efficiently the behaviour of all variants at once. For instance, Featured Transition Systems (FTS) [CCS+13] are transition systems where transitions are labelled with (a symbolic encoding of) the set of variants able to exercise them. The structure of an FTS, if well constructed, allows one to capture in a compact manner commonalities between states and transitions of several variants. Exploiting that information, family-based algorithms can check only once the executions that several variants can execute, and explore the state space of an individual variant only where it differs from all the others. In spite of positive improvements over the enumerative approach, state-space explosion remains a major challenge.
In this work, we propose an alternative technique for state-space exploration and bug detection in VIS. We use Statistical Model Checking (SMC) [LDB10] as a trade-off between testing and model checking to verify properties (expressed in full LTL) on FTS. The core idea of SMC is to conduct some simulations (i.e. sample executions) of the system (or its model) and verify whether these executions satisfy the property to check. The results are then used together with statistical tests to decide whether the system satisfies the property with some degree of confidence. Of course, in contrast with an exhaustive approach, a simulation-based solution does not guarantee a result with 100% confidence. Still, it is possible to bound the probability of making an error. Simulation-based methods are known to be far less memory- and time-consuming than exhaustive ones and are sometimes the only viable option. Over the past years, SMC has been used to, e.g. assess the absence of errors in various areas from aeronautics to systems biology; measure cost average and energy consumption for complex applications such as nanosatellites; or detect rare bugs in concurrent systems [JLS13, CIM+13, LL18].
In the context of VIS, we propose to use SMC to support two types of analyses. The first type is a coarse-grained analysis to identify the set of buggy variants and give some confidence that the other variants do not contain bugs.

Background on model checking
In model checking, the behaviour of the system is often represented as a transition system (S, →, AP, L) where S is a set of states, → ⊆ S × S is the transition relation, AP is a set of atomic propositions, and L : S → 2^AP labels any state with the atomic propositions that the system satisfies when in that state.

Linear temporal logic
LTL is a temporal logic that allows specifying desired properties over all future executions of a given system. Given a set AP of atomic propositions, an LTL formula φ is formed according to the following grammar: φ ::= a | ¬φ_1 | φ_1 ∧ φ_2 | ○φ_1 | φ_1 U φ_2, where φ_1 and φ_2 are LTL formulae, a ∈ AP, ○ is the next operator and U is the until operator. We also define ♦φ ("eventually" φ) and □φ ("always" φ) as shortcuts for true U φ and ¬♦¬φ, respectively.
Vardi and Wolper have presented an automata-based approach for checking that a system, modelled as a transition system ts, satisfies an LTL formula φ [VW86]. Their approach consists of, first, transforming φ into a Büchi automaton B_¬φ whose language is exactly the set of executions that violate φ, that is, those that visit infinitely often a so-called accepting state. Such an execution takes the form of a lasso, i.e. σ = q_0 … q_n with q_j = q_n for some j, and where q_i is accepting for some i : j ≤ i ≤ n. We name accepting any such lasso whose cycle contains an accepting state.
The second step is to compute the synchronous product of ts and B_¬φ, which results in another Büchi automaton B_ts⊗¬φ. Any accepting lasso in B_ts⊗¬φ represents an execution of the system that violates φ. Thus, Vardi and Wolper's algorithm comes down to checking the absence of such accepting lassos in the whole state space of B_ts⊗¬φ. The size of this state space is O(|ts| × 2^|φ|) and the underlying problem is PSPACE-complete.

Model checking for variability-intensive systems
Applying classical model checking to VIS requires iterating over all variants, constructing their corresponding automata B_ts⊗¬φ and searching for accepting lassos in each of these. This enumerative method (also named product-based [TAK+14]) fails to exploit the fact that variants have behaviour in common.
As an alternative, researchers came up with models able to capture the behaviour of multiple variants and distinguish between the unique and common behaviours of those variants [CDEG03, CCS+13, tBFGM16]. Among such models, we focus on featured transition systems [CCS+13], as those can link an execution to the variants able to execute it more directly than the alternative formalisms. In a nutshell, FTS extend standard transition systems by labelling each transition with a symbolic encoding of the set of variants able to exercise this transition. Then, the set of variants that can produce an execution π is the intersection of all sets of variants associated with the transitions in π.
To check which variants violate a given LTL formula φ, one can adapt the procedure of Vardi and Wolper and build the synchronous product of the featured transition system with B_¬φ [CCS+13]. This product is similar to the Büchi automaton obtained in the single-system case, except that its transitions are also labelled with a set of variants. Then, the buggy variants are those that are able to execute the accepting lassos of this automaton. This generalized automaton is the fundamental formalism we work on in this paper.
Definition 1 Let V be a set of variants. A Featured Büchi Automaton (FBA) over V is a tuple (Q, →, Q_0, A, V, γ) where Q is a set of states, → ⊆ Q × Q is the transition relation, Q_0 ⊆ Q is a set of initial states, A ⊆ Q is the set of accepting states, V is the whole set of variants, and γ : → → 2^V associates each transition with the set of variants that can execute it.

Figure 1 shows an FBA with two variants and eight states. State 5 is the only accepting state. Both variants can execute the transition from State 3 to State 4, whereas only variant v_2 can move from State 3 to State 6.
The Büchi automaton corresponding to one particular variant v is derived by removing the transitions not executable by v. That is, we remove all transitions (q, q′) ∈ → such that v ∉ γ(q, q′). The resulting automaton is named the projection of the FBA onto v. For example, one obtains the projection of the FBA in Fig. 1 onto v_2 by removing the transition from State 3 to State 7 and those between State 7 and State 8.

Statistical model checking
Originally, SMC was used to compute the probability of satisfying a bounded LTL property for stochastic systems [YS02]. The idea was to monitor the property on bounded executions, represent the outcomes as Bernoulli variables, and then use Monte Carlo estimation to approximate the satisfaction probability. SMC also applies to non-stochastic systems by assuming an implicit uniform probability distribution over the successors of each state. Grosu and Smolka [GS05] build on this and propose an SMC method to address the full LTL model-checking problem. Their sampling algorithm walks randomly through the state space of B_ts⊗¬φ until it finds a lasso. They repeat the process M times and conclude that the system satisfies the property if and only if none of the M lassos is accepting. They also show that, given a confidence ratio δ and assuming that the probability p for an execution of the system to violate the property exceeds an error margin ε, setting M = ln(δ)/ln(1 − ε) bounds the probability of a Type-1 error (rejecting the hypothesis that the system violates the property while it actually violates it) by δ. Thus, M can serve as a minimal number of samples to perform. Our work extends theirs in order to support VIS instead of single systems. Other work on applying SMC to the full LTL logic can be found in [DHKP17, YCZ10].

VIS verification
Numerous models have been proposed for VIS verification. For instance, Classen et al. proposed the Featured Transition System (FTS) formalism [CCS+13], an automata-based model that relies on transitions labelled by feature expressions. This formalism determines which variants can exercise each transition. Using this information, the fact that variants have behaviour in common can be exploited, leading to significant speedups in verification time. Accordingly, Classen et al. proposed variability-aware exhaustive algorithms to model-check FTS. By contrast, our approach samples execution traces from FTS to reduce the verification effort.
In addition to FTS, other models have been extended to capture the behaviour of multiple variants, such as modal transition systems [tBFGM16], product-line CCS (PL-CCS) [GLS08], and featured Petri nets [MCP10]. Each formalism has a different syntax and semantics. Modal transition systems and modal I/O automata use optional "may" transitions to model variability. Similarly, PL-CCS is a process algebra with an alternative choice operator between two processes to model the behaviour of a set of variants. All these approaches are valid solutions for VIS verification. However, most of them are isolated efforts that do not come with mature companion tools. Our work is, therefore, based on FTS. Ter Beek et al.'s [tBFGM16] solution based on modal transition systems is another mature approach. However, it requires the use of a separate logic to link variants to their behaviour in the model, which we found less practical than the explicit variability information contained in FTS. This information makes it easy and efficient to determine the variants that can execute a given buggy behaviour [CHS+10].

SMC for VIS
Recent work has applied SMC in the context of VIS. Vandin et al. [VtBLL18, tBLLV20] proposed an algebraic language to formally model behaviour with dynamic variability (i.e. where the system can adapt its configuration during its execution). Vandin et al. also proposed a product-based SMC approach to check properties on the reconfigurable system. Contrary to this work, our approach assumes static variability (the variants are distinct systems) and relies on family-based algorithms to reason more efficiently on the whole set of variants.
Dubslaff et al. [DKB14] and Nunes et al. [RAN+15] have studied VIS subject to stochastic behaviour and proposed exhaustive model-checking algorithms to check the probabilistic properties of such systems. Because of their exhaustive nature and the inherent computational cost of stochastic model checking, these algorithms suffer from scalability limitations. To overcome this scalability issue, Delahaye et al. applied SMC to stochastic parametric systems [DFL19]. Their approach opens the possibility of verifying stochastic VIS, where each variant is a valuation of the parameters. More precisely, Delahaye et al. target the verification of quantitative reachability properties. By contrast, we support non-quantitative but more general properties (expressed in LTL).

Optimizations of VIS
Applications of optimization methods to VIS are numerous (see, e.g., [HPHLT15, OSCR12, SRK+12]). These applications generally rely on constraint satisfaction problems, satisfiability modulo theories and integer linear programming. While these methods were effective in their particular contexts of use, they generally do not scale to large VIS and require complete knowledge of the function to optimize [ORGC14].
Our second use case is an optimization problem: we aim to find the variants that minimize the likelihood that a bug occurs. In this case, the function to optimize is dynamically valued during the simulation of the variants' behaviour. Therefore, we cannot characterize this function a priori. Approximate optimization methods fill these gaps by applying meta-search heuristics such as genetic algorithms (GA) [GWW+11] or filtered Cartesian flattening [HPHLT15]. Such methods return an approximate solution close to the optimum with more scalability and flexibility. Therefore, meta-heuristics like genetic algorithms are appropriate to solve our use case; these algorithms can explore the search space of a VIS even with partial knowledge, which we build through the simulation of the variants' behaviour. We opted for genetic algorithms because they offer a balance between convergence and disruption in the search space. Moreover, they can exploit hidden relationships between the features of the variants [GWW+11] and the probability to minimize. Our experiments reveal that our genetic approach provides good results, though other meta-heuristics could work as well [HJK+14].

Family-based statistical model checking for bug detection
The problem of bug detection in VIS using model checking can be phrased as follows: given an FBA over a set of variants V, compute the subset V_bug ⊆ V such that v ∈ V_bug if and only if v can execute an accepting lasso in the FBA (or, equivalently, the projection of the FBA onto v has a reachable accepting lasso).
The purpose of SMC is to reduce the verification effort (when visiting the state space of the system model) by sampling a given number of executions (here, lassos). This gain in efficiency, however, comes at the risk of Type-1 errors.
Indeed, while the discovery of a counterexample leads with certainty to the conclusion that the variants able to execute it violate the property φ, the fact that the sampling did not find a counterexample for some variant v does not entail a 100% guarantee that v satisfies φ. The more lassos we sample, the more confident we can get that the variants without counterexamples satisfy φ. Thus, designing a family-based SMC method involves answering three questions: (1) how to sample executions; (2) how to choose a suitable number of executions; and (3) what is the associated probability of Type-1 error.

Algorithm 1 (Random Lasso Sampling, detailed in the next subsection) implements the sampling: it returns a lasso σ of fba together with V_σ, the set of the variants able to execute σ, and a flag accept that is true iff σ is accepting.

Random sampling in featured Büchi automata
One can sample a lasso in an FBA by randomly walking through its state space, starting from a randomly chosen initial state and ending as soon as a cycle is found. A particular restriction is that this lasso should be executable by at least one variant; otherwise, we would sample a behaviour that does not actually exist. The set of variants able to execute a given lasso comprises those that can execute all its transitions, i.e. it is the intersection of all γ(q, q′) met along the transitions of this lasso. More generally, we define the lasso sample space of an FBA as the set of all lassos that at least one variant can execute.
Algorithm 1 formalises the sampling of lassos in a deadlock-free FBA. After randomly picking an initial state (Line 1), we walk through the state space by randomly choosing, at each iteration, a successor state among those available (Lines 7-18). Throughout the search, we maintain the set V_σ of variants that can execute the prefix σ sampled so far (Line 16). Then, we use this set as a filter when selecting successor states, to ensure that σ remains executable by at least one variant. At Line 13, Succ_σ is the set of successors q′ of q (the last state of σ) that can be reached by at least one variant in V_σ. We stop the search as soon as we reach a previously visited state (Line 7). If this state was visited before the last accepting state, it means that the sampled lasso is accepting (Line 19).
A motivated criticism [ODG+11] of the use of random walks to sample lassos is that shorter lassos receive a higher probability of being sampled. To counterbalance this, we implemented a heuristic named multi-lasso [GS05]. It consists of ignoring backward transitions that do not lead to an accepting lasso as long as there are still forward transitions to explore. This is achieved by modifying Line 13 such that backward transitions leading to a non-accepting lasso are not considered in the successor set.
Assuming a uniform selection of outgoing transitions from each state, one can compute the probability that a random walk samples any given lasso from the sample space.

Definition 3
The probability P(σ) of a lasso σ = q_0 … q_n is inductively defined as follows: P(q_0) = 1/|Q_0| and P(q_0 … q_i+1) = P(q_0 … q_i) × 1/|Succ_σ|, where Succ_σ is the filtered successor set of q_i (cf. Line 13 of Algorithm 1).

In the case of classical Büchi automata, Grosu and Smolka [GS05] show that (L, P(L), P) defines a probability space, where L denotes the lasso sample space. The proof derives from the observation that the lasso sample space contains all non-subsuming finite prefixes of all infinite paths of the automaton. Therefore, one can reduce the problem of sampling from an infinite space of infinite executions to sampling from a finite set of finite prefixes. This proof remains valid for our procedure to sample lassos in an FBA. Indeed, consider the Büchi automaton obtained by removing the feature expressions on the FBA transitions and removing the transitions that no variant can execute. The lasso sample space of this automaton is exactly the lasso sample space from which our procedure samples. Indeed, the only necessary condition for our procedure to sample a lasso is that at least one variant should be able to execute this lasso (see Line 9 of Algorithm 1).
Let us consider an example. In the FBA from Fig. 1, there are two non-accepting lassos (l_1 = (1, 2, 1) and l_2 = (1, 3, 7, 8, 7)) and two accepting lassos (l_3 = (1, 3, 4, 5, 3) and l_4 = (1, 3, 6, 5, 3)). Both variants can execute lasso l_3, while only v_1 can execute l_2 and only v_2 can execute l_1 and l_4. The probability of sampling l_1 is 1/2, whereas P[l_2] = P[l_3] = P[l_4] = 1/6; for instance, P[l_3] = 1/2 × 1/3 = 1/6, since state 1 has two successors and state 3 has three. Thus, the probability of sampling a counterexample executable by v_2 is 1/3, whereas it is only 1/6 for v_1. Next, we characterise the relationship between this probability space and any individual variant v. Let L_v be the set of lassos executable by v. Since L_v ⊆ L, the probability p_v to sample such a lasso is Σ_{σ_v ∈ L_v} P(σ_v). Note that p_v can be different from the probability p′_v of sampling an accepting lasso from the automaton modelling the behaviour of v only (i.e. the projection of the FBA onto v). This is because, in the FBA, the probability of selecting an outgoing transition from a given state is assigned uniformly, regardless of the number of variants able to execute that transition. This balance-breaking effect increases as the variants differ more in their numbers of unique executions.
Let σ = q_0 … q_n be a lasso in L_v. Then P_v(σ) is defined inductively like P(σ), except that the successor sets are restricted to the transitions of v, i.e. Succ_v(q) = {q′ : (q, q′) ∈ → ∧ v ∈ γ(q, q′)}. In our example, P_v1[l_3] = 1/2, as opposed to P[l_3] = 1/6. This implies that it is more likely to sample an accepting lasso executable by v_1 from its projection in one trial than it is from the whole FBA in two trials. This illustrates the case where merging the state spaces of the variants can have a negative impact on the capability to find bugs specific to one variant.
Thus, sampling lassos from the FBA allows finding a single counterexample executable by multiple variants, but introduces a bias. Overall, it tends to decrease the probability of sampling lassos from variants that have a smaller state space. This can impact SMC's results and parameter choices, like the number of samples required to get confident results and the associated Type-1 error.

Hypothesis testing
Remember that addressing the model checking problem for VIS requires finding a counterexample for every buggy variant v. Thus, one must sample a number M of lassos such that one gets an accepting lasso for each such buggy variant with a confidence ratio δ. Let fba be a featured Büchi automaton and v be a variant, and let X_v be the random variable counting the number of lassos one must sample from fba before getting an accepting lasso executable by v. The variables X_v are not independent, since one may sample a lasso executable by more than one variant.
We define X = max_{i=1..|V|} X_v_i. We aim to find a number of samples M such that P[X ≤ M] ≥ 1 − δ for a confidence ratio δ. This is analogous to the coupon collector's problem [BH97], which asks how many boxes are needed to collect one instance of every coupon placed randomly in the boxes. Our setting differs from the standard formulation in that the probabilities of occurrence of the coupons are neither independent nor uniform, and a single box can contain from 0 to |V| coupons. Even for simpler instances of the coupon collector's problem, computing P[X ≤ M] analytically is known to be hard [Shi07]. Thus, existing solutions rather characterise a lower bound and an upper bound. We follow this approach as well.

Lower bound (minimum number of samples)
To compute a lower bound for the number of samples to draw, we transform the family-based SMC problem to a simpler form (in terms of verification effort). We divide our developments into two parts. First, we show that assigning equal probabilities p v i to every variant v i (obtained by averaging the original probability values) reduces the number M of required samples. As a second step, we show that assuming that all variants share all their executions also reduces M . Doing so, we reduce the family-based SMC problem to its single-system counterpart, which allows us to reuse the theoretical results of Grosu and Smolka [GS05], and obtain the desired lower bound.

Averaged probabilities
Let p_avg = (1/|V|) Σ_{i=1..|V|} p_v_i and let X_even be the counterpart of X where all probabilities p_v_i have been replaced by p_avg.

Lemma 4 For any number M of samples, it holds that P[X ≤ M] ≤ P[X_even ≤ M].
Intuitively, the value of X depends mainly on the variants whose accepting lassos are rarer. By averaging the probabilities of sampling accepting lassos, we raise the likelihood of getting those rarer lassos; thus, the number of samples required to get an accepting lasso for all variants decreases. Shioda [Shi07] proves a similar result for the coupon collector problem. He does so by showing that the vector p_even = (p_avg, …, p_avg) majorizes p = (p_v_1, …, p_v_n) and that the ccdf (complementary cumulative distribution function) of X is a Schur-concave function of the sampling probabilities. Even though our case is more general than the non-uniform coupon collector's problem, the result of Lemma 4 still holds. Indeed, we observe that the theoretical proof of [Shi07] (a) does not assume the independence of the random variables Z_v_i; (b) still applies to the dependent case; and (c) supports the case where the sum of the probability values p_v_i is less than one.

Maximized commonalities
Next, let X_all be the particular case of X_even where all accepting lassos are executable by all variants and are sampled with probability p_avg. The expected number of samples required to find a counterexample that any variant can execute is less than the number required to find a counterexample for each variant. Hence the following lemma.

Lemma 5 For any number M of samples, it holds that P[X_even ≤ M] ≤ P[X_all ≤ M].
Thus, the number of samples to find an accepting lasso for every variant is reduced to the number of samples required to find any accepting lasso. Moreover, let us note that X all is a geometric random variable with parameter p avg . This reduces our problem to sampling an accepting lasso in a classical Büchi automaton and allows us to reuse the results of Grosu and Smolka [GS05].
Lemma 6 For a confidence ratio δ and an error margin ε_avg ≤ p_avg, it holds that P[X_all ≤ M] ≥ 1 − δ for M = ln(δ)/ln(1 − ε_avg) ≥ ln(δ)/ln(1 − p_avg). This leads us to the central result of this section.
Theorem 7 Assuming that p_avg ≥ ε_avg for a given error margin ε_avg, a lower bound for the number of samples required to find an accepting lasso for each buggy variant is M = ln(δ)/ln(1 − ε_avg), with a Type-1 error bounded by δ.

Upper bound (maximum number of samples)
We follow a similar two-step process to characterise an upper bound for M. In the first step, we replace the probabilities p_v_i of every variant by their minimum. In the second step, we alter the model so that the variants have no common behaviour. Then we show that, given a desired degree of confidence, the obtained model requires a higher number of samples than the original one.

Minimum probability
Let p_min = min_{i=1..|V|} p_v_i and let X_min be the counterpart of X where all probabilities p_v_i have been replaced by p_min. The ccdf of X being a decreasing function of the sampling probabilities, we have the following lemma.

Lemma 8 For any number M of samples, it holds that P[X_min ≤ M] ≤ P[X ≤ M].

No common counterexamples
Let {(X_indep)_v_i} be a set of independent geometric random variables with parameter p_min and let X_indep = max_i (X_indep)_v_i. X_indep actually encodes the number of samples required to get a counterexample for all buggy variants when those have no common counterexamples. Since, in this case, the number of samples to perform cannot be reduced by sampling a counterexample executable by multiple variants, we have the following result.

Lemma 9 It holds that P[X_indep ≤ M] ≤ P[X_min ≤ M] for any number M of samples.
Now, let us note that X_indep is an instance of the uniform coupon collector problem with |V| coupons to collect. A lower bound for P[X_indep ≤ M] is known to be 1 − |V| × (1 − p_min)^M [Shi07]. Assuming that p_min is greater than some error margin ε_min, solving 1 − |V| × (1 − ε_min)^M ≥ 1 − δ for M yields M = (ln(δ) − ln(|V|))/ln(1 − ε_min), which we can use as the upper bound for the number of samples to perform.
Theorem 10 Assuming that p_min ≥ ε_min for a given error margin ε_min, an upper bound for the number of samples required to find an accepting lasso for each buggy variant is M = (ln(δ) − ln(|V|))/ln(1 − ε_min), with a Type-1 error bounded by δ.

Product-based statistical model checking for minimizing bug likelihood
Our second use case focuses on the scenario where no variant can completely avoid triggering a bug (i.e. the violation of an LTL formula). The goal is then to find the variants minimizing the probability of revealing the bug within a limited execution sampling budget. Formally, let fts be an FTS over a set of variants V, let φ be an LTL formula, and let fba = fts ⊗ B_¬φ. One should find argmin_{v ∈ V} P(fba_|v ⊨_k φ), where ⊨_k means "satisfies after executing k transitions". Assuming a uniform distribution over executable transitions, one can approximate this probability with a maximum likelihood estimator, i.e. the proportion of sampled executions of length k that satisfy the formula: P(fba_|v ⊨_k φ) ≈ |{π ∈ Paths_[k](fba_|v) : π ⊨_k φ}| / |Paths_[k](fba_|v)|,
where Paths_[k](fba_|v) is a sufficiently large sample of path prefixes of length k in fba_|v. If Paths_[k](fba_|v) could be computed for every variant, a method solving argmin_{v ∈ V} P(fba_|v ⊨_k φ) for each variant v would naturally return the variant that minimizes the probability of revealing the bug. However, computing Paths_[k](fba_|v) could require an intractable number of samples for an intractable number of variants. Thus, our scenario is concerned with finding an appropriate budget allocation that maximizes the likelihood of discovering variants with a low violation probability.

Genetic algorithm
To solve this problem, we propose an elitist Genetic Algorithm (GA) that combines variant sampling with statistical model checking. This family of algorithms is convenient for our problem, as they can identify promising variants through evolution mechanisms [GWW+11, HJK+14]. This allows generating new promising variants based on previous ones. Moreover, such algorithms have been employed to solve search problems with inaccurate evaluation methods (i.e., noisy fitness functions [RKD17]). The noise of our fitness function originates from the estimation error in the violation probabilities (which is reduced as more path prefixes are sampled for a given variant).
Over the last decades, genetic algorithms have been extensively used as search and optimization tools in various problem domains, including science, commerce, and engineering. A GA is a stochastic algorithm that mimics biological evolution. First, the algorithm creates an initial set of solutions called individuals (or chromosomes). This set represents the population. Individuals are encoded as binary vectors where each bit represents the absence or presence of a gene. The evolutionary process begins with this initial population and proceeds over multiple generations. In each generation, the algorithm evaluates the individuals in the population through a problem-specific fitness function, i.e. an objective function that measures the quality of individuals and enables their comparison. Alterers (i.e. mutation and crossover operators) are used to alter the best individuals (mimicking species' evolution) to produce the individuals of the next generation. This evolution process continues until an individual with a desired level of quality is produced or until a fixed number of generations is reached. In our case, our GA generates variants over multiple iterations (generations), aiming to find the ones with the lowest probability of violating the given LTL formula. At each generation, the algorithm allocates a sampling budget to each individual (i.e., each variant) of the current population, samples that many executions from each of these variants, and updates (by averaging) the estimated violation probability of the variants accordingly. The next generation is created using two mechanisms: the elitism mechanism keeps some of the most promising variants, whereas the evolution mechanism fills the rest of the next generation using mutation and crossover alterers.
Population and genotype. We adopt a binary encoding of the set of variants, that is, we associate each variant with a binary vector that uniquely identifies it. Since the set V of variants is finite, one can always define such an encoding using at least log2(|V|) bits. The resulting binary vectors form the genotype of the individuals, i.e., each bit is a gene. The only requirement for an individual to be valid is that its associated vector corresponds to an actual variant. In practice, VIS are often defined as a set of Boolean features representing the variation points between variants, together with propositional logic formulae defining dependency constraints over these features [BSRC10]. In such a case, the binary encoding uses more than the minimal number of bits, which requires a more careful check of validity (e.g. using SAT solvers).

Fitness function. The fitness function returns, for any variant, its estimated probability of violating the input LTL formula. At each generation, the algorithm samples a predefined number of executions for each variant of the current population. Then, it updates the estimated violation probability of each variant according to the number of those executions that violate the property. Thus, the estimated probability values evolve over the generations. Formally, the fitness of a variant v_i is given by fitness(v_i) = |{π ∈ Paths_[k](v_i) : π ⊨_k φ}| / |Paths_[k](v_i)|, where Paths_[k](v_i) is the set of k-length paths that have been sampled from fba_|v_i. Therefore, the fitness of a variant is the current estimate of its violation probability. Thus, in the long run, it is equivalent to the function that we seek to minimize. The heuristic nature of our algorithm lies in the fact that the fitness function is only an estimation (computed from simulation runs), which becomes more accurate over the iterations of the algorithm.

Selectors and alterers. At each iteration of the genetic algorithm, a subset of the individuals is selected to pursue evolution, and some of them are altered by mutation to produce new individuals. Generally speaking, an appropriate choice of selection and alteration operators largely depends on the modelled problem. In our context, we focus on finding and keeping a promising variant with the lowest amount of simulation budget. As for the alterers, two individuals are combined to form an offspring based on single-point crossover with a high recombination probability (0.8), while gene mutation in the offspring occurs with a probability of 0.4. For selection, we keep the top 1/3 of variants according to fitness (the elite, i.e., the variants that minimize the property violation) and fill the rest of the population with altered variants derived from the elite.

Generation process
Algorithm 2 formalizes the generation process of our genetic algorithm. As a first step, we initialize the set of sampled paths for each variant to the empty set (Line 1). We also generate an initial population P including L variants selected randomly (Line 2), with L = 6 in our experiments. Then, we make the population evolve for a predefined number N of generations (Lines 3-24), set to 12 in our experiments. At each iteration (generation), we first update the estimated violation probability of each variant of the current population P (i.e. the fitness function) after sampling the predefined number of executions in their projection (Lines 4-6).
Input: V, a set of variants; fba, an FBA over V; k ∈ N_0, the execution length; M, the number of samples; N, the number of generations; L, the size of the population. Output: v_best, the most promising variant.

What follows is the application of selectors and alterers to form the next generation P′. We first use elitist selection, which keeps some of the best individuals according to the fitness function (Lines 8-11). As for the alterers (Lines 12-21), we randomly apply crossover and mutation operators to fill the new generation. For the crossover, we may (with probability p_c = 0.8 in our experiments) randomly pick pairs of individuals from the previous generation and use a simulated binary crossover [RKD17] to create two new offspring; we assign the same probabilistic importance to each parent. Next, we may (with probability p_m = 0.4) also apply binary mutation to alter randomly the genes of each selected individual (Line 21); each feature (gene) then has a probability p_f = 0.1 of being altered. At the end of the mutation process, we obtain a new population P′ to proceed to the next generation.
After the specified number of generations has elapsed, the algorithm returns the variant with the best fitness from the last generation.

Objectives and methodology
Our experiments concern the evaluation of our SMC methods covering the two use cases. They aim to answer six research questions, i.e. the first three related to the first use case (bug detection) and the last three related to the second use case (bug likelihood estimation and minimization).
One can regard SMC as a means of speeding up verification at the risk of missing counterexamples. Our first question studies this trade-off and analyses the empirical Type-1 error rate. More precisely, we compute the detection rate of our family-based SMC method, expressed as the number of buggy variants that it detects over the total number of buggy variants.

RQ1 What is the empirical buggy variant detection rate of family-based SMC?
We compute the detection rate for different numbers M of samples lying between the lower and the upper bound characterised in Sect. 3. To get the ground truth (i.e. the true set of all buggy variants), we execute the exhaustive LTL model checking algorithms for FTS designed by Classen et al. [CCS+13]. For the lower bound, we assume that the average probability of sampling an accepting lasso for any variant is higher than ε_avg = 0.01. Setting a confidence ratio δ = 0.05 yields M = ln(0.05)/ln(0.99) ≈ 298. We round up and set M = 300 as our lower bound. For the upper bound, we assume that the minimum probability of sampling a counterexample in a buggy variant is higher than a margin ε_min such that the resulting bound is 18,478. For convenience, we round it up to 19,200 = 300 · 2^6. In the end, we successively set M to 300, 600, …, 19,200 and observe the detection rates.
Next, we investigate a complementary scenario where the engineer has a limited budget of samples to check. We study the smallest budget required by SMC to detect all buggy variants (in the cases where it can indeed detect all of them) and the computation resources this incurs compared to an exhaustive search of the state space. Thus, our second question is:

RQ2 How efficient is SMC with a minimal sample budget compared to an exhaustive search?
Finally, we compare family-based SMC with the alternative of sampling in each variant's state space separately. We name this alternative method enumerative SMC. Hence, our third research question is:

RQ3 How does family-based SMC compare with enumerative SMC?
As before, we compare the two techniques w.r.t. detection rate. We set M to the same values as in RQ1. In enumerative SMC, this means that each variant receives a budget of M/|V| samples, where M is the number of samples used in family-based SMC and V is the set of variants.
Once we have shown that SMC can accurately identify buggy variants, we focus on the second use case. We evaluate the capability of our method to estimate the probability of variants to violate given LTL formulae and identify the variants that have the lowest violation probability. We ask:

RQ4 Can SMC identify the optimal variants minimizing bug likelihood?
To answer this question, we compare our elitist genetic algorithm with a non-elitist variation and two random baselines (random selection with and without elitism). The main baseline, named random with elitism, randomly picks and evaluates variants while keeping the best variants across iterations. This method is similar to our genetic algorithm and differs only by the selection of new variants to evaluate (e.g. it performs random selection rather than mutations). The other baselines are non-elitist versions of the random and the genetic approaches (i.e. they may not retain the best variants of each generation to the next one).
We apply each strategy and record the probability of the variant returned by each technique. We compare the returned variant with the best one using ground truth (i.e. the real violation probability of each variant). The ground truth is computed using a large enough sampling budget per variant.
The previous question compares the genetic approach with the baselines under a fixed budget. We also aim to understand the impact of this budget on the effectiveness of our SMC methods. A higher simulation budget should obviously improve the quality of the variants (i.e., minimizing bug likelihood) returned by the methods. We ask:

RQ5 How does the sample budget impact the effectiveness of our SMC methods?
To answer this question, we repeat our RQ4 experiments with different simulation budgets. Doing so allows us to observe the effect of providing SMC with more budget and the trend for each alternative method.
Finally, the effectiveness of our genetic algorithm may depend on meta-parameters like the elitism ratio and the probabilities of alterers. Our last question investigates the impact of these parameters. We ask:

RQ6 How do genetic parameters impact the ability of our SMC method to identify the variants minimizing bug likelihood?
To answer this question, we repeat our RQ5 experiments with four configurations of the aforementioned parameters and we observe the trend of each configuration.

Experimental setup
Implementation. We implemented our SMC algorithms (family-based and enumerative) in a prototype tool developed in C and Python. More precisely, the SMC algorithms were implemented in C by reusing the Promela parser of ProVeLines [CSHL13b], a state-of-the-art model checker for VIS. The tool takes as input an FTS, an LTL formula and a sample budget. Then it performs SMC until all samples are checked or until all variants are found to violate the formula. The Python component implements our genetic algorithm on top of pyeasyga. To compare with the exhaustive search, we use ProVeLines [CSHL13b], which ensures a fair comparison by guaranteeing that all approaches explore the same state space.

Dataset. We consider three systems that were used in past research to evaluate VIS model checking algorithms [CCS+13, CSHL13a, CHL+14]. Table 1 summarizes the characteristics of our case studies and their related properties. The first system is a minepump system [KMSL83, CCS+13] with 128 variants. The underlying FTS comprises 250,561 states, while the state space of all variants taken individually reaches 889,124 states. The second model is an elevator model inspired by Plath and Ryan [PR01]. It is composed of eight configuration options, which can be combined into 256 different variants, and its FTS has 58,945,690 states to explore. The third and last is a case study inspired by the CCSDS File Delivery Protocol (CFDP) [Con07], a real-world configurable spacecraft communication protocol [BCH+10]. The FTS modelling the protocol consists of 1,732,536 states to explore and 56 variants (individually totalling 2,890,399 states).
For the RQs related to the first use case (RQ1, RQ2 and RQ3), we discarded the properties that are satisfied by all variants, namely Minepump #17, #33, #40; Elevator #13; and CFDP #5. Indeed, these properties are not relevant for RQ1 and RQ3, since SMC is trivially correct in such cases. As for RQ2, any small sample budget would return correct results while being more efficient than the exhaustive search. This leaves us with 59 properties.
For the RQs related to the second use case (RQ4, RQ5 and RQ6), we keep only the properties that all variants violate and where the difference between the highest and lowest violation probabilities is at least 0.05. This results in 12 properties over 2 case studies: 5 for the Minepump case study and 7 for the Elevator.

Infrastructure and repetitions. We ran our experiments on a MacBook Pro 2018 with a 2.9 GHz Core i7 processor and macOS 10.14.5. To account for random variations in the sampling, we execute 100 runs of each experiment and compute the average detection rates for each property. Figure 2 shows as boxplots, for each case study and over all checked properties, the percentage of buggy variants for which family-based SMC found a counterexample. We provide those boxplots for different numbers M of samples.

RQ1: detection rate
In the case of Minepump and Elevator, the median detection percentage is 100% starting from M = 1,200 and M = 600, respectively. Further increasing the number of samples raises the 0.25 percentile. In Minepump and for M = 1,200, there are 18/41 properties for which SMC could not detect all buggy variants. Increasing M improves significantly the percentage of buggy variants detected by SMC for all these properties, although there remain undetected variants in 15 of them even with M = 19,200. This illustrates that our assumption regarding p_min was inappropriate for those properties: counterexamples are rarer than we assumed. The Elevator study yields even better results: at M = 600, SMC detects all buggy variants for 10/18 properties; this number becomes 14/18 at M = 2,400 and 17/18 at M = 9,600. As for the remaining property, SMC with M = 19,200 detects 50% of the variants on average, and we observe that this percentage consistently increases as we increase M.
The results for CFDP are mixed: while the median percentage goes beyond 80% as soon as M = 1,200, it tends to saturate as the number of samples increases. The 0.25 percentile still increases but also seems to reach an asymptote in the trials with the highest M. A detailed look at the results reveals that, for M ≥ 1,200, SMC cannot identify all buggy variants for only two properties: #3 (9 buggy variants) and #4 (4 buggy variants). At M = 19,200, SMC detects on average 5.43 and 3.14 buggy variants for those two properties, respectively.
Further doubling M raises these numbers to 6.36 and 3.26. This indicates that the non-detected variants have few counterexamples, which are rare due to the smaller state space of those variants. The computation resources required by SMC to find such rare counterexamples with high confidence exceed those needed to model-check the undetected variants thoroughly. An alternative would be to direct SMC towards rare executions, leaning on techniques such as [JLS13, CIM+13].
SMC can detect all buggy variants for 41 properties out of 59. However, for the remaining properties, SMC was unable to find the rare counterexamples of some buggy variants. This calls for new dedicated heuristics to sample those rare executions.

RQ2: efficiency
Next, we check how much execution time SMC can spare compared to the exhaustive search. Results are shown in Table 2. Overall, we see that SMC holds the potential to greatly accelerate the discovery of all buggy variants, achieving a total speedup of 526%, 1891% and 356% for Minepump, Elevator and CFDP, respectively. For more than half of the properties, the smallest number of samples we tried (i.e. 300) was sufficient for a thorough detection. Those properties are actually violated by all variants. The fact that SMC requires such a small number of samples means that the same bug lies in all the variants (as opposed to each variant violating the property in its own way). On the contrary, Minepump property #31 is also violated by all variants but requires a much higher sample number, which illustrates the presence of variant-specific bugs.
Interestingly, the benefits of SMC are higher in the Elevator case (the largest of the three models), achieving speedups of up to 16,575%. A likely explanation is that the execution paths of the Elevator model share many similarities, which means that a single bug can lead to multiple failed executions. By sampling randomly, SMC avoids exploring thoroughly a part of the state space that contains no bug and, instead, increases the likelihood of moving to interesting (likely buggy) parts. A striking example is property #16 (satisfied by half of the variants), where SMC reduces the verification time from 3 minutes to 4 seconds.
Where SMC can detect all buggy variants, it can do so more efficiently than the exhaustive search for 33/41 properties, achieving speedups of multiple orders of magnitude.

RQ3: family-based SMC versus enumerative SMC
Figure 3 shows the detection rate achieved by enumerative SMC for the three case studies and different numbers of samples, while the results of family-based SMC were shown in Fig. 2. In the Minepump and Elevator cases, enumerative SMC achieves a lower detection rate than family-based SMC. In both cases, a Student t-test with α = 0.05 rejects, with statistical significance, the hypothesis that the two SMC methods yield no difference in error rate. One can observe, for instance, that with 600 samples, enumerative SMC achieves a median detection rate of 31.13%, while family-based SMC achieves 99.86%. This tends to validate our hypothesis that family-based SMC is more effective as the variants share more executions. Indeed, on average, one state of the Minepump is shared by 3.55 variants.

In the case of CFDP, however, enumerative SMC performs systematically better (up to 13.95% more). Still, the difference in median detection rate tends to disappear as more executions are sampled. Nevertheless, CFDP illustrates the main drawback of family-based SMC: it can overlook counterexamples in variants with fewer behaviours. In fact, previous research [CCS+13] has shown that, compared to enumerative approaches, the benefits of family-based methods diminish as there are more divergences between the behaviours of the variants and as these divergences occur earlier during verification. These conclusions also apply to our SMC-based method, as it exploits similar heuristics to check common behaviour only once. In other words, the effectiveness of our method diminishes as the number of behaviours unique to some variants increases. Still, the non-exhaustive nature of our approach adds another dimension to the problem: the rarer such a unique behaviour is, the less likely our SMC method samples it. In such cases, enumerative SMC might complement family-based SMC by sampling from the state space of specific variants.

Family-based SMC can detect significantly more buggy variants than enumerative SMC, especially when few lassos are sampled. Yet, enumerative SMC remains useful for variants that have a tiny state space compared to the others and can, thus, complement the family-based method.

RQ4: bug likelihood minimization
Table 3 shows, for each case study and LTL formula, the difference δ between the real violation probability of the best variant v_best and that of the variant v_res returned by each method; the left numbers are the values of δ, whereas the right numbers (between parentheses) are the percentage of error that δ represents, and we also report the average ranking of the returned variant (less is better) over 100 runs. We estimated the real probabilities by running a large sample of executions for each variant. We experimentally found that 1,920,000 executions were sufficiently large to get a precise estimate of the violation probability of each variant, with negligible random differences across multiple runs (approximately 1e−4). We evaluate each method with a sampling budget M = 19,200; hence, each method samples a maximum of 19,200 lassos in total over all variants. Given the random factors involved in each method, we repeated our experiments 100 times and computed the average values obtained for δ as well as the average rank of the variant returned by each method.

Overall, the results show that the genetic algorithm with elitism performs better than the other strategies for both Minepump (128 variants) and Elevator (256 variants). For Minepump properties, it returns on average a product in the top 15%; by comparison, the random method with elitism (the second-best approach) returns a product in the top 18%. In the Elevator case study, the average ranking becomes 11% (our genetic approach with elitism) against 16% (random with elitism).

The absence of elitism substantially degrades the results for both random search and our genetic algorithm, with random search performing a bit better in this setting. The genetic selection mechanism seems more efficient than a random one; this could be explained by the fact that exploiting the features of previously evaluated variants to select new ones is more efficient than random selection.
We observe that the efficiency of our product-based methods depends on the studied property. For example, the GA with elitism offers impressive effectiveness for Elevator #1, #8 and #9. However, the same technique has disappointing effectiveness in some cases, e.g. Minepump #4 and Elevator #5.
SMC can effectively identify the variants that minimize bug likelihood using non-exhaustive methods while remaining scalable.

RQ5: sample budget impact on bug likelihood minimization
Figure 4 shows, for different simulation budgets, the effectiveness of the SMC methods. Unsurprisingly, increasing the simulation budget improves the efficiency of every method. The heuristics with elitism generally outperform the others. This is mainly because, without elitism, the algorithm may replace the current best variants of the population with worse ones. The lack of elitism, furthermore, seems to annihilate the benefits of the mutation and crossover operators.

When elitism is added, the genetic algorithm performs better than the random heuristics and much better than the non-elitist versions. In fact, the results of the non-elitist versions are reached by the elitist ones with approximately four times less simulation budget. With elitism, mutation-based selection performs better than random selection because mutating and breeding the best variants of the current population allows the exploitation of the neighbourhood of those variants. On the contrary, random selection favours exploration over exploitation and, therefore, fails to exploit combinations of features that are effective.
The genetic algorithm with elitism outperforms the other methods, and the differences increase as the budget increases.

RQ6: genetic parameters impact on bug likelihood minimization
We evaluate different configurations of the genetic algorithm using different elitism ratios and alterer parameters. We consider two elitism ratios: 1/3 (E−) and 2/3 (E+). We also consider two sets of alterer parameters: high alteration (M+) and low alteration (M−). In M+, the crossover probability is set to 0.8 and the mutation probabilities to 0.4/0.1, meaning that an individual is mutated with probability 0.4 and each of its genes is then altered with probability 0.1. In M−, these probabilities are set to 0.2 for crossover and 0.1/0.01 for mutation. These elitism ratios and alterer parameters yield four configurations in total.
We measure again the effectiveness of our method, for each configuration, as the average ranking of the selected variant, for different simulation budgets. Figure 5 shows the results. We can observe that the E−/M− (low elitism and low mutation) version of our algorithm performs much worse than the others (in both case studies). Furthermore, E+/M+ and E+/M− perform similarly. Finally, E−/M+ seems to perform slightly better than the other configurations. This shows that a good balance between elitism and alteration yields the best results. High mutation provides more disruption in the variants produced from one generation to the next. On the contrary, the role of elitism is to stabilize the set of variants from one generation to the next by keeping the best variants. Low alteration tends to perform worse, as the method may spend simulation budget on uninteresting variants across multiple generations, regardless of the elitism ratio.
Our genetic algorithm tends to perform better when it keeps a low elitism ratio (1/3) and sets higher alteration probabilities.

Conclusion
We proposed a new simulation-based approach for finding bugs in VIS. It applies statistical model checking to FTS, an extension of transition systems designed to model concisely multiple VIS variants.
In the first use case we investigated, our method results in either collecting counterexamples for multiple variants at once or proving the absence of bugs. The algorithm always converges, up to some confidence error that we quantify on the FTS structure by relying on results for the coupon collector problem. After implementing the approach within a state-of-the-art tool, we studied its benefits and drawbacks empirically. It turns out that a small number of samples is often sufficient to detect all buggy variants, outperforming an exhaustive search by an order of magnitude. On the downside, we were unable to find counterexamples for some faulty variants and properties. This calls for future research exploiting techniques to guide the simulation towards rare bugs/events [JLS13, CIM+13, BDH15] or towards uncovered variants, relying, e.g., on distance-based sampling [KGS+19] or lightweight scheduling sampling [DHS18].
In the second use case, our method combining SMC with genetic algorithms showed the best capability of finding variants with lower violation probabilities. Elitism proved to be an essential constituent of our approach's success. One particular challenge in allocating the sampling budget resides in the fact that the fitness function is only an approximation of the real violation probabilities (yet it becomes increasingly accurate as more samples are allocated). Interesting directions for future work involve further improving our approach with heuristics to deal with such noise in fitness functions.
Nevertheless, the positive outcome of our study is to show that SMC can act as a low-cost, high-reward alternative to exhaustive verification, which provides thorough results in a majority of cases.