Uniform and scalable sampling of highly configurable systems

Many analyses of configurable software systems are intractable when confronted with colossal and highly constrained configuration spaces. These analyses could instead use statistical inference, where a tractable sample accurately predicts results for the entire space. To do so, the laws of statistical inference require each member of the population to be equally likely to be included in the sample, i.e., the sampling process needs to be "uniform". SAT-samplers have been developed to generate uniform random samples at a reasonable computational cost. However, there is a lack of experimental validation over colossal spaces to show whether the samplers indeed produce uniform samples or not. This paper (i) proposes a new sampler named BDDSampler, (ii) presents a new statistical test to verify sampler uniformity, and (iii) reports the evaluation of BDDSampler and five other state-of-the-art samplers: KUS, QuickSampler, Smarch, Spur, and Unigen2. Our experimental results show that only BDDSampler satisfies both scalability and uniformity.

To get a sense of this problem's relevance and complexity, consider an example taken from the SPL domain. BusyBox is a software tool that replaces many standard GNU/Linux utilities with a single small executable, thus providing an environment customized for a diversity of embedded systems. To achieve size optimization, BusyBox is remarkably modular, supporting the inclusion/exclusion of 613 features at compile time. These features and their interrelationships are specified with a configuration language named Kconfig. To guarantee that every valid configuration satisfies all dependencies, the Kconfig model of BusyBox is translated into a Boolean formula that is then processed with a logic engine (Batory 2005; Fernandez-Amoros et al. 2019) (e.g., a SAT solver (Biere et al. 2009)). A valid configuration corresponds to a satisfying assignment of the formula, also called a SAT-solution (Plazar et al. 2019) or a witness (Chakraborty and Meel 2019).
As a consequence of the inter-feature dependencies, the space of valid configurations (7.428 ⋅ 10^146) is a tiny portion of the whole configuration space (2^613): only 2.185 ⋅ 10^−36 % of the possible configurations are valid. Nevertheless, the population of valid configurations is still colossal. Those SPL analyses that examine every valid configuration are unscalable.
For instance, Halin et al. (2019) adopted an exhaustive strategy to test the JHipster system, checking all its valid configurations. JHipster is a code generator for web applications with 45 selectable features that can produce a total of 26,256 valid configurations. Checking this modest configuration space with the INRIA Grid'5000 required 4,376 hours of CPU time (∼182 days) and 5.2 terabytes of disk space.
Others have advocated approaching this and related problems via statistical inference (Alférez et al. 2019;Alves Pereira et al. 2020;Guo et al. 2018;Kaltenecker et al. 2019;Kolesnikov et al. 2019;Nair et al. 2017;Oh et al. 2017;Temple et al. 2016;Weckesser et al. 2018); that is, working with a tractable sample that predicts the results for the entire population. An essential requirement is that all samples be genuinely representative of the population (Kaplan 2012). In other words, each member of the population must be equally likely to be included in a sample. Authors often use the term uniform random sampling (Oh et al. 2017;Plazar et al. 2019;Sharma et al. 2018) for this idea.
A naive approach to get such a sample would (i) generate a random configuration set without considering feature dependencies, and then (ii) check with a logic engine if each configuration conforms to those dependencies. Unfortunately, and as mentioned above, feature dependencies shrink the configuration space extraordinarily, and so getting a single valid configuration randomly is extremely unlikely. As a result, more advanced algorithms generate valid and uniform random samples at a reasonable computational cost.
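To make the naive approach concrete, here is a minimal sketch; the model and its `is_valid` check are toy assumptions standing in for a real logic engine:

```python
import random

# Naive rejection sampling: draw random configurations ignoring feature
# dependencies, keep only those the validity check accepts. `is_valid` is a
# toy stand-in for a logic engine; here, feature 1 requires feature 0.

def is_valid(config):
    return config[0] or not config[1]

def naive_sample(num_features, wanted, rng=random.Random(0)):
    sample, attempts = [], 0
    while len(sample) < wanted:
        attempts += 1
        config = tuple(rng.random() < 0.5 for _ in range(num_features))
        if is_valid(config):
            sample.append(config)
    return sample, attempts
```

With this toy model, 3 out of 4 random configurations are valid, so few draws are wasted; in realistic models the valid fraction can be as small as 2.185 ⋅ 10^−36 %, so virtually every draw would be rejected.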
Verifying that these algorithms and their tools indeed generate genuinely uniform samples is a challenge by itself, because it requires examining the consistency between sample statistics and their corresponding population parameters (e.g., how frequently a feature appears in a sample compared to its probability of being included in any valid configuration). As configuration spaces can be colossal, current procedures that certify a sampler's uniformity have the severe shortcoming of requiring gigantic sample sizes to estimate reliable statistics (Dutra et al. 2018; Achlioptas et al. 2018; Chakraborty and Meel 2019). Consequently, sampler uniformity has been checked only on miniature models so far, which is not convincing. Also, most uniformity procedures compute population parameters in a poorly scalable way (e.g., requiring calling a #SAT solver thousands of times (Plazar et al. 2019)). This paper extends our SPLC'20 paper, where (i) a statistical test is formulated to reduce the sample size required for assessing a sampler's uniformity, and (ii) population parameters are computed with scalable algorithms we proposed in Heradio et al. (2019). The additional contributions of this present paper are:

1. A new sampler called BDDSampler, which is built upon Binary Decision Diagram (BDD) (Bryant 1986) technology (see Section 3).
2. A new statistical test to validate a sampler's uniformity, reducing the sample size requirements even more than our previous test (see Section 4).

BDDSampler is the only sampler that satisfies both uniformity and scalability. Our software artifacts (BDDSampler, and the data and code scripts for replicating the experiments) are freely available at public repositories (see Section 8).

Related work
Before discussing related work, a terminological clarification is needed. In machine learning, the term sample usually refers to a single data point (Chollet and Allaire 2018). However, in inferential statistics, a sample is typically a collection of cases, where the number of cases in the sample is the sample size (Chihara and Hesterberg 2011; Kaplan 2012). This paper adopts the latter terminology; consequently, a sample is a set of configurations (i.e., a collection of SAT-solutions), whose cardinality is the sample size.
Here is additional standard statistical terminology that we will use in this paper. Inferential statistics aims to generalize the results obtained from a sample to the entire population. To do so, the most widespread approach, called Null Hypothesis Significance Testing (NHST), quantifies the probability of obtaining the sample results conditioned on the assumption that a given null hypothesis (H0) is true (NHST fundamentals are explained in Chapter 13 of Kaplan (2012) and Chapter 3 of Vasishth and Broe (2011)). If such probability (named the p-value) is less than or equal to an established threshold (called the significance level, α), then H0 is rejected, and thus its alternative hypothesis Ha is accepted. Otherwise, H0 is kept.
As Table 1 shows, two mistakes can be made under this framework due to unusual random samples: rejecting a true H0 (named a Type 1 error), and failing to reject a false H0 (called a Type 2 error). The expression 1 − β is known as the test's power. The experimenter can adjust the Type 1 and 2 error probabilities through the thresholds α and β (see Chapter 4 of Vasishth and Broe (2011)).

Uniform random samplers
The following sections summarize some of the most common strategies to generate uniform random samples for a model of a configuration space that is encoded as a Boolean formula φ.

Atomic mutations (QuickSampler)
QuickSampler (Dutra et al. 2018) uses a heuristic to gain scalability by minimizing the number of calls to a constraint solver. It generates a random configuration without taking the formula constraints into account. This configuration often violates the constraints and is thus unsatisfiable. So, QuickSampler calls the Z3 solver (de Moura and Bjørner 2008) to fix the configuration by finding a MAX-SAT solution. Then, QuickSampler flips the value of each variable and calls Z3 again to get another valid configuration. The differences between the variable values of the original and flipped SAT configurations are called atomic mutations. By combining mutations, QuickSampler quickly generates new configurations without calling the solver, as those configurations are usually legal (Dutra et al. 2018).
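The mutation-combination idea can be sketched as follows, with configurations as bit-vectors; the function names are illustrative, not QuickSampler's actual API:

```python
from itertools import combinations

# Configurations are integers used as bit-vectors of feature values.
# An atomic mutation is the XOR difference between the base MAX-SAT solution
# and the valid solution obtained after flipping one variable.

def atomic_mutations(base, flipped_solutions):
    return [base ^ sol for sol in flipped_solutions]

def combine(base, mutations, k):
    """XOR k mutations into the base configuration; the results are often
    (but not always) valid, so no solver call is needed to produce them."""
    for subset in combinations(mutations, k):
        candidate = base
        for m in subset:
            candidate ^= m
        yield candidate
```

Because the combined configurations are only usually valid, a full sample produced this way still contains some invalid configurations, which is one source of QuickSampler's non-uniformity discussed later.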

Hashing-based sampling (Unigen2)
Several techniques divide the space of SAT-solutions into small "cells" of approximately the same size using r independent hash functions. Accordingly, sampling is done by choosing a cell at random, and then getting a satisfying assignment for that cell using a SAT solver. A critical point of these techniques is determining the "right" r value. For instance, Bellare et al. (2000) showed that an r equal to the number of formula variables guarantees uniformity. However, Chakraborty et al. (2013) reported that such an r does not scale in practice; in contrast, r = 3 scales better and ensures near-uniformity. Unigen2 (Chakraborty 2015) develops these ideas further, giving stronger uniformity guarantees.
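The cell-partitioning idea can be illustrated with random XOR constraints; this is a simplified sketch, whereas real tools conjoin such constraints with the CNF formula and call a SAT solver:

```python
import random

# Each of the r hash functions is a random XOR (parity) constraint over a
# random subset of variables. Fixing the target parity of every constraint
# selects one of ~2^r cells; a solution belongs to the chosen cell iff it
# satisfies all parities.

def random_xor_constraints(num_vars, r, rng=random.Random(0)):
    return [([v for v in range(num_vars) if rng.random() < 0.5],
             rng.randrange(2))
            for _ in range(r)]

def in_cell(assignment, constraints):
    return all(sum(assignment[v] for v in subset) % 2 == parity
               for subset, parity in constraints)
```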

Counting-based sampling (KUS, Smarch, and Spur)
In Section 7.1.4 of Knuth (2009), Knuth showed how to accomplish uniform random sampling by successively partitioning the SAT-solution space on variable assignments, and then counting the number of solutions of the resulting parts. Let φ be a Boolean formula over v variables x_1, x_2, …, x_v; let #SAT(φ) denote the number of solutions of φ; and let r ∈ [0, 1] be a random number in the unit interval. Conceptually, the procedure works as follows. The number of solutions where x_1 is true is counted, namely #SAT(φ ∧ x_1). x_1 follows a Bernoulli distribution with probability p_1 = #SAT(φ ∧ x_1) / #SAT(φ); x_1 is assigned true if r ≤ p_1, and false otherwise. Suppose x_1 is assigned false. Then, x_2 follows a Bernoulli distribution with probability p_2 = #SAT(φ ∧ ¬x_1 ∧ x_2) / #SAT(φ ∧ ¬x_1), and it is randomly assigned likewise. The procedure advances until the last variable x_v is assigned, and thus the random solution is completed.
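The procedure above can be sketched as follows; solution counting is done here by brute-force enumeration purely for illustration, whereas real samplers use a #SAT solver or a BDD:

```python
import random
from itertools import product

def count_sat(formula, num_vars, partial):
    """#SAT restricted to assignments extending `partial` ({var: bool})."""
    free = [v for v in range(num_vars) if v not in partial]
    total = 0
    for bits in product([False, True], repeat=len(free)):
        assignment = dict(partial)
        assignment.update(zip(free, bits))
        if formula(assignment):
            total += 1
    return total

def knuth_sample(formula, num_vars, rng=random.Random()):
    """Assign variables one by one, each drawn from its Bernoulli
    distribution conditioned on the assignments made so far."""
    partial = {}
    for v in range(num_vars):
        denom = count_sat(formula, num_vars, partial)
        partial[v] = True
        p_true = count_sat(formula, num_vars, partial) / denom
        partial[v] = rng.random() < p_true
    return partial
```

For φ = x_1 ∨ x_2 there are three solutions, and each is drawn with probability 1/3.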
The original algorithm by Knuth is specified on BDDs, as the probabilities required for all the possible SAT-solutions are computed just once with a single BDD traversal, and then reused every time a random configuration is generated. Oh (2017) reinvented Knuth's algorithm and was the first to implement it and apply it to SPL analyses. Since then, Knuth's algorithm has been adapted to other knowledge compilation and Davis-Putnam-Logemann-Loveland (DPLL) (Davis et al. 1962) approaches. In particular, (i) the KUS sampler (Sharma et al. 2018)

New: BDDSampler, a scalable and uniform sampler
Section 3 describes a new sampler called BDDSampler, which is based on Knuth's algorithm and implemented on top of the CUDD library for BDDs.
According to the experimental results reported in Section 5, the only sampler that satisfies both scalability and uniformity is BDDSampler. More specifically, the evidence shows that:

- BDDSampler, KUS, QuickSampler, and Spur are considerably faster than Smarch and Unigen2.
- In terms of uniformity, there are three types of samplers: (i) those that mostly fail to produce uniform samples (QuickSampler), (ii) those that usually work but from time to time generate non-uniform samples (KUS and Spur), and (iii) those that always produce uniform samples (BDDSampler, Smarch, and Unigen2).

Prior work on testing sampler uniformity
The following sections summarize the methods that have been devised to test the uniformity of a random sampler G.
Method 1: Generate a massive sample with G, and compare it with another one obtained by simulating an ideal uniform sampler

This is the most common technique in the literature (Achlioptas et al. 2018; Chakraborty 2015; Dutra et al. 2018; Plazar et al. 2019; Sharma et al. 2018). First, the total number n of SAT-solutions is counted for the Boolean formula φ, typically using a #SAT-solver. Having n, the generation of a uniform sample of size s is simulated as follows: imagine that the numbers 1, 2, …, n are put into a box; then, s numbers are sampled with replacement from the box, guaranteeing that the probability each number has of being extracted is 1/n. For example, JHipster encompasses 26,256 valid configurations (Halin et al. 2019). Figure 1 shows the histogram of a sample ten times greater than the number of configurations (s = 26,256 ⋅ 10), which has been obtained by sampling with replacement from the set {1, 2, …, 26256}. The x-axis depicts the numbers' occurrences, i.e., there are numbers that appear 0, 1, …, 27 times in the sample; the y-axis shows how frequent those occurrences are in the sample. As expected, most numbers appear ten times (see the red vertical line in Fig. 1); however, due to randomness, some numbers appear more frequently than others.
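The ideal-sampler simulation can be sketched as follows, with n and s taken from the JHipster example:

```python
import random
from collections import Counter

# Simulate a perfectly uniform sampler: draw s numbers with replacement from
# {1, ..., n}, then histogram how many times each number occurs (the data
# behind Fig. 1).

def ideal_sampler_histogram(n, s, rng=random.Random(42)):
    draws = Counter(rng.randrange(1, n + 1) for _ in range(s))
    occurrences = Counter(draws.values())   # occurrences[k]: numbers drawn k times
    occurrences[0] = n - len(draws)         # numbers never drawn
    return occurrences

hist = ideal_sampler_histogram(26_256, 26_256 * 10)
```

Most numbers occur around ten times, with random spread on both sides, matching the shape of Fig. 1.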
Another sample of size s (whose value is quantified shortly) is then generated with sampler G. For this sample, a counterpart histogram to Fig. 1 is obtained, representing how often solutions appear in that sample.
Finally, the uniformity of G is verified by measuring the distance between both histograms, using, for instance, the Kullback-Leibler divergence (Achlioptas et al. 2018).
Unfortunately, this method has a severe limitation: it does not scale except for formulas with a small number of SAT-solutions because, to produce reliable results, s needs to be much larger than n (see Achlioptas et al. (2018) and Dutra et al. (2018) for an explanation). For example, Dutra et al. (2018) propose s ≥ 5n. As the number of solutions grows exponentially with the number of variables of φ, the method only works for the simplest models with just a few features.

Method 2: Assume the existence of a uniform sampler U, and compare the samples generated by both G and U

Chakraborty and Meel (2019) proposed this method and its implementation, called barbarik. The method makes a strong assumption: there is a sampler U that is known to be uniform. Thus, two samples of the same size s are generated with G and U and, depending on the distance between the samples, i.e., on how similar they are, barbarik decides if G is approximately uniform.

The key of the method is how to define "approximately" to reach a balance between uniformity and sample size, i.e., to avoid the large s that Method 1 requires. Two parameters, called tolerance (ε) and intolerance (η), adjust the definition of "uniformity". A sampler is uniform whenever the probability p_1, p_2, …, p_n of all n solutions is exactly 1/n. Barbarik relaxes this definition, proposing that a sampler is additive almost-uniform if p_1, p_2, …, p_n ∈ [(1 − ε)/n, (1 + ε)/n], and that a sampler is η-far from uniformity when its output distribution deviates from the uniform one by at least η. Chakraborty and Meel claim that s depends on ε and η exclusively, but not on n. In particular, they state that a uniformity test with significance level α = 0.1 (i.e., 0.9 probability of accepting the uniformity of a sampler when it is genuinely uniform) and Type 2 error β = 0.1 (i.e., 0.9 probability of rejecting the uniformity of a sampler that is not uniform) is accomplished when ε = 0.6 and η = 0.9, requiring a sample size of 1,729,750. Unfortunately, they do not provide a detailed formal proof for these settings in Chakraborty and Meel (2019).
An evident weakness of this method is the necessity of a sampler with certified uniformity as a support lever. It is worth noting that, although an algorithm can be proven to generate uniform samples theoretically, its implementations may still contain errors. In other words, every sampling program needs to be tested, and thus Method 2 implicitly assumes the existence of another reliable uniformity testing method.

Method 3: Measure the distance between the theoretical variable probabilities and the empirical variable frequencies in a sample
Plazar et al.'s method (Plazar et al. 2019) begins by computing the theoretical probability that each variable x has to appear in a SAT-solution. To do so, the procedure introduced in Section 2.1.3 is adopted, calling a #SAT solver repeatedly, one time per variable. #SAT(φ) gives the total number of SAT-solutions, and #SAT(φ ∧ x) counts the number of solutions where x is true. Hence, the probability of x is p = #SAT(φ ∧ x) / #SAT(φ). Likewise, if x is true t times in a sample of size s, its empirical frequency is f = t/s. Then, the deviation between p and f is d = 100 ⋅ |p − f| / p. Finally, Plazar et al. propose two thresholds for d: (i) when d ≤ 10 for all variables, the deviations are very low, and thus sampler uniformity is accepted; (ii) when d ≥ 50 for some variables, they show very high deviations, and so uniformity is rejected. Regarding the sample size, Plazar et al. propose always using s ∼ 10^6, independently of the number of variables of φ (no formal justification is given for this specific value in Plazar et al. (2019)).
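The decision rule can be sketched directly; this is a simplified reading of the thresholds, with cases between them left undecided:

```python
# Per-variable deviation between the theoretical probability p and the
# empirical frequency f, with Plazar et al.'s acceptance/rejection thresholds.

def deviation(p, f):
    return 100 * abs(p - f) / p

def method3_verdict(probabilities, frequencies):
    ds = [deviation(p, f) for p, f in zip(probabilities, frequencies)]
    if all(d <= 10 for d in ds):
        return "uniform"
    if any(d >= 50 for d in ds):
        return "not uniform"
    return "undecided"
```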
Regrettably, this method often throws false negatives for variables with low probabilities. Suppose a variable has p = 0.01. Then, a genuinely uniform sampler might easily generate a sample where f is slightly different just due to randomness, e.g., f = 0.015. Therefore, d = 100 ⋅ |0.01 − 0.015| / 0.01 = 50, and thus the sampler's uniformity would be rejected. The chances that these types of wrong diagnoses happen increase with the number of low-probability variables, and it is worth noting that real models with numerous low-probability variables are not "corner cases"; for example, in three out of the seven configuration models analyzed in Heradio et al. (2019), more than 46% of the variables have p ≤ 0.05: the open-source project Fiasco v2014092821, the Dell laptop configurator, and the Automotive 02 system.

Method 4: A statistical goodness-of-fit test that compares the theoretical variable probabilities with the empirical variable frequencies in a sample
In past work, we presented a procedure called the Feature Probability (FP) test, which compares the empirical feature frequencies in a sample with the theoretical feature probabilities in the whole population of SAT-solutions. Instead of using Method 3's limited deviation measure, our FP method (i) has a robust mathematical basis, (ii) estimates the statistical significance of the results (i.e., how generalizable they are), and (iii) supports adjusting the sample size according to precise statistical criteria (i.e., Type 1 and 2 errors, and effect size).
It is worth noting that a major shortcoming of Methods 1, 2, and 3 is the large sample size they need. For instance, in Achlioptas et al. (2018) and Sharma et al. (2018), Method 1 is applied on a model called blasted_case110 with 287 variables, requiring s = 4 ⋅ 10^6 SAT-solutions. In Chakraborty and Meel (2019), Method 2 is used on blasted_case110 as well, this time needing 1,729,750 SAT-solutions to ensure error probabilities α = 0.1 (Type 1) and β = 0.1 (Type 2). In contrast, our FP test provides stronger test guarantees (α = 0.01 and β = 0.01) for blasted_case110 with a minimal sample size of 13,027 solutions (i.e., a 99.25% sample size reduction with respect to Method 2).

New: an improved goodness-of-fit test
Section 4 presents a new procedure that improves Method 4 by analyzing how the number of variables assigned to true is distributed along the SAT-solutions, instead of examining the variable probabilities. We show in Section 5 that the new method requires even smaller samples, thus widening the support for testing samplers' uniformity on larger models. For example, the sample size our new method requires for blasted_case110 with α = 0.01 and β = 0.01 becomes 6,563 solutions.

Recap
Excluding the methods presented in this article and in our conference paper, there are serious practical problems in applying the existing ways to test for sampler uniformity. We provide experimental evidence in Section 5 that our improved goodness-of-fit test is superior to prior work, as it requires the smallest sample size of all existing tests, thus enabling the verification of samplers' uniformity on large models. As we will see, this greatly increases the test's sensitivity for detecting samplers' uniformity flaws. Moreover, the results show that our test provides (i) valid judgements, which are consistent with the verdicts given by the alternative methods proposed in the literature, and (ii) reliable judgements, which remain consistent when the test is applied repeatedly to the same model and sampler.

The BDDSampler Tool
This section describes BDDSampler, a sampler that uses Binary Decision Diagrams (BDDs). A practical example of how configuration models can be translated into Boolean formulas is presented in Section 3.1. Then, the BDD encoding of a Boolean formula is covered in Section 3.2. Finally, how BDDSampler works is explained in Section 3.3.

From configuration models to Boolean formulas
Let us start with an example to help explain BDDSampler and our samplers' uniformity test. As already mentioned in this paper's introduction, BusyBox supports the inclusion/exclusion of a number of features at compile time. These features and their interrelationships are specified with a configuration language named Kconfig, which is used in many other relevant open-source projects (Berger et al. 2013), such as the Linux kernel, axTLS, EmbToolkit, Freetz, etc. Figure 2 shows an excerpt of the Kconfig specification of BusyBox v1.23.2. There are several configs encoding six features and their interdependencies. All features (STATIC, PIE, ..., FEATURE_SHARED_BUSYBOX) are Boolean (see the bool keyword in Lines 2, 4, ..., 15), meaning that they can be either selected or deselected. Configs trigger a prompt that requests the Boolean feature value from the user, e.g., Build BusyBox as a static binary (no shared libs) in Line 2. Finally, some dependencies between features are set, e.g., according to the depends sentence in Line 10, BUILD_LIBBUSYBOX can only be selected if none of the following features are selected: FEATURE_PREFER_APPLETS, PIE, or STATIC.
The graph in Fig. 3 depicts the entire BusyBox configuration model, which includes 613 features and 530 inter-dependencies; nodes represent features, and edges depict dependencies. The Kconfig excerpt in Fig. 2 is zoomed in Fig. 3.
Given the complexity of configuration models, they are usually translated into Boolean formulas that are then processed with logic engines. For instance, Eq. 1 is the Boolean encoding of Fig. 2 (a detailed explanation of how to convert Kconfig specifications into Boolean formulas is given in Fernandez-Amoros et al. (2019)). In this section and the following one, we explain how to use BDDs for (i) generating random samples from the formulas, and (ii) testing the uniformity of an input sampler.

A brief introduction to BDDs
A BDD (Bryant 1986) encodes a Boolean formula as a rooted directed acyclic graph composed of terminal and non-terminal nodes. There are two terminal nodes, true and false, and the non-terminal nodes are labeled with the formula variables. Two edges, named low and high, come out of every non-terminal node. Low is depicted with a dashed line (⤏), and high with a solid line (→). A BDD encodes every possible assignment of the formula variables as a path that descends from the root to the terminal nodes, going through solid lines when the corresponding variables are assigned true and through dashed lines otherwise. An assignment is satisfying, i.e., it evaluates the formula to true, whenever the traversed path ends at the true terminal. Figure 4 depicts a BDD that encodes the BusyBox excerpt specified by Eq. 1. A configuration whose only activated features are BUILD_LIBBUSYBOX and FEATURE_SHARED_BUSYBOX conforms with the constraints (i.e., it is valid), and so it corresponds to the path BUILD_LIBBUSYBOX → FEATURE_INDIVIDUAL ⤏ FEATURE_SHARED_BUSYBOX → FEATURE_PREFER_APPLETS ⤏ STATIC ⤏ PIE ⤏ true. In contrast, as STATIC and PIE are mutually exclusive, no configuration includes them simultaneously. Thus, all paths with solid lines coming out of both STATIC and PIE finish at the false terminal.
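The path-following semantics can be sketched with a minimal BDD structure (illustrative only, not CUDD's representation); the example encodes the mutual exclusion of STATIC and PIE:

```python
from dataclasses import dataclass
from typing import Union

# A non-terminal node tests one variable: follow `low` when it is false and
# `high` when it is true; Python's True/False act as the terminal nodes.

@dataclass
class Node:
    var: str
    low: Union["Node", bool]
    high: Union["Node", bool]

def evaluate(node, assignment):
    """Descend from the root along the path induced by the assignment;
    the terminal reached says whether the assignment is satisfying."""
    while isinstance(node, Node):
        node = node.high if assignment[node.var] else node.low
    return node

# Encodes ¬(STATIC ∧ PIE): selecting both features ends at the false terminal.
bdd = Node("STATIC", True, Node("PIE", True, False))
```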
BDDs are typically ordered and reduced. A BDD is ordered when its variables are in the same position, called the index, in every path from the root to the terminal nodes. For example, in Fig. 4, STATIC (whose index is 4) always goes before PIE and after FEATURE_PREFER_APPLETS (whose indices are 5 and 3, respectively). A BDD is reduced if it is free of redundant information. For instance, every blue/dark-shaded node in Fig. 4 is superfluous because both of its edges point to the same node, and thus the formula evaluation is identical whether these variables are assigned true or false. Consequently, these unnecessary tests are avoided in the reduced BDD in Fig. 5 to save computer memory.
It is worth noting that the variable ordering chosen to build the BDD has a tremendous impact on its size. Whereas a BDD can be reduced optimally (the reduction procedure was presented in the seminal article (Bryant 1986)), obtaining the variable arrangement that minimizes its size is an NP-hard problem (Chapters 8 and 9 of Meinel and Theobald (1998) provide a comprehensive discussion of this topic). Several variable ordering heuristics (Fernandez-Amoros et al. 2019; Mendonça 2009; Narodytska and Walsh 2007) have been proposed for the specific case of configuration model formulas. As reported in Section 5, we have been able to synthesize BDDs for large configuration models, with up to 17,000 features, by using these heuristics.

How BDDSampler works
BDDSampler takes an ordered and reduced BDD as input and generates random configurations in a two-step process described by Algorithms 1 and 2. Figure 5 summarizes the computations of Algorithm 1 for our running example. The algorithm decorates each non-terminal node with its probability of reaching the true terminal if the associated variable is set to true. Algorithm 1 proceeds in a bottom-up fashion, collecting the number of SAT solutions that can be produced by a node's low and high children (solLO and solHI in Lines 8-9), adding them up (sol in Line 10), and then computing the ratio corresponding to the high child (pr in Line 11). As the BDD is reduced, Algorithm 1 adjusts the solution counts in Lines 8-9 for the removed nodes with the expression 2^(index(n_LO|HI) − index(n) − 1). To traverse the BDD efficiently, Algorithm 1 uses Bryant's method (Bryant 1986) as follows: the algorithm is called in Line 12 with the BDD root as argument and with a Boolean mark for every node being either all true or all false; then, it explores all nodes by recursively visiting the low and high children (Lines 6 and 7). Whenever a node is visited, its mark value is complemented (Line 2). By comparing a node's mark with its children's marks, it is decided whether the children have already been visited. The method ensures that each node is visited exactly once and that, when the traversal finishes, all node marks have the same value.

Whereas Algorithm 1 is run once as an initialization method, Algorithm 2 needs to be run as many times as the number of configurations we want to generate. Algorithm 2 performs a random walk from the root to the true terminal. When a non-reduced node is visited, the path is selected randomly according to its probability (Lines 11-16): if the node probability is p, then its low and high edges are chosen with probabilities 1 − p and p, respectively.
Regarding the reduced nodes, the generated configuration will be valid no matter whether their variables are set to true or false (that is the reason why these nodes were removed). Thus, their value is chosen randomly with probability 1/2, taking into account that a reduced node's index may be less than the BDD root's index (Lines 6-7) or greater (Lines 17-18).
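The two phases can be sketched in simplified form for a regular BDD without complement edges, with variable indices 0..v−1 and terminals at index v (the function and field names are ours, not the pseudocode's):

```python
import random
from dataclasses import dataclass
from typing import Union

# Phase 1 (cf. Algorithm 1): count solutions bottom-up, adjusting for reduced
# nodes with the factor 2^(child_index - node_index - 1), and store on each
# node the probability of taking its high edge. Phase 2 (cf. Algorithm 2):
# random walk from the root, assigning skipped (reduced) variables with
# probability 1/2.

@dataclass
class Node:
    var: int
    low: Union["Node", bool]
    high: Union["Node", bool]
    prob: float = 0.0

def index(node, v):
    return v if isinstance(node, bool) else node.var

def annotate(node, v, cache=None):
    """Return the number of solutions below `node`, filling node.prob."""
    if isinstance(node, bool):
        return int(node)
    cache = {} if cache is None else cache
    if id(node) not in cache:
        lo = annotate(node.low, v, cache) * 2 ** (index(node.low, v) - node.var - 1)
        hi = annotate(node.high, v, cache) * 2 ** (index(node.high, v) - node.var - 1)
        node.prob = hi / (lo + hi)
        cache[id(node)] = lo + hi
    return cache[id(node)]

def sample(root, v, rng=random.Random()):
    """Draw one uniform solution as {var: bool} (annotate() must run first)."""
    assignment = {i: rng.random() < 0.5 for i in range(index(root, v))}
    node = root
    while not isinstance(node, bool):
        take_high = rng.random() < node.prob
        assignment[node.var] = take_high
        child = node.high if take_high else node.low
        for skipped in range(node.var + 1, index(child, v)):
            assignment[skipped] = rng.random() < 0.5
        node = child
    return assignment
```

For the formula x_0 ∨ x_1, the BDD has three solutions, and the annotated probabilities make each one equally likely.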
Algorithm 2 is remarkably fast since its time complexity is proportional to the number of indices (i.e., of variables), not the number of nodes in the BDD. Moreover, multiple instances of Algorithm 2 can be run in parallel over the same BDD, as the node probabilities are read but not modified.
Finally, BDDSampler is built on top of CUDD 3.0. Like other modern BDD libraries, such as Sylvan (van Dijk 2016), CUDD uses a technique called complement edges (Brace et al. 1990) to save nodes. With this technique, edges are enriched with a complement attribute that removes the need for two terminal nodes (basically, when an edge has the complement attribute enabled, the only terminal node is interpreted as its negation). Accordingly, BDDSampler tweaks Algorithms 1 and 2 to work with complement edges. For simplicity, we have shown the algorithms for regular BDDs without complement edges.
The BDDs of 218 models, which will be used in Section 5 to perform our experimental evaluation, are available at https://doi.org/10.5281/zenodo.4514919 in the DDDMP format that CUDD uses for complement-edge BDDs.

Figure 6 sketches our approach to verify that a sampler generates uniform random samples of a model that is encoded as a Boolean formula. The method compares empirical information about a sample with theoretical information about the whole population of SAT-solutions that the model represents.

The SFpC goodness-of-fit test
In statistics, the procedures for examining how well a sample agrees with the population distribution are known as goodness-of-fit tests (D'Agostino and Stephens 1986). They require characterizing both the sample and the population in terms of a quantitative measure. In particular, we propose using the distribution of the number of variables assigned to true among all SAT-solutions; the resulting test is called the Selected Features per Configuration (SFpC) test. For instance, Fig. 7 compares the theoretical distribution of all 7.428 ⋅ 10^146 SAT-solutions of the BusyBox model with the distributions of 17,738 configurations generated with the samplers BDDSampler and QuickSampler (Fig. 7a and b, respectively). The justification for this sample size of 17,738 is given in Section 4.2.
The distribution of the whole population of SAT-solutions of a model can be computed with the Product Distribution (PD) algorithm we proposed in Heradio et al. (2019). PD takes the BDD encoding of a model as input and, as explained in Section 3.D of Heradio et al. (2019), its time complexity is O(nv²), where n is the number of BDD nodes and v the number of model variables. Accordingly, PD scales for large models. For instance, on an Intel(R) Core(TM) i7-6700HQ, it took 2.74 minutes to compute the distribution of the Automotive02 model (Krieter et al. 2018), which, with 17,365 variables and 321,897 clauses, encompasses 5.26 ⋅ 10^1441 SAT-solutions.
As the theoretical histogram shows in Fig. 7a, the smallest and largest BusyBox configurations have 6 and 571 features activated, respectively. 95% of the configurations have between 277 and 327 variables assigned to true.
The BDDSampler histogram (Fig. 7a) agrees with the normally distributed population. However, the QuickSampler histogram (Fig. 7b) is bimodal, with most configurations having approximately 100 or 200 features, quite different from the theoretical histogram.
After exploring the sample's goodness-of-fit graphically, it is desirable to advance towards a more formal test that provides an accurate numerical quantification. A good candidate to measure the distance/difference between the sample and population distributions is the Kullback-Leibler divergence (Cover and Thomas 2006). For discrete probability distributions P and F specified on the same probability space Ω, the Kullback-Leibler divergence from F to P is defined as:

D_KL(P||F) = Σ_{i∈Ω} p_i log2(p_i / f_i)

However, the Kullback-Leibler divergence is not symmetric, and thus it cannot rigorously be considered a metric (Lin 1991). For this reason, we use its symmetric and normalized version, named the Jensen-Shannon divergence (Cover and Thomas 2006; Lin 1991) and defined as:

JSD(P||F) = (1/2) D_KL(P||M) + (1/2) D_KL(F||M), where M = (1/2)(P + F)

In our case, the vectors F and P are defined as follows:

- F = [f_0, f_1, …, f_n] stores the SAT-solution frequency distribution (i.e., the red histograms in Fig. 7). That is, f_i is the proportion of configurations in the sample with exactly i variables assigned to true.
- P = [p_0, p_1, …, p_n] stores the theoretical SAT-solution probability distribution of Fig. 7. That is, p_i is the probability that a valid configuration has exactly i variables assigned to true.

To avoid worthless comparisons, all i-elements with p_i = 0 are removed from F and P because, as all solutions in the sample are guaranteed to be valid, the corresponding f_i's are necessarily 0 as well. For instance, the BusyBox model has 613 variables, but all valid configurations have between 6 and 571 variables assigned to true. Therefore, {f_0, f_1, …, f_5, f_572, f_573, …, f_613} and {p_0, p_1, …, p_5, p_572, p_573, …, p_613} are deleted from F and P, respectively. The Jensen-Shannon divergence JSD(P||F) measures to what extent the difference between F and P is greater than expected by chance if F corresponded to a uniform random sample. In the extreme cases, JSD(P||F) = 0 when F totally matches P, and JSD(P||F) = 1 when F completely disagrees with P.
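With base-2 logarithms, the divergences above can be computed in a few lines (elements with p_i = 0 are assumed to have been removed beforehand, as described):

```python
import math

# Kullback-Leibler divergence (base-2 logs) and the symmetric, normalized
# Jensen-Shannon divergence, which ranges over [0, 1].

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, f):
    m = [(pi + fi) / 2 for pi, fi in zip(p, f)]
    return 0.5 * kl(p, m) + 0.5 * kl(f, m)
```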
Nevertheless, JSD is a mere distance/difference metric, i.e., by itself it cannot tell whether the observed JSD is significantly greater than expected due to randomness. Therefore, a statistical inference test is needed to quantify how generalizable the obtained distance is, i.e., a test that estimates the probability of a specific value of JSD(P||F) assuming that the sampler is genuinely uniform. If the estimated probability is excessively low (below a significance level α), it is unlikely that the disagreement between F and P is due to chance, and so we can conclude that the sampler is not uniform.
Let s be the sample size (whose value we compute in Section 4.2), and m the number of elements in P after having removed those with p_i = 0. According to the proof given by Grosse et al. in Section 4.C of Grosse et al. (2002), 2s(ln 2)JSD(P||F) has a χ² distribution with m − 1 degrees of freedom. As a result, a Chi-Squared goodness-of-fit test built upon the statistic 2s(ln 2)JSD(P||F) guides us to decide whether the sampler is uniform. In our BusyBox running example, s = 17,738 and m = 614 − 6 − 42 = 566, hence if the sampler is uniform then 2 · 17,738 (ln 2) JSD(P||F) should follow a χ² distribution with 565 degrees of freedom. In contrast to typical Null Hypothesis Significance Tests (NHSTs), where the null hypothesis H0 states the opposite of what the researcher seeks to demonstrate, goodness-of-fit tests are a special case of NHSTs where H0 is: "the sample agrees with the population" (see Chapter 3 of D'Agostino and Stephens (1986) for a detailed description of Chi-Squared goodness-of-fit tests). Coming back to our case study, let us set the threshold α = 0.01 to test the BusyBox samples generated with BDDSampler and QuickSampler. To sum up, the test corroborates numerically the histogram comparison in Fig. 7: BDDSampler generated a uniform sample, but QuickSampler did not.
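A minimal sketch of the resulting uniformity test, assuming SciPy is available (the JSD value passed in is hypothetical; s and m echo the BusyBox running example):

```python
import math
from scipy.stats import chi2

def uniformity_pvalue(jsd_value, s, m):
    # Under H0 ("the sample agrees with the population"),
    # 2*s*ln(2)*JSD(P||F) follows a chi-squared distribution with m-1 dof
    statistic = 2 * s * math.log(2) * jsd_value
    return chi2.sf(statistic, df=m - 1)

# Illustrative values in the spirit of the BusyBox example;
# reject uniformity whenever the p-value <= alpha (alpha = 0.01 here)
p = uniformity_pvalue(jsd_value=0.016, s=17738, m=566)
print(p)
```

Large JSD values inflate the statistic far beyond the χ² mean (m − 1), driving the p-value towards 0 and rejecting uniformity.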

Sample size estimation
The reliability of a Chi-Squared goodness-of-fit test depends on the following parameters (see Table 1):

- The significance level α sets the probability of making a Type 1 error, i.e., the probability of rejecting H0 when it is indeed true (false positive). It is worth noting that α is also the threshold for rejecting H0 (i.e., H0 is rejected whenever the p-value ≤ α).
- β sets the probability of making a Type 2 error, i.e., the probability of accepting a false H0 (false negative). The expression 1 − β is called the test's power, i.e., the probability of rejecting a false H0.
When H0 is false, it is false to some degree. That degree is measured by another parameter called the effect size (Lakens 2013). In particular, Cohen (1988) proposes the index w for measuring the effect size in Chi-Squared tests. As a rule of thumb, w values of 0.1, 0.3, and 0.5 correspond to small, medium, and large effect sizes, respectively. Interestingly, sample size, effect size, α, and β have an intimate relationship in NHSTs: given any three of them, the fourth can be determined. In Section 7.3 of Cohen (1988), Cohen provides power tables to estimate the minimum sample size required to ensure the reliability of a Chi-Squared test given the values of α, β, w, and the χ² distribution's degrees of freedom. Nowadays, statistical software is available that provides those tables, e.g., the R package pwr (see Chapter 10 of Kabacoff (2011)) and the G*Power tool (Faul et al. 2007).
In the previous section, we saw that the goodness-of-fit of any sample from the BusyBox configuration model can be assessed with a Chi-Squared test with 565 degrees of freedom. Then, according to Cohen's power tables, the required sample size is 17,738 configurations when α = β = 0.01 and w = 0.1.
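The required sample size can also be reproduced numerically instead of reading Cohen's tables. The sketch below, assuming SciPy, searches for the smallest sample size whose Chi-Squared test reaches the target power; depending on the noncentral χ² approximation used, the minimum may differ slightly from the tabulated 17,738:

```python
from scipy.stats import chi2, ncx2

def min_sample_size(w, df, alpha=0.01, beta=0.01):
    # Smallest s such that a chi-squared test with df degrees of freedom
    # detects effect size w with power >= 1 - beta at significance level alpha.
    crit = chi2.ppf(1 - alpha, df)               # rejection threshold under H0
    power = lambda s: ncx2.sf(crit, df, w * w * s)
    lo, hi = 1, 1
    while power(hi) < 1 - beta:                  # grow an upper bound
        hi *= 2
    while lo < hi:                               # binary search the minimum
        mid = (lo + hi) // 2
        if power(mid) >= 1 - beta:
            hi = mid
        else:
            lo = mid + 1
    return lo

s = min_sample_size(w=0.1, df=565)
print(s)  # in the neighborhood of Cohen's tabulated 17,738
```

Power is monotone in the sample size here because the noncentrality parameter of the alternative distribution is w²·s, so binary search is safe.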

Empirical evaluation
This section describes the experimental evaluation of our approach using the Goal/Question/Metric (GQM) method (van Solingen and Berghout 1999). As Fig. 8 shows, the evaluation pursues two goals (G1 and G2), which are refined into five questions (Q1-Q5) that are answered using different metrics.
Section 5.1 presents the experimental setup. As Fig. 9 shows, three experiments E1-E3 were performed to answer the questions (e.g., Experiment E2 supported answering Questions Q2, Q3, and Q5). Sections 5.2-5.6 describe these experiments and the specific metrics used to answer the questions. The detailed results and all the material needed to replicate our experiments are available in the public repositories presented in Section 8.

Experimental setup
The samplers were tested against a suite of 218 models encoded as Boolean formulas in several formats. Table 2 describes the nine configuration models (the largest model is also referred to as Automotive02 in the SPL literature (Krieter et al. 2018)).
The histogram in Fig. 10 shows the model size distribution according to the number of variables. Since there is a wide range from the smallest model in the benchmark to the largest one (from 14 to 18,570 variables), the scale has been logarithmically transformed to shrink the range and thus facilitate the figure's interpretation (see Chapter 5 of Winter (2020) for an explanation of logarithmic scale transformations). The scatter plot in Fig. 10 represents the model sizes in terms of their variables and clauses. The grey regression line shows that Log2(#Clauses) ≈ 1.35 + 1.03 · Log2(#Variables). Points corresponding to configuration models are labeled, and models with more and fewer clauses than predicted by the linear regression are colored red and blue, respectively. Note that in the interval [9.88, 11.2] of Log2(#Variables) there are only 5 models, and all of them have fewer clauses than predicted. As these models are simpler in terms of clauses, processing them requires less time than expected for their number of variables, and thus the regression curves in Figs. 12, 16, and 17 will show positive convexity in that interval.
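To make the fitted line concrete, a model's expected clause count can be read off the regression (the coefficients are those reported for Fig. 10; the query size is just an example):

```python
import math

def predicted_clauses(num_variables):
    # Fitted line from Fig. 10: log2(#Clauses) = 1.35 + 1.03 * log2(#Variables)
    return 2 ** (1.35 + 1.03 * math.log2(num_variables))

# Expected clause count for a BusyBox-sized model (613 variables)
print(round(predicted_clauses(613)))
```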
The experiments were run on an Intel(R) Core(TM) i7-6700HQ, 2.60 GHz, 16 GB RAM, running Linux Ubuntu 19.10. Samplers were executed on a single thread (i.e., with no parallelization) and without any Boolean formula preprocessing, such as Minimal Independent Support (MIS) (Ivrii et al. 2016).

Q1: Scalability of samplers
The following experiment E1 was undertaken to obtain a sample set S1 for answering Q1. Each sampler generated a sample with one thousand configurations for every model in the benchmark. The timeout for each sample generation was set to one hour. Table 3 summarizes the generation times for the configuration models (as the regression in Fig. 10 suggests, the number of variables and clauses are highly correlated, Pearson's r = 0.97). The histogram in Fig. 11 shows the percentage of samples that each sampler was able to generate. In total, 257.92 hours (10.75 days) of CPU time were needed for generating the samples (or reaching the timeout).
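The per-sample timeout protocol can be sketched as follows; the command line is a stand-in, since each sampler has its own interface:

```python
import subprocess
import time

def run_sampler(cmd, timeout_s=3600):
    # Run one sampler invocation and record its wall-clock time,
    # or None when the one-hour budget is exceeded
    start = time.time()
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return time.time() - start
    except subprocess.TimeoutExpired:
        return None

# Hypothetical invocation: ask a sampler for 1,000 configurations of a model
elapsed = run_sampler(["echo", "sample --size 1000 model.dimacs"])
```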

Q2: Uniformity of samplers
The following experiment E2 was carried out to obtain a sample set S2 for answering Q2, Q3, and Q5. Each sampler was run to generate a sample for every model in the benchmark. As Section 5.4 will explain in detail, the number of configurations per sample was estimated for α = 0.01, β = 0.01, and w = 0.1. The timeout for each sample generation was set to one hour. In total, 373.5 hours (15.56 days) of CPU time were needed for generating the samples (or reaching the timeout). The histogram in Fig. 13 summarizes the results. Nearly all samples produced by BDDSampler, Smarch, Spur, and Unigen2 obtained high p-values in the range (0.9, 1]. In contrast, KUS and QuickSampler generated many samples with p-values in the interval [0, 0.1]. Since α is set to 0.01, remember from Section 4.2 that a p-value less than or equal to 0.01 means rejecting the uniformity hypothesis. Likewise, a p-value close to 1 reflects that the sample strongly supports the uniformity hypothesis. Table 4 summarizes the p-values for the configuration models in detail.

Fig. 11
Percentage of samples that each sampler was able to generate (sample size = 1,000 configurations; timeout = 1 hour)

KUS and Spur implement Knuth's sampling procedure (see Section 2.1.3). Accordingly, they should be uniform "by design". Moreover, the KUS and Spur empirical validations in Sharma et al. (2018) and Achlioptas et al. (2018), respectively, did not detect any problem (though only small models with a few hundred variables were used). However, our inspection using more varied and larger models revealed the following uniformity flaws:

- As Fig. 13 shows, 16.4% of the KUS samples got a p-value in [0, 0.1]. Furthermore, in 15.89% of the cases, the p-values were less than α = 0.01, and thus rejected the uniformity hypothesis. Figure 14 shows four examples where KUS uniformity was rejected. Each subfigure compares, for a particular model, the histogram of the SAT-solution distribution of the whole population (in blue) with the distribution of the generated sample (in red). Unfortunately, the rejected samples do not show any clear pattern that explains the causes of KUS failures. For instance, KUS exhibits difficulties with small models (blasted_case63) but also with large ones (blasted_squaring26), with normal distributions (blasted_case63 and s1238a_7_4) and non-normal distributions (s526_15_7 and blasted_squaring26), with left-skewed distributions (s1238a_7_4) and right-skewed distributions (blasted_case63), etc.
- In our previous evaluation, we detected that Spur generated uniform samples for all models except for EmbToolkit. We thought our test was making a Type 1 error, misjudging the sampler's uniformity because an extremely low p-value happened due to randomness. However, when we checked the samplers' uniformity with our new test, we obtained exactly the same results for this particular model, which raised our suspicions. We repeated the experiment one thousand times and Spur never generated a uniform sample for EmbToolkit.
Figure 15 shows the results for two of those experiment repetitions. In this case, Spur's error always displays the same pattern: the solutions in the sample have more variables assigned to true than in the population.

Q3: Scalability of the SFpC test
Two factors influence the scalability of a uniformity test when applied to a particular model and sampler: (i) the number of configurations the test needs to consider, and (ii) the time the test invests in analyzing those configurations. Concerning the first factor, and as discussed in Section 2, the methods proposed in the literature to verify samplers' uniformity require colossal sample sizes with millions of configurations. Thus, uniformity had so far been tested only on trivial models with a few hundred variables. To support evaluating uniformity over more complex models, in Heradio et al. (2020) we proposed the FP test, which compares the variable frequency distribution of a sample with the variable probability distribution of the entire population. With this test, we could validate samplers' uniformity on models with more than seventeen thousand variables. Figure 16 compares the sample sizes that the FP test needs (in red) with the sample sizes our new SFpC test requires (in blue), showing that the latter needs fewer configurations in most cases.
In Fig. 16, each model's sample size was determined with the procedure described in Section 4.2. In particular, the R package pwr (Kabacoff 2011) was used to perform Cohen's power table calculations. To ensure the highest reliability of the samplers' uniformity tests (see Section 5.4), we set α = 0.01, β = 0.01, and w = 0.1. That is, the χ² test confidence level was fixed to 99%, the power to 99%, and the effect size to small. Table 5 compares in detail the sample sizes obtained for the configuration models.
The sample size depends on the model's degrees of freedom in both the FP and the SFpC tests. Nevertheless, each test defines degrees of freedom in a different way. The degrees of freedom of the FP test, df_FP, are the number of variables (minus one) whose probability is neither zero nor one (see Section 3 of Heradio et al. (2020)). The degrees of freedom of the SFpC test, df_SFpC, are the number of cases (minus one) for which there is at least one valid configuration with a particular number of variables assigned to true (see Section 4.1). As Fig. 16 shows, in practice df_SFpC ≤ df_FP, and therefore the SFpC test consumes fewer configurations.
Regarding the time SFpC requires to analyze the generated configurations, once the theoretical distribution of SAT solutions is known, the remaining computations can be performed extremely fast (see Section 4.1). So, SFpC's potential bottleneck is obtaining that distribution with the algorithm PD. Figure 17 shows the time it took to compute the theoretical distribution for each model in the benchmark, ranging from 0.02 seconds to 14.14 minutes. Table 6 details the times for the configuration models. It is worth noting that the model which needed the longest time was s1196a_3_2, which is an industrial SAT formula (thus not included in Table 2). This illustrates the dependency that BDDs have on variable ordering heuristics. Whereas this model has a medium-size CNF formula (690 variables and 1,805 clauses), the BDD we synthesized was huge (2,284,697 nodes). In contrast, for LargeAutomotive (17,365 variables and 321,897 clauses) a much smaller BDD was obtained (30,432 nodes), and hence computing its theoretical SAT-solution distribution took just 2.74 minutes.
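For intuition about what PD computes, the theoretical SAT-solution distribution can be obtained by brute force on toy formulas (PD does this efficiently on a BDD instead; the formula below is an illustrative placeholder):

```python
from itertools import product

def sat_solution_distribution(formula, num_vars):
    # Count valid configurations by their number of variables assigned to true.
    # formula: predicate over a tuple of booleans (brute force; toy sizes only)
    counts = [0] * (num_vars + 1)
    for assignment in product([False, True], repeat=num_vars):
        if formula(assignment):
            counts[sum(assignment)] += 1
    total = sum(counts)
    return [c / total for c in counts]  # the vector P of Section 4.1

# Toy model "x1 implies x2": valid configurations are 00, 01, and 11
P = sat_solution_distribution(lambda a: (not a[0]) or a[1], 2)
print(P)
```

Here every entry of P is non-zero, so df_SFpC = 3 − 1 = 2 for this toy model.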

Fig. 16 The SFpC test requires smaller samples than the FP test

Q4: Validity of SFpC
Two criteria are typically used for assessing measurement quality (Trochim et al. 2015): validity and reliability. Since we are interested in the quality of SFpC measurements, validity will refer to what extent SFpC actually measures uniformity, and reliability will refer to repeatability, i.e., to the consistency of the results obtained when SFpC is applied several times to the same sampler and model. This section examines SFpC's validity, and the next section deals with SFpC's reliability.
To evaluate SFpC's validity, we followed a convergent strategy (Trochim et al. 2015) by examining the degree to which SFpC results are similar to those obtained by other uniformity tests. Table 7 summarizes the uniformity verdicts reported in the literature. There is a total consensus that Unigen2 is uniform and QuickSampler is not. SFpC results are consistent with this consensus.
As we mentioned in Section 5.4, before the publication of FP in Heradio et al. (2019), the literature relied on limited tests that could only handle the simplest models with a few hundred variables. As more complex models are considered, the chances of detecting additional sampler flaws increase. In other words, the sensitivity of FP and SFpC is higher than that of their predecessors. Accordingly, we performed a new Experiment E3 focused on checking the convergent validity of FP and SFpC in detail. A new sample set S3 was procured by asking each sampler to generate a sample for every model in the benchmark. Then, the uniformity of the samples was analyzed with both FP and SFpC. Since FP generally needs larger samples than SFpC (see Section 5.4), the sample sizes were set according to FP's requirements. Pearson's correlation coefficient of the p-values obtained with FP and SFpC was r = 0.953, and Cohen's kappa of the test verdicts (i.e., rejection/acceptance of a sampler's uniformity) was κ = 0.942. As FP and SFpC results were numerically highly correlated, and their final judgments were remarkably consistent, convergent validity was successfully confirmed.
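The two agreement measures can be sketched as follows (the p-values and the α = 0.01 verdict rule are illustrative; the real computation ran over all benchmark models):

```python
import math

def pearson_r(x, y):
    # Pearson's correlation coefficient between two equal-length vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cohen_kappa(a, b):
    # Agreement between two binary verdict lists (True = uniformity rejected)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    return (po - pe) / (1 - pe)

# Hypothetical p-values from the FP and SFpC tests on five models
fp_p = [0.95, 0.002, 0.80, 0.005, 0.60]
sf_p = [0.93, 0.004, 0.85, 0.003, 0.55]
print(pearson_r(fp_p, sf_p))
print(cohen_kappa([p <= 0.01 for p in fp_p], [p <= 0.01 for p in sf_p]))
```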

Q5: Reliability of SFpC
SFpC's reliability was evaluated with a test-retest strategy (Trochim et al. 2015) by comparing its results on the sample sets S2 and S3. Pearson's correlation coefficient of the p-values calculated with SFpC on S2 and S3 was r = 0.950, and Cohen's kappa of the corresponding final judgments (i.e., rejection/acceptance of a sampler's uniformity) was κ = 0.939. As a result, SFpC's reliability was positively evaluated.

Discussion
The experimental results indicate that SFpC supports testing samplers' uniformity on complex models with thousands of variables and constraints, providing valid and reliable judgments. The results show that the only sampler that satisfies both scalability and uniformity is BDDSampler. These results have practical implications:

1. BDDSampler can support the many SPL analyses that require sampling (Krieter et al. 2020). Furthermore, the applicability of BDDSampler goes beyond the SPL domain, since sampling is also needed in artificial intelligence (Chakraborty and Meel 2019; Dutra et al. 2018; Roy et al. 2018), integrated circuit simulation and verification (Hëbner and Becker 2011; Naveh et al. 2006; Yuan et al. 2002), etc.

2. SFpC can be used to debug and thus improve existing samplers (see Figs. 14 and 15), or to validate future samplers. The importance of samplers' validation is well recognized by the SPL community. Recently, in the SPLC 25th edition, there was a session dedicated to "Sampling, variability analysis and visualization", where two tools for samplers' evaluation were presented: BURST (Acher et al. 2021) and AutoSMP (Pett et al. 2021). Those tools could be enhanced by integrating SFpC; e.g., BURST relies on Barbarik, which has inferior performance to SFpC (see Section 2.2.2). Again, the interest in samplers' validation is not restricted to SPLs. In fact, most uniformity tests have been proposed by artificial intelligence researchers, mainly from the SAT community.

It is worth noting that our work has the following limitation: both BDDSampler and SFpC rely on BDD technology. Synthesizing the BDD encoding of a variability model is sometimes unattainable. This is because the variable ordering chosen to build a BDD dramatically impacts its size, and finding the optimal ordering is an NP-hard problem, so the search is approached heuristically without guarantees. This problem principally affects BDDSampler, but not SFpC as much:

1. BDDSampler receives a model's BDD encoding as input. If the available heuristics fail to find an adequate variable ordering, BDDSampler becomes useless, and an alternative technology (e.g., SAT) must be used.

2. SFpC can evaluate any sampler, independently of the technology in which it is built. Indeed, Section 5 reports the use of SFpC for samplers implemented with BDDs (BDDSampler), #SAT-solvers (QuickSampler, Smarch, Spur, and Unigen2), and d-DNNFs (KUS). The impossibility of creating a BDD for a particular model only prevents SFpC from including it in the benchmark presented in Section 5.1. Currently, the benchmark includes 218 models with their respective BDDs. In our opinion, the models' great variety in terms of size (from 14 to 18,570 variables) and application domain (automotive industry, embedded systems, a laptop customization system, a web application generator, integrated circuits, etc.) is adequate to ensure samplers' verification to a great degree.
Finally, the following threats to our study's validity should be taken into account:

1. There is no absolute guarantee that the samplers we have certified as uniform behave uniformly on models not included in the benchmark.

2. Our experimental design discards two potential confounders for evaluating the scalability of samplers:

- Sampling parallelization. Although any sampler can be run in a multi-core fashion, thus producing samples concurrently, only Unigen2 and Smarch were specifically designed for that. The focus of our evaluation is on the sampling techniques, not on how those techniques can be parallelized efficiently. Therefore, all samplers were run on a single thread.

- Use of preprocessing techniques. There are methods to preprocess the models' Boolean formulas to speed up further computations. For example, Ivrii et al. (2016) claim that sampling with the formulas' MIS produces a 2-3 orders of magnitude performance improvement. Nevertheless, Plazar et al.'s (2019) empirical results contradict that, showing no running-time difference between sampling from the whole formula or from the MIS. In any case, we decided to focus on the sampling techniques, not on how additional preprocessing methods may impact them.

Conclusions
The number of SAT solutions that configuration models encompass can be so large that most analyses can be performed neither by examining every valid configuration, nor by calling a SAT solver massively. Statistical inference opens an alternative way to address these problems by working with a tractable sample that accurately predicts results for the entire space. However, the laws of statistical inference impose an indispensable requirement: samples must be collected at random, i.e., the configuration space needs to be covered uniformly. Two major research challenges on SAT-solution random sampling have been addressed in this paper: we (i) developed a new random sampler, called BDDSampler, and (ii) proposed a goodness-of-fit test to verify samplers' uniformity. Our new test requires the smallest sample size of all methods existing in the literature, supporting the assessment of samplers' uniformity even on colossal models and under the strictest reliability settings. Using this test, we have undertaken the empirical evaluation of six state-of-the-art samplers, revealing that only BDDSampler satisfies both uniformity and scalability.
It is worth remarking that BDDSampler takes a BDD encoding of a configuration model as input, and synthesizing such a BDD is not always feasible, as it depends on finding an adequate variable order heuristically. Our work deals with this limitation by exposing uniformity bugs in two scalable samplers based on alternative technologies (KUS on d-DNNFs and Spur on #SAT), thus facilitating their fixing. Having all these samplers available would support coping with the varied difficulties that the Boolean encoding of configuration models poses (e.g., intractably large CNFs, enormous BDDs, etc.).

Material
Following open science's good practices, our software artifacts are publicly available.