Refined Grey-Box Fuzzing with SIVO

We design and implement from scratch a new fuzzer called SIVO that refines multiple stages of grey-box fuzzing. First, SIVO refines data-flow fuzzing in two ways: (a) it provides a new taint inference engine that requires only logarithmic in the input size number of tests to infer the dependency of all program branches on the input bytes, and (b) it deploys a novel method for inverting branches by solving directly and efficiently systems of inequalities. Second, our fuzzer refines accurate tracking and detection of code coverage with simple and easily implementable methods. Finally, SIVO refines selection of parameters and strategies by parameterizing all stages of fuzzing and then dynamically selecting optimal values during fuzzing. Thus the fuzzer can easily adapt to a target program and rapidly increase coverage. We compare our fuzzer to 11 other state-of-the-art grey-box fuzzers on 27 popular benchmarks. Our evaluation shows that SIVO scores the highest both in terms of code coverage and in terms of number of found vulnerabilities.


INTRODUCTION
Fuzzing is the automatic generation of test inputs for programs with the goal of finding bugs.With increasing investment of computational resources for fuzzing, tens of thousands of bugs are found in software each year today.We view fuzzing as the problem of maximizing coverage within a given computational budget.The coverage of all modern fuzzers improves with the computation budget allocated.Therefore, we can characterize the quality of a fuzzer based on its rate of new coverage, say, the number of new control-flow edges exercised, per CPU cycle on average.
Broadly, there are three types of fuzzers.Black-box fuzzers do not utilize any knowledge of the program internals, and are sometimes referred to as undirected fuzzers.White-box fuzzers perform intensive instrumentation, for example, enabling dynamic symbolic execution to systematically control which program branches to invert in each test.Grey-box fuzzers introduce low-overhead instrumentation into the tested program to guide the search for bugtriggering inputs.These three types of fuzzers can be combined.For instance, recent hybrid fuzzers selectively utilize white-box fuzzers in parallel to stand-alone grey-box fuzzers.Of the three types of fuzzers, grey-box fuzzers have empirically shown promising cost-to-bug ratios, thanks to their low overhead techniques, and have seen a flurry of improved strategies.For example, recent grey-box fuzzers have introduced many new strategies to prioritize seed selection, byte mutations, and so on during fuzzing.Each of these strategies work well for certain target programs, while being relatively ineffective on others.There is no dominant strategy that works better than all others on all programs presently.
In this paper, we present the design of a new grey-box fuzzer called Sivo that generalizes well across many target programs.Sivo embraces the idea that there is no one-size-fits-all strategy that works universally well for all programs.Central to its design is a "parameterization-and-optimization" engine to which many specialized strategies and their optimization parameters can be specified.The engine dynamically selects between the specified strategies and optimizes their parameters on-the-fly for the given target program based on the observed coverage.The idea of treating fuzzing as an optimization problem is not new-in fact, many prior fuzzers employ optimization either implicitly or explicitly, but they do so partially [4,22,30,35].Sivo differs from these works conceptually in that it treats parameterization as a first-class design principleall of its internal strategies are parameterized.The selection of strategies and determination of all parameter values is done dynamically.We empirically show the power of embracing complete parameterization as a design principle in grey-box fuzzers.
Sivo introduces 3 additional novel refinements for grey-box fuzzers.First, Sivo embodies a faster approximate taint inference engine which computes taint (or sensitivity to inputs) for program branches during fuzzing, using number of tests that are only logarithmic in the input size.Such taint information is helpful for directed exploration in the program path space, since inputs influencing certain branches can be prioritized for mutation.Our proposed refinement improves exponentially over a recent procedure to calculate taint (or data-flow dependencies) during fuzzing [12].Second, Sivo introduces a light-weight form of symbolic interval reasoning which, unlike full-blown symbolic execution, does not invoke any SMT / SAT solvers.Lastly, it eliminates deficiencies in the calculation of edge coverage statistics used by common fuzzers (e.g.AFL [37]), thereby allowing the optimization procedure to be more effective.We show that each of these refinements improves the rate of coverage, both individually and collectively.
We evaluate Sivo on 27 diverse real-world benchmarks comprising several used in recent work on fuzzing and in Google OSSfuzz [15].We compare Sivo to 11 other state-of-the-art grey-box fuzzers.We find that Sivo outperforms all fuzzers in terms of coverage on 25 out of the 27 benchmarks we tested.Our fuzzer provides 20% increase in coverage compared to the next best fuzzer, and 180% increase compared to the baseline AFL.Furthermore, Sivo finds most vulnerabilities among all fuzzers in 18 of the benchmarks, and in 11 benchmark programs finds unique vulnerabilities.This provides evidence that Sivo generalizes well across multiple programs RQ1: Impact of Complete Parameterization?Fuzzers optimize for coverage.There is no single fuzzing strategy that is expected to work well across all programs.So, the use of multiple strategies and optimization seems natural.Existing fuzzers do use dynamic strategy selection and optimize the parameter value selection.For example, MOpt [22], AFLFast [4], and EcoFuzz [35] use optimization techniques for input seed selection and mutations.But, often such parameterization comes with internal constants, which have been hand-tuned on certain programs, and it is almost never applied universally in prior fuzzers.The first question we ask is what would be the result of complete parameterization, i.e., if we encode all subroutines and their built-in constants as optimization parameters.
The problem of increasing coverage is equivalent to the problem of inverting more branches.In the initial stage of fuzzing, when the number of not yet inverted branches is high, AFL mutation strategies (such as mutation of randomly chosen bytes) are successful and often help to invert branches in bulk.However, easily invertible branches soon become exhausted, and different strategies are required to keep the branch inversion going.One way is to resort to targeted inversion.In targeted inversion, the fuzzer chooses a branch and mutates input bytes that influence it.The following two questions are about refining target inversion in grey-box fuzzing.
RQ2: Efficient Taint Inference?Several fuzzers have shown that taint information, which identifies input bytes that influence a given variable, is useful to targeted branch inversion [2,6,8,12,26,34].If we want to flip a particular branch, the input bytes on which the branch condition variables depend should be mutated while keeping the other bytes unchanged.The main challenge, however, is to efficiently calculate the taint information.Classical methods for dynamic taint-tracking incur significant instrumentation overheads whereas static methods have false negatives, i.e. they miss dependencies due to imprecision.The state-of-the-art fuzzers aim for light-weight techniques for dynamically inferring taint during fuzzing itself.Prior works have proposed methods which require number of tests linear in , the size of the seed input [12].This is extremely inefficient for programs with large inputs.This leads to our second question: Can we compute useful taint information but with exponentially fewer tests?RQ3: Efficient Constraint-based Reasoning?Taint only captures whether a change in certain values of an input byte may lead to a change in the value of a variable.If we are willing to compute more expressive symbolic constraints, determining the specific input values which cause a program branch to flip is possible.The challenge is that computing and solving expressive constraints, for instance first-order SAT/SMT symbolic formulae, is computationally expensive.In this work, we ask: Which symbolic constraints can be cheap to infer and solve during grey-box fuzzing?RQ4: Precise coverage measurement?Grey-box fuzzers use coverage information as feedback to guide input generation.AFL, and almost all other fuzzers building on it, use control-flow edge counts as a common metric.Since there can be many control-flow edges in the program, space-efficient data structures for storing runtime coverage data are important.Recent works have pointed out AFL's hash-based coverage map can result in collisions [13], which has an unpredictable impact on the resulting optimization.How do we compute compressed edge counts with high precision using standard compilers for instrumentation?

OVERVIEW OF SIVO
Grey-box fuzzers instrument the target program to gather runtime profiling data, which in turn guides their seed generation strategies.The objective of Sivo is to generate seeds that increase code coverage by using better and more of the profiling data.Sivo addresses the four research questions with four refinements.
Parametrize-optimize approach (RQ1).Sivo builds on the idea of complete parameterization of all fuzzing subroutines and strategies, i.e. none of the internal parameters are hard-coded.Sivo selects strategies and parameter values dynamically based on the observed coverage statistics, using a standard optimization algorithm.Such complete parameterization and optimization inherently makes Sivo adaptable to the target program and more general, since specialized strategies that work best for the program are prioritized.To answer RQ1, we empirically show in our evaluation that this design principle individually helps Sivo outperform other evaluated fuzzers across multiple target programs.
Fast Approximate Taint Inference (RQ2).We devise a fast and approximate taint inference engine TaintFAST based on probabilistic group testing [10].Instead of testing individually for each input byte, TaintFAST tests for carefully chosen groups of bytes and then combines the results of all tests to infer the taint for each individual byte.This helps to reduce the test complexity of taint inference from () to (log ) executions of the program, where  is the number of input bytes.Thus the fuzzer can infer useful taint dependency even for very large inputs using TaintFAST.

Symbolic Interval Constraints (RQ3).
We propose inferring symbolic interval constraints that capture the relationship between inputs and variables used in branch conditions only.Instead of deductively analyzing the semantics of executed instructions, we take an optimistic approach and infer these constraints from the observed values of the inputs and branch conditional variables.The valuebased inference is computationally cheap and tailored for a common case where values of the variables are direct copies of the inputs and when branches have comparison operations (=, ≠, <, ≤, >, ≥).
We show that such a constraint system can be solved efficiently as well without the use of SAT / SMT solvers.
Compressed and Precise Edge Count Recording (RQ4).We tackle both the collision problem and the compressed edge count problem in tracking coverage efficiently during grey-box fuzzing.For the former, we show a simple strategy based on using multiple basic block labels (rather than only one as in AFL) and reduce or entirely eliminate the collisions.For the later, to improve the prospect of storing important edge counts we propose temporary coverage flushing (i.e.resetting the coverage to zero).Although this may appear to be a minor refinement in grey-box fuzzing, we find that it has a noticeable impact experimentally.

DESIGN
We present the details of our four refinements in Sections 4.1-4.4 and then show the complete design of Sivo in Section 4.5.

The Complete Parameterization Paradigm
The Sivo grey-box fuzzer aims to increase the code coverage in the fuzzed programs.Two points are central to this goal.First, fuzzed programs come in different flavors, hence the fuzzer should be flexible and adaptive.We tackle the first point with parametrization, i.e. by expanding the choice of available fuzzer subroutines.Second, a fuzzer has a few stages (i.e., selection of seeds, choice of mutations and their parameters, etc), and each one of them can be optimized.To address this point, we apply a complete optimization of all available parameters.
Parametrization.The more fuzzing subroutines are available, the higher the chance that some of them may be optimal for fuzzing the targeted program.Thus it is useful to expand the set of available fuzzing subroutines.To do so, we: • Add many fuzzing subroutines.For instance, in addition to the AFL-style vanilla mutations that do not require any dependency information (e.g.mutate random bytes), we implement data-flow strategies that utilize input dependency of program branches (e.g, mutation of dependent bytes).Besides adding new mutations, we also add more seed prioritization methods that determine how to sample a seed from the pool.Optimization.The parametrization increases the chance that potentially optimal subroutines are chosen for each program.The next step is to select which subroutines are turned on for a given program.It is critical to understand that we are not dealing with a single optimization problem.Fuzzing is a continuous process, composed of iterations that select a seed and a mutation, apply the mutation to the seed, and check on coverage increase.Thus, in each iteration we need to optimize the selection of fuzzing subroutines several times-for example, the used seed criterion and class, the mutation strategy, (potentially a number of) mutations sub-strategies, the inputs to the mutation strategy, and so on.For this purpose, we use multi armed bandits (MAB), a simple reinforcement learning algorithm.Given a set of choices, each choice providing a certain reward when selected, MAB helps to select the choices such that their accumulative rewards are maximized.The rewards are unknown and stochastic, and the selection process is continuous.Note, after MAB selects a choice, it needs to receive as a feedback the obtained reward to update its choice selection strategy.
Reducing the selection of fuzzing subroutines to MAB problem is straightforward.First, note that we consider each selection as an independent MAB problem, for instance, the optimal number of random bytes to mutate is one MAB problem.Our objective is to maximize the coverage, hence it is natural to use the additional coverage acquired from executing the choice as the MAB reward.However, this metric alone may not be accurate because some choices incur higher computational costs.Therefore, we use the additional coverage per time unit as the reward.In the conventional MAB, the distributions of rewards are stationary with some unknown mean.In our case, as the fuzzer progresses, it requires more computational effort to reach the remaining unexplored code and increase coverage.In other words, the rewards for the selection choices monotonically decrease over time.Therefore, we model our problem as MAB with non-stationary rewards and use discounting to solve it [19].For more details on application of MAB in Sivo, we refer the reader to Algorithm 1 and Section 4.5.
Besides optimizing for selections, during fuzzing we use genetic algorithms (GA) to optimize any black-box objective functions that arise during inversion of branches.More precisely, we reduce parts of the inversion problem to a black-box optimization and apply GA to speed-up the inversion.In vanilla type fuzzing, we use GA to search for optimal positions of a fixed number of bytes to mutate.
For this purpose, as an objective function we use the number of branches that are affected (i.e.change some of their variables) when selected bytes are mutated.As the number of affected branches increases, so does the chance of inverting one of them by mutating the found optimal position bytes.In data-flow fuzzing, with GA we invert non-discriminatory targeted branches.In this case, as an objective function we use the distance of the resulting branch value to the value that corresponds to branch inversion.For instance, for the branch " if(x == 5) ", the objective function is | − 5|.When the distances reaches zero, i.e. the minimum of the objective function is reached, the branch is inverted.To infer dependency of branches on input bytes, earlier fuzzers relied on the truth value of branch conditions: if changing the value of a particular byte changes the truth value of a branch, then it is inferred that the branch depends on this byte.For instance, in Fig. 1, to correctly infer the dependency of the branch at line 6, the engine first needs to select for mutation the input byte x[100] and then to change its value from any other than 40 to 40.GreyOne [12] proposed so-called fuzzing-driven taint inference FTI by switching the focus from the truth value of a branch to the value of the variables used in the branch.For instance, FTI determines the dependency of branch at line 6 on x[100] as soon as this input bytes is mutated, because this will lead to a change of the value of the variable  that is used in the branch.FTI is sound (no over-taint) and incomplete (some under-taint).Exact reasoning with provable soundness or completeness is not a direct concern in fuzzers, since they only use it to generate tests which are concretely run to exhibit bugs.

Fast Approximate Taint Inference
The prime issue with FTI, which improves significantly over many other prior data-flow based engines, is efficiency.The taint is inferred by mutating bytes one-by-one in FTI.Thus, to infer the full dependency on all input bytes, the engine will require as many executions as the number of bytes.A seed may have tens of KBs, and there may be thousands of seeds, therefore the full inference may quickly become a major bottleneck in the fuzzer.On the other hand, precise or improved branch dependency may not significantly boost fuzzer bug-finding performance, thus long inference time may be unjustified.Hence, it is critical to reduce the inference time.
The TaintFAST engine.We use probabilistic group testing [10] to reduce the required number of test executions for potential full inference from () to (log ), where  is the number of input bytes.
Instead of mutating each byte individually followed by program execution (and subsequent FTI check for each branch condition if any of its variables has changed), we simultaneously mutate multiple bytes, and then execute the program with the FTI check.We choose the mutation positions non-adaptively, according only to the value of .This assures that dependency for many branches can be processed simultaneously.
Consider the code fragment at Figure 1 (here  = 1024).We begin the inference by constructing 1024-bit binary vectors   , where each bit corresponds to one of the input bytes.A bit at position  is set iff the input byte  is mutated (i.e.assigned a value other than the value that has in the seed).Once   is built, we execute the program on the new input (that corresponds to   ) and for each branch check if any of its variables changed value (in comparison to the values produced during the execution of the original seed).If so, we can conclude that the branch depends on some of the mutated bytes determined by   .Note, in all prior works, the vectors   had a single set bit (only one mutated byte).As such, the inference is immediate, but slow.On the other hand, we use vectors with 1024  2 = 512 set bits and select 2 • log 2 1024 = 20 such vectors.Vectors  2• ,  2•+1 have repeatedly 2  set bits, followed by 2  unset bits, but with different starts.For instances, the partial values of the first 5 vectors   are given below on the right.
We execute the resulting 20 inputs and for each branch build 20-bit binary vector  .The bit  in  is set if any of the branch values changed after executing the input that corresponds to   .For instance, for the branch at line 6 of Figure 1,  = 10100110100101101010.Finally, we decode  to infer the dependency.To do so, we initialize 1024-bit vector  that will hold the dependency of the branch on input bytes-bit  is set if the branch depends on the input byte .
We set all bits of , i.e. we start by guessing full dependency on all inputs.Then we remove the wrong guesses according to  .For each unset bit  in  (i.e. the branch value did not change when we mutated bytes   ), we unset all bits in  that are set in   (i.e. the branch does not depend on any of the mutated bytes   ).
After processing all unset bits of  , the vector  will have set bits that correspond to potential dependent input bytes.Theoretically, there may be under and over-taint, according to the following information-theoretic argument:  has 20 bits of entropy and thus it can encode at most 2 20 dependencies, whereas a branch may depend on any of the 1024 input bytes and thus it can have 2 1024 different dependencies.In practice, however, it is reasonable to assume that most of the branches depend only on a few input bytes 2 , and in such a case the inference is more accurate.For branches that depend on a single byte, the correctness of the inference follows immediately from group testing theory 3 .For instance, the branch at line 6 of Fig 1 will have correctly inferred dependency only on byte x[100].For branches that depend on a few bytes, we can reduce (or entirely prevent) over-taint by repeating the original procedure while permuting the vectors   .In such a case, each repeated inference will suggest different candidates, except the truly dependent bytes that will be suggested by all procedures.These input bytes then can be detected by taking intersection of all the suggested candidates.For instance, for the branch at line 8 (that actually depends on 8 bytes), a single execution of the procedure will return 16 byte candidates.By repeating once the procedure with randomly permuted positions of   , with high probability only the 8 actual candidates will remain.
The above inference procedure makes the implicit assumption that same branches are observed across different executions.Otherwise, if a branch is not observed during some of the executions, then the corresponding bit in  will be undefined, thus no dependency information about the branch will be inferred from that execution.For some branches the assumption always holds (e.g. for branches at lines 6,8 in Figure 1).For other branches, the assumption holds only with some probability that depends on their branch conditions.For instance, the branch at line 12 may not be seen if the branch at line 10 is inverted, thus any of the 20 bits of  may be undefined with a probability of 200  2 32 .In general, for any branch that lies below some preceding branches, the probability that bits in  will be defined is equivalent to the probability that none of the above branches will inverted by the mutations 4 .As a rule of thumb, the deeper the branch and the easier to invert the preceding branches are, the harder will be to infer the correct dependency.To infer deeper branches, we introduce a modification based on forced execution.We instrument the code so the executions at each branch will take a predefined control-flow edge, rather than decide on the edge according to the value of the branch condition.This guarantees that the target branches seen during the execution of the original seed file (used as a baseline for mutation), will be seen at executions of all subsequent inputs produced by mutating the original seed.We perform forced execution dynamically, with the same statically instrumented program, working in two modes.In the first mode, the program is executed normally, and a trace of all branches and their condition values is stored.In the second mode, during execution as the branches emerge, their condition values are changed to the stored values, thus the execution takes the same trace as before.No other variables aside from the condition values are changed.Note that our procedure aims to infer taint dependencies fast and optimistically; we refer readers to Section 4.7 for a discussion on these aspects.

Solving System of Intervals
It was noted in RedQueen [2], that when branches depend trivially on input bytes (so-called direct copies of bytes) and the branch condition is in the form of equality (either = or ≠), then such branches can be solved trivially.For instance, the branch at line 1 of Figure ??, depends trivially on the byte x[0] and its condition can be satisfied by assigning  [0] = 5 (or inverted by assigning  [0] ≠ 5).Thus it is easy to satisfy or invert such branches, as long as the dependency is correctly inferred and the branch condition is equality.Similar reasoning, however, can be applied when the condition is in the form of inequality over integers.Consider the branch at line 3 of Figure ??, that depends trivially on the input byte  [1].From the type of inequality (which can be obtained from the instruction code of the branch), and the correct dependency on the input byte  [1] and the constant 100, we can deduce the branch form  [1] < 100, and then either satisfy it resulting in  [1] ∈ [0, 99], or invert it, resulting in  [1] ∈ [100, 255].In short, we can represent the solution in the form of integer intervals for that particular input byte.
Often to satisfy/invert a branch we need to take into account not one, but several conditions that correspond to some of the branches that have common variables with the target branch.For instance, to satisfy the branch at line 7, we have two inequalities and thus two intervals:  [2] ∈ [0, 200] corresponding to target branch at line 7 and  [2] ∈ [11,255] corresponding to branch at line 5.Both share the same input variable  [2] with the target branch.A solution ( [2] ∈ [11,200]) exists because the intersection of the intervals is not empty.
In general, Sivo builds a system of such constraints starting from the target branch, by adding gradually preceding branches that have common input variables with the target branch.Each branch (in)equality is solved independently immediately, resulting in one or two intervals (two intervals only when solving  ≠ , i.e.  ∈ [0,  − 1] ∪ [ + 1, ]), and then intersection is found with the previous set of intervals corresponding to those particular input bytes.Keeping intervals sorted assures that the intersection will be found fast.Also, each individual intersection can increase the number of intervals at most by 4. Thus the whole procedure is linear in the number of branches along the executed path.As a result, we can efficiently solve these type of constraints and, thus, satisfy or invert branches that depend trivially on input bytes.
Even when some of the preceding branches do not depend trivially on input bytes, solving the constraints for the remaining branches gives an advantage in inverting the target branch.In such a case, we repeatedly sample solutions from the solved constraints and expect that the non-inverted branch constraints will be satisfied by chance.As sampling from the system requires constant time (after solving it), the complexity of branch inversion is reduced only to that of satisfying non-trivially dependent branches.For instance, to reach line 10, we first solve the lines 5, 7 to obtain  [2] ∈ [11,200], and then keep sampling  [2] from this interval and hope to satisfy the branch at line 9 by chance.

More Accurate Coverage
AFL uses a simple and an elegant method to record the edges and their counts by using an array showmap.First, it instruments all basic blocks   of a program by assigning them a unique random label   .Then, during the execution of the program on a seed, as any two adjacent basic blocks   ,   are processed, it computes a hash of the edge (  ,   ) as  = (  ≪ 1) ⊕  and performs showmap[E]++.New coverage is observed if the value ⌊log 2 showmap[E]⌋ of a nonzero entry showmap[E] has not been seen before.If so, AFL updates its coverage information to include the new value, which we will refer to as the logarithmic count.
Prevent colliding edge hashes.CollAFL [13] points out that when the number of edges is high, their hashes will start to collide due to birthday paradox, and showmap will not be able to signal all distinct edges.Therefore, a fuzzer will fail to detect some of the coverage.We propose a simple solution to the collision problem.Instead of assigning only one label   to each basic block   , we assign several labels  1  , . . .,    , but use only one of them during an execution.The index of the used label is switched occasionally for all blocks simultaneously.The switch assures that with a high chance, each edge will not collide with any other edge at least for some of the indices.The number of labels required to guarantee that all edges will be unique with a high chance at some switch depends on the number of edges.Due to space restrictions we omit the combinatorial analysis.We provide a combinatorial analysis in the Appendix A on this number.In our actual implementation the size of the showmap is 2 16 and we use  = 4 labels per basic block -on average this allows around 8,000 edges to be mapped uniquely (and even 20,000 with less than 100 collisions), which is sufficiently high quantity for most of the programs considered in our experiments.By default, the index is switched once every 20 minutes.
Improve compressed edge counts.The logarithmic count helps to reduce storing all possible edge counts, but it may also implicitly hinder achieving better coverage.This is because certain important count statistics that have the same logarithmic count as previously observed during fuzzing might be discarded.For instance, if the for loop in Figure 3 gets executed 13 times, then AFL will detect this as a new logarithmic count of ⌊log 2 13⌋ = 3, it will update the coverage, save the seed in the pool, and later when processing this seed, the code block  1() will be executed as soon as the condition 1 holds.On the other hand, afterwards if the for loop gets executed 14 times, then the same logarithmic count ⌊log 2 14⌋ = 3 is achieved, thus the new seed will not be stored, therefore the chance of executing the code block  2() is much lower.In other words, to reach  2(), simultaneously the for loop needs to be executed 14 times and 2 condition needs to hold.Hence,  1() and  2() cannot be reached with the same ease despite having similar conditional dependency, only because of AFL's logarithmic count mechanism.
To avoid this issue, we propose flushing the coverage information periodically.More precisely, periodically we store the current coverage information, then reset it to zero, and during some time generate new coverage from scratch.After exhausting the time budget on new coverage, we keep only the seeds that increase the stored coverage, and continue the fuzzing with the accumulated coverage.

Design of the Whole Fuzzer Sivo
Sivo implements all the refinements mentioned so far.It uses the standard grey-box approach of processing seeds iteratively.In each iteration, it selects a seed, mutates it to obtain new seeds, and stores those that increase coverage.In Sivo (refer to the pseudo-code in Algorithm 1), the seed selection is optimized: first with MAB the currently best class and best criterion are selected, and then a seed is sampled from the pool according to the chosen class and criterion.Afterwards, the fuzzer with the help of MAB decides on the currently optimal fuzzing strategy, either vanilla (apply mutations that do not require dependency information) or data-flow (require dependency).If latter, Sivo first infers the dependency (as a combination of FTI and TaintFAST).Then, according to the chosen fuzzing strategy the fuzzer again uses MAB to select one optimal mutation strategy.The vanilla

Implementation
We implement Sivo in C++ with around 20,000 lines of code.All of the code is written from scratch, with the exception of around 500 lines related to so-called fork server5 , which is based on AFL's code.Sivo uses static instrumentation to obtain the data about the coverage and branches of the programs.More precisely, three of the four refinements require additional instrumentation of programs to implement their functionality: the accurate code coverage refinement requires lighter instrumentation, whereas TaintFAST and system solver refinements require heavier instrumentation.For this purpose, we compile a program source code with two different LLVM passes: one that utilizes relatively lightweight instrumentation used only for code coverage, and one with heavier instrumentation that provides information about the branches as well.The compilation is done by a modified version of Clang (in a fashion similar to AFL's afl-clang++).The overheads introduced by the instrumentation are given in Appendix C.

Limitations of Sivo
The taint engine TaintFAST relies on forced execution, which by definition is not sound, thus the inference is approximate.It means, the engine may introduce false positives/negatives, i.e. it may suggest dependencies of branches on incorrect input bytes.This, however, is not a real concern in fuzzing because later it leads solely to mutating incorrect input bytes, hence potentially it has only impact on efficiency 6 , and does not affect the correctness of the fuzzer in any other way.The accuracy of the engine varies between programs.
In certain cases (of particular traces), the forced execution crashes the program, and thus the inference has lower accuracy (because the corresponding  bit is undefined).In our actual implementation of TaintFAST, we prevent some of the crashes by detecting with binary search sequences of input bytes that lead to crashes, and later eliminate them from consideration.The refinement based on system of intervals is neither sound nor complete.Problems may appear due to incorrect inference of the intervals as well as due to the fact that the system describes only a partial dependency of the target branch on input bytes, i.e. includes only branches that can be presented in the form of integer intervals.Therefore, one may not assume that all of the branches can be properly inverted using this refinement.
The remaining two refinements do not have apparent limitations, aside from affecting the efficiency in some cases.

EVALUATION
We show that Sivo performs well on multiple benchmarks according to the standard fuzzing metrics such as code coverage (Section 5.2) and found vulnerabilities (Section 5.3).We evaluate the performance of each refinement in Section 5.4.

Experimental setup
Experiment environment.For all experiments we use the same box with Ubuntu Desktop 16.04, two Intel Xeon E5-2680v2 CPUs @2.80GHz with 40 cores, 64GB DDR3 RAM @1866MHz and SSD storage.All fuzzers are tested on the same programs, provided with only one initial seed, randomly selected from samples available on the internet.To keep experiments computationally reasonable, while still providing a fair comparison of all considered fuzzers, we performed a two-round tournament-like assessment.In the first round, all fuzzers had been appraised over the course of 12 hours.This interval is chosen based on Google's FuzzBench periodical reports, which shows that 12 hours is sufficient to decide the ranking of the fuzzers usually [15].The top 3 fuzzers from the first round that perform the best on average over all evaluated programs progress to the second round, in which they are run for 48 hours.
Baseline fuzzers.We evaluate Sivo in relation to 11 notable greybox fuzzers.In addition to AFL [37], we take the extended and improved AFL family: AFLFast [4], FairFuzz [20], LAF-Intel [1], MOpt [22] and EcoFuzz [35].Moreover, we include Angora [6] for its unique mutation techniques, Ankou [23] for its fitness function, and a few fuzzers that perform well on Google's OSS-Fuzz [15] platform such as Honggfuzz [32], AFL++ and AFL++_mmopt [11] (version 2.67c).To prevent unfair comparison, we omit from our experiments two categories of fuzzers.First, we exclude popular grey-box fuzzers that do not have an officially available implementation, such as CollAFL [13] and GreyOne [12].We did not implement these fuzzers from scratch due to the complexity of such a task (e.g. the authors of GreyOne report 20K LoC implementation).Second, we exclude hybrid fuzzers because their approach is basically orthogonal to traditional grey-box fuzzers and thus they can be combined.For instance, the well-known hybrid fuzzer QSYM [36] inverts branches with symbolic execution and is built on top of AFL.With minor modification, QSYM could be built on top of Sivo instead of AFL, and this hybrid may lead to an even better performance.
Programs.Our choice of programs was influenced by multiple factors, such as implementation robustness, diversity of functionality, and previous analysis in other works.Our main goal of the evaluation is comparison of fuzzers according to a few criteria (including discovery of bugs), thus we use versions of programs that have already been tested in prior fuzzer evaluations on similar criteria.Due to limited resources, we did not run the fuzzers on the latest versions to look for actual CVEs.Our final selection consists of 27 programs including: binutils (e.g.: readelf, nm), parsers and parser generators (e.g.: bson_to_json@libbson, bison), a wide variety of analysis tools (e.g.: tcpdump, exiv2, cflow, sndfile-info@libsndfile), image processors (e.g.: img2txt), assemblers and compilers (e.g.: nasm, tic@libncurses), compression tools (e.g.: djpeg, bsdtar), the LAVA-M dataset [9], etc.A complete list of the programs, their version under test, and their input parameters is available in Appendix D.
Efficiency metrics.We use two metrics to compare the efficiency of fuzzers: edge coverage and the number of found vulnerabilities.To determine the coverage, we use the logarithmic edge count because this number is the objective in the fuzzing routines of the AFL family of fuzzers 7 .For completeness, in Appendix E we provide as well a simple edge count.The coverage data may be inaccurate due to colliding edges when measured as in AFL.To rectify this, we instrument the target programs to store the full execution traces and compute precise coverage data.In Appendix E we also show the imprecise coverage, measured as in AFL.
We measure the total number of distinct vulnerabilities found by each fuzzer.For this purpose, first we confirm the reported vulnerabilities, i.e. we take all seeds generated by a fuzzer and keep those that trigger a crash by any of the sanitizers ASAN [29], UBSAN [27] and Valgrind [24]. 8Then, for each kept seed, we record the source line in the program that triggers a crash (according to the appropriate sanitizer).Finally, we count the number of found distinct source lines for the crash point.
Remark.Some fuzzers [12,13] report so-called path coverage measured as the number of produced seeds.We find this metric rather inaccurate.For instance, a seed that inverts  branches often is equivalent to  seeds that invert a single branch.However, the above path coverage will consider the former as a single path, while the latter as  paths (thus a fuzzer can easily manipulate this metric).In addition, the instability of this metric can be manifested by suppling to AFL all seeds found previously by AFL -often, the AFL will keep much smaller number of seeds (in some cases only 50% of the supplied seeds), because the path coverage metric is dependent on the order of processing the seeds.Therefore, seed count should not be considered a reliable, standalone metric.

Coverage
We run all 12 fuzzers for 12 hours each, and record the coverage discovered during the fuzzing.The results are reported in Figure 4. We can see that at the end, Sivo provides the best coverage for 25 out of the 27 programs.On average Sivo produces 11.8% higher coverage than the next best fuzzer when analyzed individually for each program.In direct comparison to fuzzers, Sivo outperforms the next best fuzzer MOpt by 20.2%, and EcoFuzz by 30.6%, and outperforms the baseline AFL by producing 180% increase in coverage.For most of the programs, our fuzzer very soon establishes as the top fuzzer.In fact, the time frame needed to create advantage is so short, that the improved coverage refinement of Section 4.4 has still not kicked in, whereas the MAB optimization of Section 4.1 had barely any time to feed enough data back to the MABs.Thus, arguably the early advantage of Sivo is achieved due to the parametrize paradigm, as well as the remaining two refinements (TaintFAST and the system solver method).
We test the top three fuzzers Sivo, MOpt, and EcoFuzz on 48hour runs and report the obtained coverage in Figure 5.We see that Sivo is the top fuzzer for 24 of the programs, with 13.4% coverage increase on average with respect to the next best fuzzer for each program, and 15.7%, and 28.1% with respect to MOpt and EcoFuzz.In comparison to the 12-hour runs, the other two fuzzers managed to reduce slightly the coverage gap, but this is expected (given sufficient time all fuzzers will converge).However, the gap is still significant and Sivo provides consistently better coverage.

Vulnerabilities
We summarize the number of vulnerabilities found by each fuzzer on 25 programs during the 12-hour runs in Table 2. (We removed two programs from Table 2, as none of the fuzzers finds vulnerabilities for them.)Out of 25 evaluated programs, Sivo is able to find the maximal number of vulnerabilities in 18 programs (72%).For comparison, the next best fuzzer MOpt holds top positions in 11 programs (44%) in terms of vulnerability discovery.This indicates that Sivo is significantly more efficient at finding vulnerabilities than the remaining candidate fuzzers as well.However, Sivo achieves less top positions in discovery of vulnerabilities compared to code coverage, but this is not unusual as the objective of Sivo is code coverage, and the correlation between produced coverage and found vulnerabilities is not necessarily strong [17,18].
We also measure and report in Table 2 the number of vulnerabilities unique to each fuzzer, i.e. bugs that are found only by one fuzzer, and not by any other.This metric signals distinctiveness of each fuzzer-the greater the number of unique vulnerabilities, the more distinct the fuzzer is on vulnerability detection.Out of 25 programs, Sivo discovers at least one unique vulnerability in 11 programs.In total, Sivo finds 31 unique vulnerabilities, while the next best fuzzer is Honggfuzz [32] with 21 vulnerabilities.

Performance of Refinements
We evaluate the four refinements individually, in terms of their impact and necessity.To assess the impact of a refinement, i.e. to estimate how much it helps to advance the fuzzer, we compare the performance of the baseline version of Sivo (where all four refinements have been removed) to the baseline version with the one refinement added on.On the other hand, to assess the necessity of a refinement, i.e. to estimate how irreplaceable in comparison to the other three refinements it is, we compare the full version of Sivo to the version with a single refinement removed.We note that all refinements aside for the Parametrize-Optimize strategy, can be assessed reasonably well because it is easy to switch them on or off in the fuzzer.The same holds for Optimize, but not for Parametrize.As Sivo is built from scratch with many new fuzzing subroutines that are not necessarily present in AFL, it is not clear which fuzzing subroutines and which of their variations need to be removed in the baseline.Therefore, we only assess Optimize, and consider Table 2: The number of found vulnerabilities.The number of unique vulnerabilities (when non-zero) are reported after "/"."-" indicates failure to instrument/run the program."#Vuln.", "#Vuln.uniq" give the number of all and the number of unique vulnerabilities, respectively."#Top vuln."shows the number of programs for which the fuzzer finds the maximal number of vulnerabilities."#Prog.uniq" shows the number of programs for which the fuzzer finds some unique vulnerability.Parametrize to be part of the baseline.We fuzz the 25 programs (on which Sivo outperformed all other 11 fuzzers) for 12 hours, and compare the found coverage to the coverage produced by the complete version of Sivo.In Table 3, we provide the comparisons (as a percentage drop of the coverage) of the versions.We also give the data about the performance of the best non-Sivo fuzzer for each program (see the column Best NoneSivo).In the last row of the table we summarize the number of programs on which the considered version of the fuzzer is able to out-perform all of the remaining 11 none-Sivo fuzzers (for reference, for Sivo this number is 25).A few observations are evident from the Table 3:

Application
• Parametrize alone is valuable.The baseline SivoBase, i.e. the version of the fuzzer that does not have any of the four refinements aside from Parametrize, already performs well.It is able to achieve the most coverage for 9 of the 25 considered programs.Hence, just by introducing new fuzzing subroutines and their variations, the fuzzer is able to outperform in terms of coverage the other 11 fuzzers on 36% of the fuzzed programs.
• Optimize has a strong impact.Among the four refinements, Optimize has the strongest impact.It helps the baseline fuzzer to add 10 top stops resulting in 19 top positions (refer to SivoBase+Opt column in Table 3), thus leading to most coverage in comparison to the other 11 none-Sivo fuzzers on 76% of the programs.On the other hand, Sivo without Optimize (refer to Sivo-Opt), loses 14 top positions, i.e. the fuzzer loses the top spot for 56% of the programs.
Moreover, this refinement effects all of the fuzzed programs, with the exception of a few.The effect is significant-the coverage drop when this refinement is not present is at least 10% and sometimes more than 30%.• TaintFAST has a moderate to low impact.This refinement, denoted as FI in the Table 3, helps the baseline fuzzer to add two top spots.On the other hand, Sivo without TaintFAST, i.e. with only the FTI engine present, loses three top spots.TaintFAST has a strong variance (refer to the Sivo-FI column) in terms of providing additional coverage and most fuzzed programs either benefit largely, or have no benefit at all.This is not unexpected, because the pact.This refinement does not have a strong impact on providing top positions (no jumps after adding it to the baseline, and lost 4 positions when removing it from Sivo), but it gives well balanced improvements in coverage to the fuzzer.

The Cause Of Observed Benefits
It is important to understand and explain why certain fuzzing techniques (or in our case refinements) work well.In Section 5.4 we speculate about the type of programs that can be fuzzed well with some of the refinements.Showing this conclusively, however, is difficult.Table 3 shows the percentage drop in coverage observe, per application, obtained by adding and removing one-by-one each of our proposed refinements.However, attributing the cause of improved performance to individual refinements based on such coarse empirical data could be misleading.This is because we are measuring the joint outcome of mutually-dependent fuzzing strategies.We cannot single out the cause of an observed outcome and attribute it to each strategy, since the strategies mutate the internal state that others use.We thus only coarsely estimate their impact via our empirical findings and speculate that these results extend to other programs.

RELATED WORK
Grey-box fuzzers, starting from the baseline AFL [37], have been the backbone of modern, large-scale testing efforts.The AFL-family of fuzzers (e.g.AFLGo [3], AFLFast [4], LAF-Intel [1], MOpt [22], and MTFuzz [30]) improve upon different aspects of the baseline fuzzer.For instance, instead of randomly selecting mutation strategy, MOpt [22] uses particle swarm optimization to guide the selection.MTFuzz [30] trains a multiple-task neural network to infer the relationship between program inputs and different kinds of edge coverage to guide input mutation.Similarly, for the seed selection, AFLFast [4] prioritizes seeds that exercise low-probability paths, CollAFL [13] prioritizes seeds that have a lot of not-yet inverted branches, and EcoFuzz [35] uses multi-armed bandits to guide the seed selection.Common feature for all current fuzzers from the AFLfamily is that they optimize at most one of the fuzzing subroutine 9 .
In contrast, Sivo first parameterizes all aspects, i.e. introduces many variations of the fuzzing subroutines, and then tries to optimize all the selection of parameters.Even the seed selection subroutines of EcoFuzz and Sivo differ, despite both using multi-armed bandits: EcoFuzz utilizes MAB to select candidate seed from the pool, whereas Sivo uses MAB to decide on the selection criterion and the pool of seeds.Several grey-box fuzzers deploy data-flow fuzzing, i.e. infer dependency of branches on input bytes and use it to accomplish more targeted branch inversion.VUzzer [26], Angora [6], BuzzFuzz [14] and Matryoshka [7] use a classical dynamic taint inference engine (i.e.track taint propagation) to infer dependencies.Fairfuzz [20], ProFuzzer [34], and Eclipser [8] use lighter engine and infer partial dependency by monitoring the execution traces of the seeds.RedQueen [2] and Steelix [21] can infer only dependencies based on exact (often called direct) copies of input bytes in the branches, by mutating individual bytes.Among grey boxes, the best inference in terms of speed, type, and accuracy is achieved by GreyOne [12].Its engine called FTI is based on mutation of individual bytes (thus fast because it does not track taint propagation) and can detect dependencies of any type (not only direct copies of input bytes).FTI mutates bytes one by one and checks on changes in variables involved in branch conditions (thus accurate because it does not need for the whole branch to flip, only some of its variables).Sivo inference engine TaintFAST improves upon FTI and provides exponential decrease in the number of executions required to infer the full dependency, at a possible expense of accuracy.Instead of testing bytes one by one, TaintFAST uses probabilistic group testing and reduces the number of executions.
Data-flow grey boxes accomplish targeted branch inversion by randomly mutating the dependent bytes.A few fuzzers deploy more advanced strategies: Angora [6] uses gradient-descent based mutation, Eclipser [8] can invert efficiently branches that are linear or monotonic, and GreyOne [12] inverts branches by gradually reducing the distance between the actual and expected value in the branch condition.Some fuzzers, such as RedQueen and Steelix invert branches by solving directly the branch conditions based on equality (called magic bytes).Sivo can solve more complex branch inversion conditions that involve inequalities, without the use of SAT/SMT solvers.On the other hand, white boxes such as KLEE [5], and hybrid fuzzers such as Driller [31] and QSYM [36], use symbolic execution that relies on SMT solvers (thus it may be slow) to perform inversions in even more complex branches.The hybrid fuzzer Pangolin [16] uses linear approximations of branch constraints (thus more general than our intervals) called polyhedral path abstraction and later it utilizes them to efficiently sample solutions that satisfy path constraints.To infer the (more universal) linear approximations, Pangolin uses a method based on SMT solver.On the other hand, Sivo infers the (less universal) intervals with a simpler method.
The AFL-family of fuzzers as well as many other grey boxes track edge coverage.In addition, the AFL-family uses bucketization, i.e. besides edges, they track the counts of edges and group them in buckets that have ranges of powers of two.For practical purposes AFL does not record the precise edges (this will require storing whole execution traces which may be slow), but rather it works with hashes of edges (which is quite fast).The process of hashing may introduce collisions as noted by CollAFL [13].To avoid such collisions, CollAFL proposes during compilation to choose the free parameters of the hashing function non-randomly, and according to a specific strategy.AFL++ [11] uses a similar idea and provides an open-source implementation based on link-time instrumentation.In addition, AFL++,LibFuzzer [28], and Honggfuzz [32] use so-called sanitizer coverage available in LLVM starting from version 11 to prevent collisions by assigning the free parameters during runtime.On the other hand, Sivo solution is to switch between different hashing functions during the fuzzing (i.e. at runtime).Instead of tracking edge coverage, a few fuzzers such as Honggfuzz [32], VUzzer [26] and LibFuzzer [28] track block coverage.Moreover, the grey-box fuzzer TortoiseFuzz [33] uses alternative coverage measurement metric (assigns different weights to edges based on their potential security impact) to prioritize testcases, and achieves higher rate of vulnerability detection.

CONCLUSION
We have presented four refinements for grey boxes that boost different fuzzing stages.First, we have shown fast taint engine that requires only logarithmic number of tests in the number of input bytes to infer the dependencies of branches on inputs.Second, we have provided an efficient method for inverting branches when they depend trivially on input bytes and their conditions are based on integer inequalities.Third, we have proposed an improved coverage tracking methods that are easy to implement.Finally, we have show the parametrize-optimize paradigm that allows fuzzers to be more flexible in adopting to target programs and thus to effectively increase coverage.We have implemented the refinements in a fuzzer called Sivo.In comparison to 11 other popular grey-box fuzzers, Sivo scores highest with regards to coverage and to number of found vulnerabilities.

A ANALYSIS OF MULTI-LABEL BLOCK ASSIGNMENT
When each basic block   is assigned an -bit random label   , by the birthday paradox collisions on hashed edges   ≪ ⊕  will appear once the number of edges 2  reaches around 2  ≥ 2  2 .Roughly, the expected number of collisions is around 2  • 2  /2  = 2 2 − .If we assign an additional (second) label to each basic block, then there will be roughly 2 2 − colliding edge hashes on the second label as well.However, among the colliding hash edges on the first and the second label, there will be only 2 2 − • 2 2 − /2  = 2 3 −2 common edges.Thus, if 3 − 2 < 0, i.e.  < 2 3 , then on average there will not be any common edges, and hence for at least one of the assignments (either the first or the second), each edge will have unique hash.Similar analysis applies for larger number of labels.In general, if each block is assigned  labels of  random bits each, and the number of edges is smaller than 2  +1  , then on average each hashed edge will be unique at least for one of the labels.
We can also obtain a strict analysis is of the collision-free multilabel assignment.To do so, we compute the probability that each additional hashed edge does not collide with any other previous hashed edge at least on one of the labels.If there are already  such hashed edges, then the probability that the next  + 1 will also be good is 1 − (  2  )  .This is computed from the opposite event (that the  + 1-th hash collides on all  labels with some of the  hashes).Thus the probability that all 2  edges will be unique on at least one of the  labels is: We could not reduce the above formula (1) further (unlike the case of  = 1 where Sterling approximation can be used), however we can calculate numerical values of the probability for different values of the parameters , , .In the case of Sivo,  = 16,  = 4, and therefore, even when there are  = 8000 edges, the probability that all of them will be unique on some label is around 0.701.For larger range of computed probabilities refer to Table 4. Table 4: Calculated values of the probability (1) / 3000 4000 5000 6000 7000 8000 9000 10000 1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 2 0.123 0.007 0.000 0.000 0.000 0.000 0.000 0.000

C INSTRUMENTATION OVERHEAD
Each of the 27 evaluated programs is instrumented and compiled with two different passes: lighter and heavier.We measured experimentally the usage and the overhead of the two instrumentations averaged over all 27 tested programs, and provide the data in Table 5.

E ALTERNATIVE COVERAGE
In Figure 6 we show the simple edge count for fuzzers on 12-hour runs.Sivo is top fuzzer for 21 of the programs in the 12-hour runs.In comparison to Figure 4, Sivo loses 4 top spots, 2 of which are with a margin of less than 1%.In Figure 7 we show the coverage measured imprecisely as in AFL for fuzzers on 12-hour runs.Sivo is top fuzzer for 17 out of 24 programs (we had problems compiling 3 of the programs).In comparison to Figure 4, Sivo loses 6 top spots.This is a result of colliding edge hashes and the inability of AFL coverage engine to detect and handle such collisions.

Figure 2 :
Figure 2: Branches and systems of intervals.

Figure 3 :
Figure 3: The effects of AFL's edge count compression.

Figure 6 :
Figure 6: Edge count for all fuzzers during 12 hours of fuzzing.

Figure 7 :
Figure 7: AFL coverage for all fuzzers during 12 hours of fuzzing.

Table 1 :
Fuzzer procedures and their variations sub-versions or parameters which are also chosen with MAB.For instance, mutation of random bytes supports two versions: it can use heuristics to determine the positions of the bytes (choice 1), or use random byte positions (choice 2).If choice 1, then it needs

Table 3 :
Percentage drop in coverage of fuzzers in comparison to Sivo.When no drop occurs, the cells are empty.