Quantitative Mitigation of Timing Side Channels

Timing side channels pose a significant threat to the security and privacy of software applications. We propose an approach for mitigating this problem by decreasing the strength of the side channels as measured by entropy-based objectives, such as min-guess entropy. Our goal is to minimize the information leaks while guaranteeing a user-specified maximal acceptable performance overhead. We dub the decision version of this problem Shannon mitigation, and consider two variants, deterministic and stochastic. First, we show that the deterministic variant is NP-hard. However, we give a polynomial-time algorithm that finds an optimal solution from a restricted set. Second, for the stochastic variant, we develop an algorithm that uses optimization techniques specific to the entropy-based objective used. For instance, for min-guess entropy, we use mixed integer-linear programming. We apply the algorithm to a threat model where the attacker gets to make functional observations, that is, where she observes the running time of the program for the same secret value combined with different public input values. Existing mitigation approaches do not give confidentiality or performance guarantees for this threat model. We evaluate our tool SCHMIT on a number of micro-benchmarks and real-world applications with different entropy-based objectives. In contrast to the existing mitigation approaches, we show that in the functional-observation threat model, SCHMIT is scalable and able to maximize confidentiality under the performance overhead bound.


Introduction
Information leaks through timing side channels remain a challenging problem [38,32,27,17,14,51,40]. A program leaks secret information through timing side channels if an attacker can deduce secret values (or their properties) by observing response times. We consider the problem of mitigating timing side channels. Unlike elimination techniques [7,34,50] that aim to completely remove timing leaks without considering the performance penalty, the goal of mitigation techniques [29,10,52] is to weaken the leaks, while keeping the penalty low.
We define the Shannon mitigation problem that decides whether there is a mitigation policy to achieve a lower bound on a given entropy-based security measure while respecting an upper bound on the performance overhead. Consider an example where the program-under-analysis has a secret variable with seven possible values, and has three different timing behaviors, each forming a cluster of secret values. It takes 1 second if the secret value is 1, it takes 5 seconds if the secret is between 2 and 5, and it takes 10 seconds if the secret value is 6 or 7. The entropy-based measure quantifies the remaining uncertainty about the secret after timing observations. Min-guess entropy [28,45,11] for this program is 1, because if the observed execution time is 1 second, the attacker guesses the secret in one try. A mitigation policy involves merging some timing clusters by introducing delays. A good solution might be to introduce a 9-second delay if the secret is 1, which merges two timing clusters. However, this might be disallowed by the budget on the performance overhead. In that case, another solution must be found, such as introducing a 4-second delay when the secret is 1.
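The seven-value example can be checked numerically. The sketch below is a minimal illustration, assuming a uniform prior over the secrets and the usual closed form for min-guess entropy (smallest cluster size plus one, divided by two); the names `TIME`, `mitigate`, and `mean_overhead` are hypothetical helpers, and the overhead definition (total added delay over total running time) is an assumption made only for this example.

```python
from collections import Counter

# Running time in seconds for each of the seven secret values in the example.
TIME = {1: 1, 2: 5, 3: 5, 4: 5, 5: 5, 6: 10, 7: 10}


def min_guess_entropy(observed_time):
    """(smallest timing cluster + 1) / 2, assuming uniformly distributed secrets."""
    cluster_sizes = Counter(observed_time.values())
    return (min(cluster_sizes.values()) + 1) / 2


def mitigate(times, delays):
    """Apply a per-secret delay; `delays` maps a secret to added seconds."""
    return {secret: t + delays.get(secret, 0) for secret, t in times.items()}


def mean_overhead(times, delays):
    """Total added delay relative to total running time (assumed overhead metric)."""
    return sum(delays.values()) / sum(times.values())


for delays in ({}, {1: 9}, {1: 4}):
    mitigated = mitigate(TIME, delays)
    print(delays, min_guess_entropy(mitigated), round(mean_overhead(TIME, delays), 3))
```

Under these assumptions the 9-second delay yields the larger min-guess entropy but costs more than twice the overhead of the 4-second delay, which is the trade-off the decision problem captures.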
We develop two variants of the Shannon mitigation problem: deterministic and stochastic. The mitigation policy of the deterministic variant requires us to move all secret values associated with an observation to another observation, while the policy of the stochastic variant allows us to move only a portion of the secret values in an observation to another one. We show that the deterministic variant of the Shannon mitigation problem is intractable and propose a dynamic programming algorithm to approximate the optimal solution for the problem by searching through a restricted set of solutions. We develop an algorithm that reduces the problem in the stochastic variant to a well-known optimization problem that depends on the entropy-based measure. For instance, with min-guess entropy, the optimization problem is mixed integer-linear programming.
We consider a threat model where an attacker knows the public inputs (known-message attacks [29]), and furthermore, where the public input changes much more often than the secret inputs (for instance, secrets such as bank account numbers do not change often). As a result, for each secret, the attacker observes a timing function of the public inputs. We call this model functional observations of timing side channels.
We develop our tool Schmit, which has three components: side channel discovery [49], search for the mitigation policy, and policy enforcement. The side channel discovery builds the functional observations [49] and measures the entropy of the secret set after the observations. The mitigation policy component includes the implementation of the dynamic programming and optimization algorithms. The enforcement component is a monitoring system that uses the program internals and functional observations to enforce the policy at runtime. To summarize, we make the following contributions:
- We formalize the Shannon mitigation problem with two variants and show that finding a deterministic mitigation policy is NP-hard.
- We describe two algorithms for synthesizing the mitigation policy: one is based on dynamic programming for the deterministic variant, runs in polynomial time, and yields an approximate solution; the other solves the stochastic variant of the problem with optimization techniques.
- We consider a threat model that results in functional observations. On a set of micro-benchmarks, we show that existing mitigation techniques are not secure and efficient for this threat model.
- We evaluate our approach on five real-world Java applications. We show that Schmit is scalable in synthesizing mitigation policies within a few seconds and significantly improves the security (entropy) of the applications.

Overview
First, we describe the threat model considered in this paper. Second, we describe our approach on a running example. Third, we compare the results of Schmit with the existing mitigation techniques [29,10,52] and show that Schmit achieves the highest entropy (i.e., best mitigation) for all three entropy objectives.

Threat Model. We assume that the attacker has access to the source code and the mitigation model, and she can sample the run-time of the application arbitrarily many times on her own machine. During an attack, she intends to guess a fixed secret of the target machine by observing the mitigated running time. Since we consider attack models where the attacker knows the public inputs and the secret inputs are less volatile than public inputs, her observations are functional observations, where for each secret value, she learns a function from the public inputs to the running time.
Example 2.1. Consider the program shown in Fig 1(a). It takes secret and public values as inputs. The running time depends on the number of set bits in both secret and public inputs. We assume that secret and public inputs can be between 1 and 1023. Fig 1(b) shows the running time of different secret values as timing functions, i.e., functions from the public inputs to the running time.
Side channel discovery. One can use existing tools to find the initial functional observations [49,48]. In Example 2.1, the functional observations are $F = \langle y, 2y, \ldots, 10y \rangle$, where $y$ is a variable whose value is the number of set bits in the public input. The corresponding secret classes after this observation are $S_F = \langle 1_1, 1_2, 1_3, \ldots, 1_{10} \rangle$, where $1_n$ denotes the set of secret values that have $n$ set bits. The sizes of the classes are $B = \{10, 45, 120, 210, 252, 210, 120, 45, 10, 1\}$. We use the $L_1$-norm as the metric to calculate the distance between the functional observations $F$. This distance (penalty) matrix specifies the extra performance overhead to move from one functional observation to another. With the assumption of a uniform distribution over the secret input, Shannon entropy, guessing entropy, and min-guess entropy are 7.3, 90.1, and 1.0, respectively. These entropies are defined in Section 3 and measure the remaining entropy of the secret set after the observations. We aim to maximize the entropy measures, while keeping the performance overhead below a threshold, say 60% for this example.

Mitigation with Schmit. We use our tool Schmit to mitigate the timing leaks of Example 2.1. The mitigation policy for the Shannon entropy objective is shown in Fig 2(a). The policy results in two classes of observations. The policy moves the functional observations $y, 2y, \ldots, 5y$ to $6y$ and all other observations $7y, 8y, 9y$ to $10y$. To enforce this policy, we use a monitoring system at runtime. The monitoring system uses a decision tree model of the initial functional observations. The decision tree model characterizes each functional observation with associated program internals such as method calls or basic block invocations [48,47]. The decision tree model for Example 2.1 is shown in Fig 2(b).
The monitoring system records program internals and matches them with the decision tree model to detect the current functional observation. Then, it adds delays, if necessary, to the execution time in order to enforce the mitigation policy. With this method, the mitigated functional observations are $G = \langle 6y, 10y \rangle$ and the secret classes are $S_G = \langle \{1_1, 1_2, 1_3, 1_4, 1_5, 1_6\}, \{1_7, 1_8, 1_9, 1_{10}\} \rangle$, as shown in Fig 2(c). The performance overhead of this mitigation is 43.1%. The Shannon, guessing, and min-guess entropies have improved to 9.7, 459.6, and 193.5, respectively.

Comparison with state of the art. We compare our mitigation results to the black-box mitigation scheme [10] and bucketing [29].

Black-box double scheme technique. We use the double scheme technique [10] to mitigate the leaks of Example 2.1. This mitigation uses a prediction model to release events at scheduled times. Let us consider the prediction for releasing the event $i$ at the $N$-th epoch with $S(N, i) = \max(inp_i, S(N, i-1)) + p(N)$, where $inp_i$ is the arrival time of the $i$-th request, $S(N, i-1)$ is the prediction for the request $i-1$, and $p(N) = 2^{N-1}$ models the basis for the prediction scheme at the $N$-th epoch. We assume that the requests are of the same type and the sequence of public input requests for each se- .00, 321.5, and 5.5, respectively.

Bucketing. We consider the mitigation approach with buckets [29]. For Example 2.1, if the attacker does not know the public input (unknown-message attacks [29]), the observations are {1.1, 2.1, 3.3, · · · , 9.9, 10.9, · · · , 109.5} as shown in Fig 3(b). We apply the bucketing algorithm in [29] to these observations, and it finds two buckets {37.5, 109.5}, shown with the red lines in Fig 3(b). The bucketing mitigation moves each observation to the closest bucket. Without functional observations, there are 2 classes of observations. However, with functional observations, there are more than 2 observations.
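The prediction rule of the double scheme quoted in this section can be made concrete with a few lines of Python. This is only a sketch of the recurrence $S(N, i) = \max(inp_i, S(N, i-1)) + p(N)$ inside one fixed epoch; the function name `epoch_predictions`, the initial value `previous=0.0`, and the absence of any epoch switching on mispredictions are simplifying assumptions, not the full scheme of [10].

```python
def epoch_predictions(arrivals, epoch, previous=0.0):
    """Scheduled release times for requests arriving at `arrivals` in one epoch.

    Implements S(N, i) = max(inp_i, S(N, i-1)) + p(N) with p(N) = 2**(N-1).
    `previous` stands in for S(N, 0), which is assumed to be 0 here.
    """
    step = 2 ** (epoch - 1)
    predictions = []
    for arrival in arrivals:
        previous = max(arrival, previous) + step
        predictions.append(previous)
    return predictions


# Three requests arriving at times 0, 1 and 10 during epoch N = 2.
print(epoch_predictions([0, 1, 10], epoch=2))
```

Each release time depends only on arrival times and the epoch, not on the secret-dependent running time, which is why this kind of scheme offers no direct handle on the performance overhead.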

Preliminaries
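For a finite set $Q$, we use $|Q|$ for its cardinality. A discrete probability distribution, or just distribution, over a set $Q$ is a function $d : Q \to [0, 1]$ such that $\sum_{q \in Q} d(q) = 1$. We write $D(Q)$ for the set of all distributions over $Q$; a distribution $d \in D(Q)$ is a point distribution if $d(q) = 1$ for some $q \in Q$.

Definition 1 (Timing Model). The timing model of a program $P$ is a tuple $(X, Y, S, \delta)$, where $X = \{x_1, \ldots, x_n\}$ is the set of secret-input variables, $Y = \{y_1, \ldots, y_m\}$ is the set of public-input variables, $S \subseteq \mathbb{R}^n$ is a finite set of secret-inputs, and $\delta : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}_{\ge 0}$ is the execution-time function of the program over the secret and public inputs.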
We assume that the adversary knows the program and wishes to learn the value of the secret input. To do so, for some fixed secret value $s \in S$, the adversary can invoke the program to estimate (to an arbitrary precision) the execution time of the program. If the set of public inputs is empty, i.e., $m = 0$, the adversary can only make scalar observations of the execution time corresponding to a secret value. In the more general setting, however, the adversary can arrange her observations in a functional form by estimating an approximation of the timing function $\delta(s) : \mathbb{R}^m \to \mathbb{R}_{\ge 0}$ of the program.
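A functional observation of the program $P$ for a secret input $s \in S$ is the function $\delta(s) : \mathbb{R}^m \to \mathbb{R}_{\ge 0}$ defined by $p \mapsto \delta(s, p)$. Let $F$ be the finite set of all functional observations of the program $P$. We define an order $\prec$ over the functional observations $F$: for $f, g \in F$, we say that $f \prec g$ if $f(p) \le g(p)$ for all $p \in \mathbb{R}^m$. The set $F$ characterizes an equivalence relation $\equiv_F$, namely secrets with equivalent functional observations, over the set $S$, defined as follows: $s \equiv_F s'$ if and only if $\delta(s) = \delta(s')$. We write $S_F$ for the set of resulting equivalence classes and $S_f$ for the secret set $S \in S_F$ corresponding to the observation $f \in F$.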
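Shannon entropy, guessing entropy, and min-guess entropy are three prevalent information metrics to quantify information leaks in programs. Köpf and Basin [28] characterize expressions for these information-theoretic measures when there is a uniform distribution on $S$.
Assuming a uniform distribution on $S$, the entropies can be characterized as:
1. Shannon entropy: $SE(S|F) = \frac{1}{|S|} \sum_{f \in F} |S_f| \log_2 |S_f|$,
2. Guessing entropy: $GE(S|F) = \frac{1}{2|S|} \sum_{f \in F} |S_f|^2 + \frac{1}{2}$,
3. Min-guess entropy: $mGE(S|F) = \min_{f \in F} \frac{|S_f| + 1}{2}$.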
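The closed forms for a uniform prior, as usually stated following Köpf and Basin [28], can be evaluated on the running example with a short Python sketch. The class sizes are obtained by grouping the secrets 1 to 1023 by their number of set bits; the function names and the exact constants (base-2 logarithm, the additive 1/2 in guessing entropy) are assumptions of this sketch, so the printed values need not agree with the reported ones to every digit.

```python
from collections import Counter
from math import log2


def shannon_entropy(sizes):
    """(1/|S|) * sum of |S_f| * log2|S_f| over all classes."""
    return sum(b * log2(b) for b in sizes) / sum(sizes)


def guessing_entropy(sizes):
    """(1/(2|S|)) * sum of |S_f|^2 over all classes, plus 1/2."""
    return sum(b * b for b in sizes) / (2 * sum(sizes)) + 0.5


def min_guess_entropy(sizes):
    """Smallest value of (|S_f| + 1) / 2 over all classes."""
    return (min(sizes) + 1) / 2


# Secrets 1..1023 grouped by popcount: class n holds the secrets with n set bits.
by_popcount = Counter(bin(secret).count("1") for secret in range(1, 1024))
sizes = [by_popcount[n] for n in sorted(by_popcount)]

print(sizes)
print(round(shannon_entropy(sizes), 2),
      round(guessing_entropy(sizes), 2),
      min_guess_entropy(sizes))
```

The single secret with ten set bits forms a class of size one, which is what pins the min-guess entropy to its lowest possible value before mitigation.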

Shannon Mitigation Problem
Our goal is to mitigate the information leakage due to the timing side channels by adding synthetic delays to the program. An aggressive, but commonly-used, mitigation strategy aims to eliminate the side channels by adding delays such that every secret value yields a common functional observation. However, this strategy may often be impractical as it may result in unacceptable performance degradations of the response time. Assuming a well-known penalty function associated with the performance degradation, we study the problem of maximizing entropy while respecting a bound on the performance degradation. We dub the decision version of this problem Shannon mitigation.
Adding synthetic delays to the execution time of the program, so as to mask the side channel, can give rise to new functional observations that correspond to upper envelopes of various combinations of original observations. Let $F = \langle f_1, f_2, \ldots, f_k \rangle$ be the set of functional observations. For $I \subseteq \{1, 2, \ldots, k\}$, let $f_I$ be the upper envelope of the observations indexed by $I$, i.e., $f_I(p) = \max_{i \in I} f_i(p)$, and let $G(F) = \{ f_I : \emptyset \neq I \subseteq \{1, \ldots, k\} \}$ be the set of all observations obtainable in this way.

Mitigation Policies. Let $G \subseteq G(F)$ be a set of admissible post-mitigation observations. A mitigation policy is a function $\mu : F \to D(G)$ that for each secret $s \in S_f$ suggests the probability distribution $\mu(f)$ over the functional observations. We say that a mitigation policy is deterministic if for all $f \in F$ we have that $\mu(f)$ is a point distribution. Abusing notation, we represent a deterministic mitigation policy as a function $\mu : F \to G$. The semantics of a mitigation policy recommends to a program analyst a probability $\mu(f)(g)$ to elevate a secret input $s \in S_f$ from the observational class $f$ to the class $g \in G$ by adding $\max\{0, g(p) - f(p)\}$ units of delay to the corresponding execution time $\delta(s, p)$ for all $p \in Y$. We assume that the mitigation policies respect the order, i.e., for every mitigation policy $\mu$ and for all $f \in F$ and $g \in G$, we have that $\mu(f)(g) > 0$ implies that $f \prec g$. Let $M_{(F \to G)}$ be the set of mitigation policies from the set of observational clusters $F$ into the clusters $G$.
For the functional observations $F = \langle f_1, \ldots, f_k \rangle$ and a mitigation policy $\mu \in M_{(F \to G)}$, the resulting observation set $F[\mu] \subseteq G$ is defined as $F[\mu] = \{ g \in G : \mu(f)(g) > 0 \text{ for some } f \in F \}$. Since the mitigation policy is stochastic, we use average sizes of resulting observations to represent the fitness of a mitigation policy. For $F[\mu] = \langle g_1, g_2, \ldots, g_\ell \rangle$, we define their expected class sizes $B_\mu = \langle B_1, \ldots, B_\ell \rangle$ as $B_i = \sum_{f \in F} \mu(f)(g_i) \cdot |S_f|$. Assuming a uniform distribution on $S$, various entropies for the expected class sizes after applying a policy $\mu \in M_{(F \to G)}$ can be characterized by the following expressions:
1. $SE(S|F, \mu) = \frac{1}{|S|} \sum_{i=1}^{\ell} B_i \log_2 B_i$,
2. $GE(S|F, \mu) = \frac{1}{2|S|} \sum_{i=1}^{\ell} B_i^2 + \frac{1}{2}$,
3. $mGE(S|F, \mu) = \min_{1 \le i \le \ell} \frac{B_i + 1}{2}$.
We note that the above definitions do not represent the expected entropies, but rather entropies corresponding to the expected cluster sizes. However, the three quantities provide bounds on the expected entropies after applying $\mu$. Since Shannon and min-guess entropies are concave functions, from Jensen's inequality, we get that $SE(S|F, \mu)$ and $mGE(S|F, \mu)$ are upper bounds on the expected Shannon and min-guess entropies. Similarly, $GE(S|F, \mu)$, being a convex function, gives a lower bound on the expected guessing entropy.
We are interested in maximizing the entropy while respecting constraints on the overall performance of the system. We formalize the notion of performance by introducing performance penalties: there is a function $\pi : F \times G \to \mathbb{R}_{\ge 0}$ such that elevating from the observation $f \in F$ to the functional observation $g \in G$ adds an extra performance overhead of $\pi(f, g)$ to the program. The expected performance penalty associated with a policy $\mu$, written $\pi(\mu)$, is defined as the probabilistically weighted sum of the penalties, i.e., $\pi(\mu) = \sum_{f \in F, g \in G : f \prec g} |S_f| \cdot \mu(f)(g) \cdot \pi(f, g)$. Now, we introduce our key decision problem.
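To make the expected class sizes and the expected penalty tangible, the sketch below evaluates one stochastic policy on the three-cluster example from the introduction (class sizes 1, 4, 2 with running times 1, 5, 10 seconds). It assumes $G = F$, a totally ordered set of observations, a penalty equal to the time difference between clusters, and a policy encoded as a row-stochastic matrix `policy[i][j]` standing for $\mu(f_i)(f_j)$; all of these names and encodings are illustrative assumptions.

```python
def expected_class_sizes(sizes, policy):
    """B_j = sum over i of |S_i| * mu(f_i)(f_j)."""
    k = len(sizes)
    return [sum(sizes[i] * policy[i][j] for i in range(k)) for j in range(k)]


def expected_penalty(sizes, policy, penalty):
    """pi(mu) = sum over i < j of |S_i| * mu(f_i)(f_j) * pi(f_i, f_j)."""
    k = len(sizes)
    return sum(sizes[i] * policy[i][j] * penalty[i][j]
               for i in range(k) for j in range(i + 1, k))


def min_guess_entropy(expected_sizes):
    """(B + 1) / 2 minimized over the classes that remain observable (B > 0)."""
    return (min(b for b in expected_sizes if b > 0) + 1) / 2


sizes = [1, 4, 2]                                # secrets per timing cluster
penalty = [[0, 4, 9], [0, 0, 5], [0, 0, 0]]      # delay needed to move i -> j
policy = [[0, 1, 0],                             # cluster 1 always moves to cluster 2
          [0, 0.75, 0.25],                       # a quarter of cluster 2 moves to cluster 3
          [0, 0, 1]]

sizes_after = expected_class_sizes(sizes, policy)
print(sizes_after, expected_penalty(sizes, policy, penalty),
      min_guess_entropy(sizes_after))
```

With these numbers the stochastic policy reaches a min-guess entropy of 2 at a total penalty of 9, a value that a deterministic policy in this example can only match by merging all three clusters at a much higher penalty.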
Definition 2 (Shannon Mitigation). Given a set of functional observations $F = \langle f_1, \ldots, f_k \rangle$, a set of admissible post-mitigation observations $G \subseteq G(F)$, a set of secrets $S$, a penalty function $\pi : F \times G \to \mathbb{R}_{\ge 0}$, a performance penalty upper bound $\Delta \in \mathbb{R}_{\ge 0}$, and an entropy lower bound $E \in \mathbb{R}_{\ge 0}$, the Shannon mitigation problem $\textsc{Shan}_{\mathcal{E}}(F, G, S, \pi, E, \Delta)$, for a given entropy measure $\mathcal{E} \in \{SE, GE, mGE\}$, is to decide whether there exists a mitigation policy $\mu \in M_{(F \to G)}$ such that $\mathcal{E}(S|F, \mu) \ge E$ and $\pi(\mu) \le \Delta$. In the deterministic Shannon mitigation variant, the goal is to find a deterministic policy with these properties.

Deterministic Shannon Mitigation
We first establish the intractability of the deterministic variant.

Proof. It is easy to see that the deterministic Shannon mitigation problem is in NP: one can guess a certificate as a deterministic mitigation policy $\mu \in M_{(F \to G)}$ and can verify in polynomial time that it satisfies the entropy and overhead constraints. Next, we sketch the hardness proof for the min-guess entropy measure by providing a reduction from the two-way partitioning problem [31]. For the Shannon entropy and guessing entropy measures, a reduction can be established from the Shannon capacity problem [19] and the Euclidean sum-of-squares clustering problem [8], respectively.
Given a set $A = \{a_1, a_2, \ldots, a_k\}$ of integer values, the two-way partitioning problem is to decide whether there is a partition $A_1 \uplus A_2 = A$ into two sets $A_1$ and $A_2$ with equal sums, i.e., $\sum_{a \in A_1} a = \sum_{a \in A_2} a$. W.l.o.g., assume that $a_i \le a_j$ for $i \le j$. We reduce this problem to a deterministic Shannon mitigation problem. If $\sum_{1 \le i \le k} a_i$ is odd, then the solution to the two-way partitioning instance is trivially no. Otherwise, let $E_A = (1/2) \sum_{1 \le i \le k} a_i$. Notice that any deterministic mitigation strategy that achieves min-guess entropy larger than or equal to $E_A$ must have at most two clusters. On the other hand, the best min-guess entropy value can be achieved by having just a single cluster. To avoid this and force two clusters corresponding to the two partitions of a solution to the two-way partitioning problem instance $A$, we introduce performance penalties such that merging more than $k - 2$ clusters is disallowed, by setting the performance penalty $\pi_A(f, g) = 1$ and the performance overhead bound $\Delta_A = k - 2$. It is straightforward to verify that an instance of the resulting min-guess entropy problem has a yes answer if and only if the two-way partitioning instance does.
Since the deterministic Shannon mitigation problem is intractable, we design an approximate solution for the problem. Note that the problem is hard even if we only use existing functional observations for mitigation, i.e., $G = F$. Therefore, we consider this case for the approximate solution. Furthermore, we assume the following sequential dominance restriction on a deterministic policy $\mu$: for $f, g \in F$, if $f \prec g$ then either $\mu(f) \prec g$ or $\mu(f) = \mu(g)$. In other words, for any given $f \prec g$, $f$ cannot be moved to a higher cluster than $g$ without $g$ also being moved to that cluster. For example, Fig 4(a) shows a Shannon mitigation problem with four functional observations and all possible mitigation policies (we represent $\mu(f_i)(f_j)$ with $\mu(i, j)$). Fig 4(b) satisfies the sequential dominance restriction, while Fig 4(c) does not.
The search for deterministic policies satisfying the sequential dominance restriction can be performed efficiently using dynamic programming with memoization of intermediate results.
Algorithm (1) provides pseudocode for the dynamic programming solution that finds a deterministic mitigation policy satisfying sequential dominance.
The key idea is to start by considering policies that produce a single cluster for subclasses $P_i$ of the problem with the observations $f_1, \ldots, f_i$, and then compute policies producing one additional cluster in each step by utilizing the previously computed sub-problems and keeping track of the performance penalties. The algorithm terminates as soon as the solution of the current step respects the performance bound. The complexity of the algorithm is $O(k^3)$.
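The flavor of such a dynamic program can be illustrated in Python. The sketch below is not Algorithm (1): it is a simpler, slower variant written for this illustration, restricted to the min-guess entropy objective with $G = F$ and totally ordered observations. It assumes that a policy splits the ordered observations into contiguous groups and delays every member of a group to the group's highest observation; for each candidate minimum class size it computes the cheapest such split and returns the largest size that fits the budget. All function names are hypothetical.

```python
from itertools import accumulate

INF = float("inf")


def cheapest_partition(sizes, penalty, min_size):
    """Cheapest split into contiguous groups with at least `min_size` secrets each."""
    k = len(sizes)
    prefix = [0] + list(accumulate(sizes))
    best = [INF] * (k + 1)    # best[j]: minimal penalty for observations 0..j-1
    cut = [0] * (k + 1)       # cut[j]: first index of the last group in that split
    best[0] = 0.0
    for j in range(1, k + 1):
        top = j - 1           # the group ending here is merged into observation `top`
        for i in range(j):
            if best[i] == INF or prefix[j] - prefix[i] < min_size:
                continue
            cost = best[i] + sum(sizes[t] * penalty[t][top] for t in range(i, j))
            if cost < best[j]:
                best[j], cut[j] = cost, i
    if best[k] == INF:
        return INF, None
    target, j = [0] * k, k
    while j > 0:              # walk the cuts backwards to recover the policy
        for t in range(cut[j], j):
            target[t] = j - 1
        j = cut[j]
    return best[k], target


def deterministic_policy(sizes, penalty, budget):
    """Return (min-guess entropy, penalty, target index per observation) or None."""
    k = len(sizes)
    prefix = [0] + list(accumulate(sizes))
    candidates = {prefix[j] - prefix[i] for i in range(k) for j in range(i + 1, k + 1)}
    for min_size in sorted(candidates, reverse=True):
        cost, target = cheapest_partition(sizes, penalty, min_size)
        if cost <= budget:
            return (min_size + 1) / 2, cost, target
    return None


sizes = [1, 4, 2]
penalty = [[0, 4, 9], [0, 0, 5], [0, 0, 0]]
print(deterministic_policy(sizes, penalty, budget=10))
print(deterministic_policy(sizes, penalty, budget=29))
```

On the three-cluster example from the introduction, a budget of 10 only allows the first cluster to be merged into the second, while a budget of 29 allows all clusters to be merged. This variant trades the early termination of the step-by-step scheme described above for a more direct statement of the objective.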

Stochastic Shannon Mitigation Algorithm
Next, we solve the (stochastic) Shannon mitigation problem by posing it as an optimization problem. Consider the stochastic Shannon mitigation problem $\textsc{Shan}_{\mathcal{E}}(F, G = F, S_F, \pi, E, \Delta)$ with a stochastic policy $\mu : F \to D(G)$ and $S_F = \langle S_1, \ldots, S_k \rangle$. Writing $\mu(i, j)$ for $\mu(f_i)(f_j)$ and $B_j$ for the expected size of class $j$, the optimization problem is:
Maximize $\mathcal{E}$, subject to:
(1) $0 \le \mu(i, j) \le 1$ for $1 \le i \le j \le k$,
(2) $\sum_{j=i}^{k} \mu(i, j) = 1$ for $1 \le i \le k$,
(3) $\sum_{i=1}^{k} \sum_{j=i}^{k} |S_i| \cdot \mu(i, j) \cdot \pi(f_i, f_j) \le \Delta$,
(4) $\sum_{i=1}^{j} |S_i| \cdot \mu(i, j) = B_j$ for $1 \le j \le k$.
Here, the objective function $\mathcal{E}$ is one of the following functions: $\sum_{j=1}^{k} B_j^2$ for guessing entropy, $\sum_{j=1}^{k} B_j \log_2 B_j$ for Shannon entropy, and $\min \{ B_j : 1 \le j \le k, B_j > 0 \}$ for min-guess entropy. Conditions (1) and (2) express that $\mu$ provides probability distributions, condition (3) provides the restriction regarding the performance constraint, and condition (4) is the entropy-specific constraint. The objective function of the optimization problem is defined based on the entropy criterion $\mathcal{E}$. For simplicity, we omit the constant terms from the objective function definitions. For the guessing entropy, the problem is an instance of a linearly constrained quadratic optimization problem [36]. The problem with Shannon entropy is a non-linear optimization problem [12]. Finally, the optimization problem with min-guess entropy is an instance of mixed integer programming [35]. We evaluate the scalability of these solvers empirically in Section 6 and leave the exact complexity as an open problem. We show that the problem with the min-guess entropy objective can be solved efficiently with branch-and-bound algorithms [39]. Fig 4(b,c) show two instantiations of the mitigation policies that are possible for the stochastic mitigation.

Implementation Details
A. Environmental Setups. All timing measurements are conducted on an Intel NUC5i5RYH. We switch off JIT compilation, run each experiment multiple times, and use the mean running time. This helps to reduce the effects of environmental factors such as garbage collection. All other analyses are conducted on an Intel i5-2.7 GHz machine.

B. Implementation of Side Channel Discovery. We use the technique presented in [49] for the side channel discovery. The technique applies functional data analysis [41] to create a B-spline basis and fit functions to the vector of timing observations for each secret value. Then, the technique applies functional data clustering [23] to obtain K classes of observations. We use the number of secret values in a cluster as the class size metric and the $L_1$ distance norm between the clusters as the penalty function.

C. Implementation of Mitigation Policy Algorithms. For the stochastic optimization, we encode the Shannon entropy and guessing entropy with linear constraints in Scipy [25]. Since the objective functions are non-linear (for the Shannon entropy) and quadratic (for the guessing entropy), Scipy uses sequential least squares programming (SLSQP) [37] to maximize the objectives. For the stochastic optimization with the min-guess entropy, we encode the problem in Gurobi [21] as a mixed-integer programming (MIP) problem [35]. Gurobi solves the problem efficiently with branch-and-bound algorithms [1]. We use Java to implement the dynamic programming.

D. Implementation of Enforcement. The enforcement of the mitigation policy is implemented in two steps. First, we use the initial timing functions and characterize them with program internal properties such as basic block calls. To do so, we use the decision tree learning approach presented in [49]. The decision tree model characterizes each functional observation with properties of program internals.
Second, given the mitigation policy, we enforce it with a monitoring system implemented on top of the Javassist [16] library.
The monitoring system uses the decision tree model and matches the properties enabled during an execution with the tree model (detection of the current cluster). Then, it adds extra delays, based on the mitigation policy, to the current execution time and enforces the mitigation policy. Note that the dynamic monitoring can result in delays of a few microseconds. For programs with timing differences in the order of microseconds, we transform the source code using the decision tree model. The transformation requires manual effort to modify and compile the new program, but it adds negligible delays.
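The enforcement step can be pictured with a small Python sketch. It is not the Javassist-based monitor described above: it only illustrates the padding rule, assuming three hypothetical callables supplied by the caller, namely `run` (executes the program and returns a result together with a trace of program internals), `detect_cluster` (stands in for the decision tree), and `target_time` (the running time of the post-mitigation observation for a cluster and public input). The `sleep` and `clock` parameters exist only so that the sketch can be exercised without real waiting.

```python
import time


def required_delay(observed, target):
    """Padding that lifts an execution time to its post-mitigation target."""
    return max(0.0, target - observed)


def monitored(run, detect_cluster, target_time,
              sleep=time.sleep, clock=time.perf_counter):
    """Wrap `run` so that its response time follows the mitigation policy."""
    def wrapper(public_input):
        start = clock()
        result, trace = run(public_input)
        cluster = detect_cluster(trace)          # current functional observation
        elapsed = clock() - start
        sleep(required_delay(elapsed, target_time(cluster, public_input)))
        return result
    return wrapper


# Dry run with a fake clock: the program "takes" 2 time units, the target is 5.
ticks = iter([0.0, 2.0])
slept = []
padded = monitored(run=lambda p: ("ok", {"set_bits": 3}),
                   detect_cluster=lambda trace: trace["set_bits"],
                   target_time=lambda cluster, p: 5.0,
                   sleep=slept.append, clock=lambda: next(ticks))
print(padded(7), slept)
```

For a stochastic policy, `target_time` would additionally sample the destination cluster according to the policy's probabilities; that step is omitted here.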
E. Micro-benchmark Results. Our goal is to compare different mitigation methods in terms of their security and performance. We examine the computation time of our tool Schmit in calculating the mitigation policies. See appendix for the relationships between performance bounds and entropy measures.
Applications: Mod Exp applications [33] are instances of square-and-multiply modular exponentiation ($R = y^k \bmod n$) used for secret key operations in RSA [43]. Branch

For the min-guess entropy, we observe that both stochastic and dynamic programming approaches are efficient and fast as shown in Fig 5(a). For the Shannon and guessing entropies, the dynamic programming is scalable, while the stochastic mitigation is computationally expensive beyond 60 classes of observations as shown in Fig 5(b,c).
Mitigation Algorithm Comparisons: Tab 1 shows micro-benchmark results that compare the four mitigation algorithms on the two program series. First, the double scheme mitigation technique [10] does not provide guarantees on the performance overhead, and we can see that the overhead increases by more than 75 times for mod exp 6. The double scheme method reduces the number of classes of observations. However, we observe that this mitigation has difficulty improving the min-guess entropy. Second, the bucketing algorithm [29] can guarantee the performance overhead, but it is not an effective method to improve the security of functional observations; see the examples mod exp 6 and Branch and Loop 6. Third, Schmit guarantees that the performance overhead stays below a given bound, while it results in the highest entropy values. In most cases, the stochastic optimization technique achieves the highest min-guess entropy value. Here, we show the results with the min-guess entropy measure. We also have strong evidence that Schmit achieves higher Shannon and guessing entropies. For example, in B L 5, the initial Shannon entropy has improved from 2.72 to 6.62, 4.1, 7.56, and 7.28 for the double scheme, the bucketing, the stochastic, and the deterministic algorithms, respectively.

Case Study
Research Question. Does Schmit scale well and improve the security of applications (entropy measures) within the given performance bounds?

Methodology. We use the deterministic and stochastic algorithms for mitigating the leaks. We show our results for the min-guess entropy, but other entropy measures can be applied as well. Since the task is to mitigate existing leakages, we assume that the secret and public inputs are given.

GabFeed is a chat server with 573 methods [4]. There is a side channel in the authentication part of the application, where the application takes users' public keys and its own private key and generates a common key [15]. The vulnerability leaks the number of set bits in the secret key. Initial functional observations are shown in Fig 6a. There are 34 clusters and the min-guess entropy is 1. We aim to maximize the min-guess entropy under the performance overhead of 50%.

Jetty. We mitigate the side channels in the util.security package of the Eclipse Jetty web server. The package has a Credential class which had a timing side channel. This vulnerability was analyzed in [15] and fixed initially in [6]. Then, the developers noticed that the implementation in [6] can still leak information and fixed this issue with a new implementation in [5]. However, this new implementation is still leaking information [49]. We apply Schmit to mitigate this timing side channel. Initial functional observations are shown in Fig 6d. There are 20 classes of observations and the initial min-guess entropy is 4.5. We aim to maximize the min-guess entropy under the performance overhead of 50%.

Java Verbal Expressions is a library with 61 methods that construct regular expressions [2]. There is a timing side channel in the library similar to the password comparison vulnerability [3] if the library has secret inputs. In this case, starting from the initial character of a candidate expression, if the character matches the regular expression, it takes slightly more time to respond to the request than otherwise. This vulnerability can leak all the regular expressions. We consider regular expressions to have a maximum size of 9. There are 9 classes of observations and the initial min-guess entropy is 50.5. We aim to maximize the min-guess entropy under the performance overhead of 50%.

Password Checker. We consider the password matching example from the loginBad program [9]. The password stored in the server is secret, and the user's guess is a public input. We consider 20 secret (lengths at most 6) and 2,620 public inputs. There are 6 different clusters, and the initial min-guess entropy is 1.

Findings for GabFeed. With the stochastic algorithm, Schmit calculates a mitigation policy that results in 4 clusters. This policy improves the min-guess entropy from 1 to 138.5 and adds an overhead of 42.8%. With the deterministic algorithm, Schmit returns 3 clusters. The performance overhead is 49.7% and the min-guess entropy improves from 1 to 106. The user chooses the deterministic policy and enforces the mitigation. We apply CART decision tree learning and characterize the classes of observations with GabFeed method calls as shown in

Related Work
Quantitative theory of information has been widely used to measure how much information is being leaked with side-channel observations [45,28,11,22]. Mitigation techniques increase the remaining entropy of secret sets leaked through the side channels, while considering the performance [29,10,52,53,26,44]. Köpf and Dürmuth [29] use a bucketing algorithm to partition programs' observations into intervals. With the unknown-message threat model, Köpf and Dürmuth [29] propose a dynamic programming algorithm to find the optimal number of possible observations under a performance penalty. The works [10,52] introduce different black-box schemes to mitigate leaks. In particular, Askarov et al. [10] show that the quantizing time techniques, which permit events to be released at scheduled constant slots, have the worst-case leakage if the slot is not filled with events. Instead, they introduce the double scheme method that has a schedule of predictions like the quantizing approach, but if the event source fails to deliver events at the predicted time, the failure results in generating a new schedule in which the interval between predictions is doubled. We compare our mitigation technique with both algorithms throughout this paper.
Elimination of timing side channels is a common technique to guarantee the confidentiality of software [7,34,50,18,30,33]. The work [50] aims to eliminate side channels using static analysis enhanced with various techniques to keep the performance overheads low without guaranteeing the amounts of overhead. In contrast, we use dynamic analysis and allow a small amount of information to leak, but we guarantee an upper-bound on the performance overhead.
Machine learning techniques have been used for explaining timing differences between traces [46,47,48]. Tizpaz-Niari et al. [48] consider performance issues in software. They also cluster execution times of programs and then explain what program properties distinguish the different functional clusters. We adopt their techniques for our security problem.

Overview of Schmit
Schmit consists of three components:

1) Initial Security Analysis. Inspired by [49], for each secret value, we use a B-spline basis [42] in general to model arbitrary timing functions of secret values in the domain of public inputs, but we also allow simpler functional models such as polynomial functions. We use non-parametric functional clustering [20] with hierarchical algorithms [24] to obtain the initial classes of observations or clusters. The clustering algorithm groups timing functions that are close to each other in the same cluster. The size of a class is the number of secret values in the cluster. The $L_1$-norm distance between clusters forms the penalty matrix.
Highlight. This step finds the classes of observations over secret values using functional clustering and returns the label (cluster) of each secret value and the distance (as a penalty) between the clusters.

2) Mitigation policy. We use the policy algorithms (Section 5) to calculate the mitigation policy given the clusters, their sizes, and their distances. We use two types of algorithms: deterministic and stochastic. The deterministic algorithm is an instance of dynamic programming implemented in Java. The stochastic algorithms have three variants for the three types of information-theoretic measure. The variant based on min-guess entropy is the main emphasis of this paper and is implemented using Gurobi [21]. The two other variants (for Shannon and guessing entropies) are implemented in Python using the Scipy library [25]. See Section 6(C) for further details.
Highlight. This step calculates the mitigation policy that shows how to merge different clusters to maximize an information theory criterion given an upperbound on the amount of performance overhead.

3) Enforcement of mitigation policy.
In the first step, we characterize each class of observation with program internal properties. We use decision tree algorithms [13] to characterize each class of observation with the corresponding program internal features. Fig 2(b) in Section 2 is an example of a decision tree model that characterizes each class of observation of Fig 1(b) in Section 2 with the basic block invocations at line 16    method. In the second step, we enforce the mitigation policy. This step can be done either automatically with a monitoring system at run-time or semi-automatically with a source code transformation. The enforcement uses the decision tree model and matches the properties enabled during an execution with the tree model. Then, it adds extra delays, based on the mitigation policy, to the execution in order to enforce the mitigation policy. The result of the mitigation can be verified by applying the clustering algorithm to the mitigated execution times.
Highlight. This step uses the functional clusters and the decision tree model and enforces the mitigation policy either with a monitoring system at run-time or with source code transformations. The clustering algorithm over the mitigated execution times can be used to verify the mitigation model.

Fig 9(a) shows that the stochastic optimization improves the entropy gradually from 95 to 186 by relaxing the bound. However, the dynamic programming has only improved when the performance bound exceeds 1.0. For the Shannon and guessing entropy, Fig 9(b) shows how Schmit improves the entropy with relaxing the performance bounds.