On the analysis of random replacement caches using static probabilistic timing methods for multipath programs
Abstract
Probabilistic hard real-time systems, based on hardware architectures that use a random replacement cache, provide a potential means of reducing the hardware over-provision required to accommodate pathological scenarios and the associated extremely rare, but excessively long, worst-case execution times that can occur in deterministic systems. Timing analysis for probabilistic hard real-time systems requires the provision of probabilistic worst-case execution time (pWCET) estimates. The pWCET distribution can be described as an exceedance function which gives an upper bound on the probability that the execution time of a task will exceed any given execution time budget on any particular run. This paper introduces a more effective static probabilistic timing analysis (SPTA) for multipath programs. The analysis estimates the temporal contribution of an evict-on-miss, random replacement cache to the pWCET distribution of multipath programs. The analysis uses a conservative join function that provides a proper over-approximation of the possible cache contents and the pWCET distribution on path convergence, irrespective of the actual path followed during execution. Simple program transformations are introduced that reduce the impact of path indeterminism while ensuring sound pWCET estimates. Evaluation shows that the proposed method is efficient at capturing locality in the cache, and substantially outperforms the only prior approach to SPTA for multipath programs based on path merging. The evaluation results show incomparability with analysis for an equivalent deterministic system using an LRU cache. For some benchmarks the performance of LRU is better, while for others, the new analysis techniques show that random replacement has provably better performance.
Keywords
Cache analysis, Probabilistic timing analysis, Random replacement policy, Multipath
1 Extensions

we introduce and prove additional properties relevant to the comparison of the contribution of different cache states to the probabilistic worst-case execution time of tasks in Sect. 3;

an improved join transfer function, used to safely merge states from converging paths, is introduced in Sect. 5 and by construction dominates the simple join introduced in Lesage et al. (2015a);

we present and prove the validity of path renaming in Sect. 6 which allows the definition of additional transformations to reduce the set of paths considered during analysis;

our evaluation explores new configurations in terms of both the analysis methods used and the benchmarks considered (see Sect. 7).
2 Introduction
Real-time systems such as those deployed in space, aerospace, automotive and railway applications require guarantees that the probability of the system failing to meet its timing constraints is below an acceptable threshold (e.g. a failure rate of less than \(10^{-9}\) per hour for some aerospace and automotive applications). Advances in hardware technology and the large gap between processor and memory speeds, bridged by the use of cache, make it difficult to provide such guarantees without significant over-provision of hardware resources.
The use of deterministic cache replacement policies means that pathological worst-case behaviours need to be accounted for, even when in practice they may have a vanishingly small probability of actually occurring. The use of cache with a random replacement policy means that the probability of pathological worst-case behaviours can be upper bounded at quantifiably extremely low levels, for example well below the maximum permissible failure rate (e.g. \(10^{-9}\) per hour) for the system. This allows the extreme worst-case behaviours to be safely ignored, instead of always being included in the estimated worst-case execution times.
The random replacement policy further offers a trade-off between performance and cost thanks to a minimal hardware cost (Al-Zoubi et al. 2004). The policy and variants have been implemented in a selection of embedded processors (Hennessy and Patterson 2011) such as the ARM Cortex series (2010), or the Freescale MPC8641D (2008). Randomisation further offers some level of protection against side-channel attacks which allow the leakage of information regarding the running tasks. While methods relying solely on the random replacement policy may still be circumvented (Spreitzer and Plos 2013), the definition of probabilistic timing analysis is a step towards the analysis of other approaches such as randomised placement policies (Wang and Lee 2007; 2008).
The timing behaviour of programs running on a processor with a cache using a random replacement policy can be determined using static probabilistic timing analysis (SPTA). SPTA computes an upper bound on the probabilistic Worst-Case Execution Time (pWCET) in terms of an exceedance function. This exceedance function gives the probability, as a function of all possible values for an execution time budget x, that the execution time of the program will exceed that budget on any single run. The reader is referred to Davis et al. (2013) for examples of pWCET distributions, and to Cucu-Grosjean (2013) for a detailed discussion of what is meant by a pWCET distribution.
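As an illustration of the exceedance function described above, the following minimal sketch evaluates it for a small discrete execution time distribution. The distribution values and the function name `exceedance` are hypothetical and chosen only for the example; the comparison is strict, following the wording "exceed that budget".

```python
from fractions import Fraction

def exceedance(pmf, budget):
    """Probability that the execution time strictly exceeds `budget`."""
    return sum((p for time, p in pmf.items() if time > budget), Fraction(0))

# Hypothetical execution time distribution: time in cycles -> probability.
pmf = {10: Fraction(7, 10), 20: Fraction(2, 10), 50: Fraction(1, 10)}

# Evaluate the exceedance function at a few candidate budgets.
for budget in (10, 20, 50):
    print(budget, exceedance(pmf, budget))
```

A pWCET distribution in this sense is any distribution whose exceedance function is no smaller than that of the true execution time distribution at every budget.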
This paper introduces an effective SPTA for multipath programs running on hardware that uses an evict-on-miss, random replacement cache. Prior work on SPTA for multipath programs by Davis et al. (2013) used a path merging approach to compute cache hit probabilities based on reuse distances. The analysis derived in this paper builds upon more sophisticated SPTA techniques for the analysis of single path programs given by Altmeyer and Davis (2014, 2015). This new analysis provides substantially improved results compared to the path merging approach. To allow the analysis of the behaviour of caches in isolation, we assume the existence of a valid decomposition of the architecture with regard to cache effects with bounded hit and miss latencies (Hahn et al. 2015).
2.1 Related work
We now set the work on SPTA in context with respect to related work on both probabilistic hard real-time systems and cache analysis for deterministic replacement policies. The methods introduced in this paper belong to the realm of analyses that estimate bounds on the execution time of a program. These bounds may be classified as either a worst-case probability distribution (pWCET) or a worst-case value (WCET).
The first class is a more recent research area with the first work on providing bounds described by probability distributions published by Edgar and Burns (2000, 2001). The methods for obtaining such distributions can be categorised into three different families: measurement-based probabilistic timing analyses, static probabilistic timing analyses, and hybrid probabilistic timing analyses.
The second class is a mature area of research and the interested reader may refer to Wilhelm et al. (2008) for an overview of these methods. A specific overview of cache analysis for deterministic replacement policies together with a comparison between deterministic and random cache replacement policies is provided at the end of this section.
2.1.1 Probabilistic timing analyses
Measurement-based probabilistic timing analyses (Bernat et al. 2002; Cucu-Grosjean et al. 2012) collect observations on the execution time of the task under study on the target hardware. These observations are then combined, e.g. through the use of extreme value theory (Cucu-Grosjean et al. 2012), to produce the desired worst-case probabilistic timing estimate. Extreme value theory may potentially underestimate the pWCET of a program as shown by Griffin and Burns (2010). The work of Cucu-Grosjean et al. (2012) overcomes this limitation and also introduces the appropriate statistical tests required to treat worst-case execution times as rare events. The soundness of the results produced by such methods is tied to the observed execution times, which should be representative of the ones at runtime. This places a responsibility on the user, who is expected to provide input data that exercise the worst-case paths, lest the analysis result in unsound estimates (Lesage et al. 2015b). These methods nonetheless exhibit the benefits of time-randomised architectures. The occurrence probability of pathological temporal cases can be bounded and safely ignored provided they meet requirements expressed in terms of failure rates.
Path upper-bounding (Kosmidis et al. 2014) defines a set of program transformations to alleviate the responsibility of the user to provide inputs which cover all execution paths. The alternative paths of conditional constructs are padded with semantic-preserving instructions and memory accesses such that any path followed in the modified program is an upper-bound of any of the original alternatives. Measurement-based analyses can then be performed on the modified program as the paths exercised at runtime upper-bound any alternative in the original application. Hence, upper-bounding creates a distinction between the original code and the measured one. It may also result in paths which are the sum of the original alternatives.
Hybrid probabilistic timing analyses apply measurement-based methods at the level of subprograms or blocks of code, and then use operations such as convolution to combine the resulting bounds into a pWCET for the entire program. The main principles of hybrid analysis were introduced by Bernat et al. (2002, 2003) with execution time probability distributions estimated at the level of subprograms. Here, dependencies may exist among the probability distributions of the subprograms and copulas are used to describe them (Bernat et al. 2005).
By contrast, SPTAs derive the pWCET distribution for a program by analysing the structure of the program and modelling the behaviour of the hardware it runs on. Existing work on SPTA has primarily focussed on randomised architectures containing caches with random replacement policies. Initial results for the evict-on-miss (Quinones et al. 2009) and evict-on-access (Cucu-Grosjean et al. 2012; Cazorla et al. 2013) policies were derived for single path programs. These methods use the reuse distance of each access to determine its probability of being a cache hit. These results were superseded by later work by Davis et al. (2013) who derived an optimal lower bound on the probability of a cache hit under the evict-on-miss policy, and showed that evict-on-miss dominates evict-on-access. Altmeyer and Davis (2014) proved the correctness of the lower bound derived in Davis et al. (2013), and its optimality with regard to the limited information that it uses (i.e. the reuse distance). They also showed that the probability functions previously given in Kosmidis et al. (2013) and Quinones et al. (2009) are unsound (optimistic) for use in SPTA. In 2013, a simple SPTA for multipath programs was introduced by Davis et al. (2013), based on path merging. With this method, accesses are represented by their reuse distances. The program is then virtually reduced to a single sequence which upper-bounds all possible paths with regard to the reuse distance of their accesses.
In 2014, more sophisticated SPTA methods for single path programs were derived by Altmeyer and Davis (2014). They introduced the notion of cache contention, which combined with reuse distance enables the computation of a more precise bound on the probability that a given access is a cache hit. Altmeyer and Davis (2014) also introduced a significantly more effective method based on combining exhaustive evaluation of the cache behaviour for a limited number of relevant memory blocks with cache contention. This method provides an effective trade-off between analysis precision and tractability. Griffin et al. (2014a) introduce orthogonal lossy compression methods on top of the cache state enumeration to improve the trade-off between complexity and precision.
Altmeyer and Davis further refined their approach to SPTA for single path programs in 2015 (Altmeyer et al. 2015), bridging the gap between contention-based and enumeration-based analyses. The method relies on simulation of the behaviour of a random replacement cache. As opposed to exhaustive state analyses however, the focus at each step is on a single cache state to capture the outcome across all possible states. The resulting approach offers improved precision over contention-based methods, at a lower complexity than exhaustive state analyses.
In this paper, we build upon the state-of-the-art approach (Altmeyer and Davis 2014), extending it to multipath programs. The techniques introduced in the following notably allow for the identification on control flow convergence of relevant cache contents, i.e. the identification of the outcomes in multipath programs. The approach focuses on the enumeration of possible cache states at each point in the program. To reduce the complexity of such an approach, only a few blocks, identified as the most relevant, are analysed at a given time.
2.1.2 Deterministic architectures and analyses
Static timing analysis for deterministic caches (Wilhelm et al. 2008) relies on a two-step approach with a low-level analysis to classify the cache accesses into hits and misses (Theiling et al. 1999) and a high-level analysis to determine the length of the worst-case path (Li and Malik 2006). The most common deterministic replacement policies are least-recently used (LRU), first-in first-out (FIFO) and pseudo-LRU (PLRU). Due to the high predictability of the LRU policy, academic research typically focusses on LRU caches, with a well-established LRU cache analysis based on abstract interpretation (Alt et al. 1996; Theiling et al. 1999). Only recently, analyses for FIFO (Grund and Reineke 2010) and PLRU (Grund and Reineke 2010; Griffin et al. 2014b) have been proposed, both with a higher complexity and lower precision than the LRU analysis due to specific features of the replacement policies. Despite the focus on LRU caches and its analysability, FIFO and PLRU are often preferred in processor designs due to the lower implementation costs which enable higher associativities.
Recently, Reineke (2014) observed that SPTA based on reuse distances (Davis et al. 2013) results, by construction, in less precise bounds than existing analyses based on stack distance for an equivalent system with an LRU cache (Wilhelm et al. 2008). However, this does not hold for the more sophisticated SPTA based on cache contention and collecting semantics given by Altmeyer and Davis (2014). Analyses for deterministic LRU caches are incomparable with these analyses for random replacement caches. This is illustrated by our evaluation results. It can also be seen by considering simple examples such as a repeated sequence of accesses to five memory blocks \(\langle a,b,c,d,e,a,b,c,d,e\rangle \) with a four-way associative cache. With LRU, no hits can be predicted. By contrast, with a random replacement cache and SPTA based on cache contention, four out of the last five accesses can be assumed to have a non-zero probability of being a cache hit (as shown in Table 1 of Altmeyer and Davis 2014), hence SPTA for a random replacement cache outperforms analysis of LRU in this case. We note that in spite of recent efforts (de Dinechin et al. 2014) the stateless random replacement policies have lower silicon costs than LRU, and so can potentially provide improved real-time performance at lower hardware cost.
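The behaviour behind the five-block example above can be reproduced with a small simulation. This is only a sketch of the two replacement policies on that trace, not an SPTA: it counts hits on simulated runs rather than deriving bounds. The function names, the fixed seed and the number of runs are arbitrary choices made for the illustration, and the cache is assumed to be initially empty.

```python
import random

def lru_hits(trace, ways):
    """Count hits for a fully-associative LRU cache, initially empty."""
    cache = []  # most recently used block is last
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.remove(block)
        elif len(cache) == ways:
            cache.pop(0)  # evict the least recently used block
        cache.append(block)
    return hits

def random_hits(trace, ways, rng):
    """Count hits for one run of an evict-on-miss random replacement cache."""
    lines = [None] * ways
    hits = 0
    for block in trace:
        if block in lines:
            hits += 1
        else:
            lines[rng.randrange(ways)] = block  # replace a random line
    return hits

trace = list("abcdeabcde")
rng = random.Random(0)  # fixed seed, chosen arbitrarily
runs = 10_000
mean_random_hits = sum(random_hits(trace, 4, rng) for _ in range(runs)) / runs

print("LRU hits:", lru_hits(trace, 4))
print("mean random replacement hits:", mean_random_hits)
```

Under LRU every access in the trace misses, since each block is evicted just before it is reused, whereas the random replacement runs hit on some accesses.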
Early work (David and Puaut 2004; Liang and Mitra 2008) in the domain of SPTA for deterministic architectures relied for its correctness on knowledge of the probability that a specific path would be taken or that specific input data would be encountered; however, in general such assumptions may not be available. The analysis given in this paper does not require any assumption about the probability distribution of different paths or inputs. It relies only on the random selection of cache lines for replacement.
2.2 Organisation
In this paper, we introduce a set of methods that are required for the application of SPTA to multipath programs. Section 2 recaps the assumptions and methods upon which we build. These were used in previous work (Altmeyer and Davis 2014) to upper-bound the pWCET distribution of a trace corresponding to a single path program. We then proceed by defining key properties which allow the ordering of cache states w.r.t. their contribution to the pWCET of a program (Sect. 3). We address the issue of multipath programs in the context of SPTA in Sect. 4. This includes the definition of conservative (over-approximate) join functions to collect information regarding cache contention, possible cache contents, and the pWCET distribution at each program point, irrespective of the path followed during execution. Further improvements on cache state conservation at control flow convergence are introduced in Sect. 5. Section 6 introduces simple program transformations which improve the precision of the analysis while ensuring that the pWCET distribution of the transformed program remains sound (i.e. upper-bounds that of the original). Multipath SPTA is applied to a selection of benchmarks in Sect. 7 and the precision and runtime of the different approaches are compared. Section 8 concludes with a summary of the main contributions of the paper and a discussion of future work.
3 Static probabilistic timing analysis
In this section, we recap state-of-the-art SPTA techniques for single path programs (Altmeyer and Davis 2014). We first give an overview of the system model assumed throughout the paper in Sect. 2.1. We then recap the existing methods (Altmeyer and Davis 2014) to evaluate the pWCET of a single path trace using a collecting approach (Sect. 2.2) supplemented by a contention one. The pertinence of the model is discussed at the end of this section. The notations introduced in the present contributions are summarised in Table 1.
Table 1 Summary of introduced notations
Notation  Description
pWCET  Upper-bound on the execution time distribution of a program over all paths
\(\mathcal {H}\)  Upper-bound on the latency incurred by a cache hit
\(\mathcal {M}\)  Upper-bound on the latency incurred by a cache miss
N  Cache associativity
\(\mathbb {E}\)  Set of accessed cache blocks
\(\mathbb {E}^{\bot } \)  Set of accessed cache blocks including non-relevant elements \(\bot \)
\(t = [e_1,\ldots ,e_i]\)  A trace, a sequence of accesses to memory blocks
\(\mathcal {D}\)  Execution time or cache miss probabilistic distribution
\(\mathcal {D}(x)\)  Occurrence probability of execution time x
\(P(\mathcal {D} \ge x)\)  Likelihood that distribution \(\mathcal {D}\) exceeds execution time x
\(s \in \mathbb {CS}\)  Analysed cache state
\((C,P,\mathcal {D}) = s \)  Analysed cache state including: C, the cache contents, set of blocks known to be present in cache; P, the occurrence probability of the cache state at a specific program point; \(\mathcal {D}\), the execution time distribution up to a specific program point
\(\mathcal {D}_{\textit{init}}\)  Initial, empty execution time distribution
\(S \in 2^{\mathbb {CS}}\)  Set of possible cache states at a specific program point
\(S \uplus S'\)  Weighted merge on cache states, merge probability and distributions for cache states with identical contents
u(s, e)  Update cache state s upon access to element e, replacing a line and increasing the corresponding distribution \(\mathcal {D}\) upon a miss
U(S, e)  Update each cache state in set S upon access to element e
\(\textit{rd}(e, t)\)  Reuse distance of element e in trace t, upper-bound on the number of evictions since the last access to e
\(\textit{frd}(e, t)\)  Forward reuse distance of element e in trace t, upper-bound on the number of evictions before the next access to e
\(\textit{con}(e, t)\)  Cache contention for element e in trace t, bound on the number of blocks contending for cache space since the last access to e
\(\hat{P}(e^{\textit{hit}}_i)\)  Lower-bound on the probability of access \(e_i\) to hit in cache
\(\hat{\xi _{i}}\)  Upper-bound on the execution time probability of element \(e_i\), expressed as a probability mass function
\(\hat{\mathcal {D}}(t)\)  Upper-bound on the execution time distribution of trace t
\(\mathcal {D}(t,s)\)  Execution time distribution of trace t starting from cache state s
\(\mathcal {D}(t,S)\)  Execution time distribution of trace t starting from possible cache states S
\(\mathcal {D} \otimes \mathcal {D}'\)  Convolution of distributions \(\mathcal {D}\) and \(\mathcal {D}'\)
\(\mathcal {D} \odot \mathcal {D}'\)  Least upper-bound of distributions \(\mathcal {D}\) and \(\mathcal {D}'\)
\(\mathcal {D} \le \mathcal {D}'\)  Distribution \(\mathcal {D}'\) upper-bounds \(\mathcal {D}\), iff \(\forall x, P(\mathcal {D} \ge x) \le P(\mathcal {D}' \ge x)\)
\(G = (V, L, v_s, v_e)\)  Control flow graph G capturing possible paths in a program, including: V, the set of nodes in the program, each corresponding to an accessed element; L, the set of edges between nodes; \(v_s \in V\), the start node in the program; \(v_e \in V\), the end node in the program
\(\pi = [v_1,\ldots ,v_k]\)  Path from node \(v_1\) to \(v_k\), valid sequence of nodes in a CFG
\(v_i \rightarrow ^* v_j\)  Set of paths from \(v_i\) to \(v_j\)
\(\textit{dom}(v_n)\)  Dominators of node \(v_n\), nodes guaranteed to be traversed before \(v_n\) from the CFG entry \(v_s\)
\(\textit{post-dom}(v_n)\)  Post-dominators of node \(v_n\), nodes guaranteed to be traversed after \(v_n\) to the CFG exit \(v_e\)
\(\varPi (V)\)  All paths with nodes included exclusively in set of vertices V
\(\varPi (G)\)  All paths from the start to the end of CFG G
\(\hat{\mathcal {D}}(\pi )\)  Upper-bound on the execution time distribution of path \(\pi \)
\(\hat{\mathcal {D}}(G)\)  pWCET of G, upper-bound on the execution time of its paths
\(\textit{rd}^G(v)\)  Maximum reuse distance of node v across all paths in G leading to v
\(\textit{frd}^G(v)\)  Maximum forward reuse distance of node v across all paths in G leading to v
\(\textit{con}^G(v)\)  Maximum contention of node v across all paths in G leading to v
\(s \sqsubseteq S \)  Cache state s holds less pessimistic information than the set of cache states S
\(S \sqsubseteq S'\)  The set of cache states S holds less pessimistic information than states in \(S'\)
\(S \sqcup S'\)  Upper-bound on cache states S and \(S'\), more pessimistic than both S and \(S'\)
\(C \le _{rank}C '\)  Ranking of cache contents C, used for heuristic comparison of contents based on their expected contribution to execution time distribution
Flush(S)  Empty the contents of all cache states in S
3.1 Cache model
We assume a single level, private, N-way fully-associative cache with an evict-on-miss random replacement policy. On an access, should the requested memory block be absent from the cache then the contents of a randomly selected cache line are evicted. The requested memory block is then loaded into the selected location. Given that there are N ways, the probability of any given cache line being selected by the replacement policy is \(\frac{1}{N}\). We assume a fixed upper-bound on the hit and miss latencies, denoted by \(\mathcal {H}\) and \(\mathcal {M}\) respectively, such that \(\mathcal {H} < \mathcal {M}\). (We note that the restriction to a fully-associative cache can be easily lifted for a set-associative cache through the analysis of each cache set as an independent fully-associative cache.)
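A small worked consequence of this model may help fix intuition. It is our own illustration derived from the stated replacement rule, not one of the hit probability bounds from the cited work. Suppose a block is in the cache and the next k accesses all miss (for instance because they target blocks never accessed before). Each miss selects the line holding the block with probability \(\frac{1}{N}\), independently of earlier selections, so the block is still cached afterwards with probability:

```latex
\[
  P(\text{block survives } k \text{ misses})
    = \left(1 - \frac{1}{N}\right)^{k}
    = \left(\frac{N-1}{N}\right)^{k},
  \qquad \text{e.g. } N = 4,\ k = 3:\ \left(\tfrac{3}{4}\right)^{3} = \tfrac{27}{64} \approx 0.42 .
\]
```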
3.2 Collecting semantics
We now recap the collecting semantics introduced by Altmeyer and Davis (2014) as a more precise but more complex alternative to the contention-based method of computing pWCET estimates. This approach performs exhaustive cache state enumeration for a selection of relevant accesses, hence providing tight analysis results for those accesses. To prevent state explosion, at each point in the program no more than R memory blocks are relevant at the same time. The relevant accesses are ones heuristically identified as benefiting the most from a precise analysis.
A trace t is defined as an ordered sequence \([e_1,\ldots ,e_n]\) of n accesses to memory blocks, such that \(e_i = e_j\) if accesses \(e_i\) and \(e_j\) target the same memory block. If access \(e_i\) is relevant, the block it accesses will be considered relevant until the next non-relevant access to the same block. The precise approach is only applied for relevant accesses while the contention-based method outlined in Sect. 2.2.1 is used for the others, identified as \(\bot \) in the trace of relevant blocks. The set of elements in a trace becomes \(\mathbb {E}^{\bot } = \mathbb {E} \cup \{\bot \}\).
The abstract domain of the analysis is a set of cache states. A cache state is a triplet \(CS = (C,P,\mathcal {D})\) with cache contents C, a corresponding probability \(P \in \mathbb {R}, 0 < P \le 1\), and a miss distribution \(\mathcal {D}:\mathbb {N} \rightarrow \mathbb {R}\) when the cache is in state C. C is a set of at most N memory blocks picked from \(\mathbb {E}\). A cache state which holds fewer than N memory blocks represents partial knowledge about the cache contents without any distinction between empty lines or unknown contents.^{1} The set of all cache states is denoted by \(\mathbb {CS}\). Miss distribution \(\mathcal {D}\) captures, for each possible number of misses n, the probability that n misses occurred from the beginning of the program up to the current point in the program. The method computes all possible behaviours of the random cache with the associated probabilities. It is thus correct by construction as it simply enumerates all states exhaustively.
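The enumeration described above can be sketched in a few lines. The sketch below is a simplification made for illustration and is not the authors' implementation: every access is treated as relevant (no \(\bot \) elements), the cache starts empty, and each state is keyed by the pair (contents, number of misses) rather than storing a separate distribution per contents. The names `update` and `miss_distribution` are hypothetical.

```python
from fractions import Fraction
from collections import defaultdict

def update(states, block, ways):
    """Apply one access to every (contents, misses) state."""
    result = defaultdict(Fraction)
    for (contents, misses), prob in states.items():
        if block in contents:
            result[(contents, misses)] += prob  # hit: state unchanged
            continue
        # Miss: one of `ways` lines is selected uniformly at random.
        for victim in contents:
            new_contents = (contents - {victim}) | {block}
            result[(new_contents, misses + 1)] += prob / ways
        free = ways - len(contents)  # lines holding no tracked block
        if free:
            result[(contents | {block}, misses + 1)] += prob * Fraction(free, ways)
    return dict(result)

def miss_distribution(trace, ways):
    """Exact distribution of the number of misses from an empty cache."""
    states = {(frozenset(), 0): Fraction(1)}
    for block in trace:
        states = update(states, block, ways)
    dist = defaultdict(Fraction)
    for (_, misses), prob in states.items():
        dist[misses] += prob
    return dict(dist)

dist = miss_distribution("abab", ways=2)
for misses in sorted(dist):
    print(misses, dist[misses])
```

For the trace a, b, a, b on a two-way cache this yields 2 misses with probability 1/2, and 3 or 4 misses with probability 1/4 each. States with identical keys are merged by summing their probabilities, which plays the role of the weighted merge on states with identical contents.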
3.2.1 Non-relevant blocks analysis
One possible naive approach for non-relevant blocks would be to classify them as misses in the cache and add the resulting latency to the previously computed distributions. The collecting approach proposed by Altmeyer and Davis (2014) relies on the application of the contention methods to estimate the behaviour of the non-relevant blocks in a trace. Each access in a trace has a probability of being a cache hit \(P(e_i^{\textit{hit}})\), and of being a cache miss \(P(e_i^{\textit{miss}})=1-P(e_i^{\textit{hit}})\). These methods rely on different metrics to lower-bound the hit probability of each access such that the derived bound can be soundly convolved.
Note that this definition of contention is an improvement on the one proposed in earlier work. Instead of accounting for each access independently, we account for their accessed blocks instead. The reasoning behind this optimisation is that if an accessed block hits more than once, it does not occupy additional lines. In the previous example, b is only accounted for once in the contention of \(a^2\) and \(c^3\). The subtle difference lies in (17) where the blocks \(e_j\) are accounted for instead of each access j individually (\(e_i = e_j\) if they access the same block).
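To show how per-access hit probability bounds are turned into an execution time distribution, the following sketch builds a probability mass function for each access and convolves them. The latencies and the hit probability bounds are hypothetical values picked for the illustration; how such bounds are derived (reuse distance, contention) is not reproduced here.

```python
from fractions import Fraction
from collections import defaultdict

def access_pmf(p_hit, hit_latency, miss_latency):
    """PMF of one access given a lower bound on its hit probability."""
    outcomes = ((hit_latency, p_hit), (miss_latency, 1 - p_hit))
    return {latency: p for latency, p in outcomes if p > 0}

def convolve(d1, d2):
    """Distribution of the sum of two independent execution times."""
    out = defaultdict(Fraction)
    for t1, p1 in d1.items():
        for t2, p2 in d2.items():
            out[t1 + t2] += p1 * p2
    return dict(out)

H, M = 1, 10  # hypothetical hit and miss latencies, in cycles
bounds = [Fraction(0), Fraction(3, 4), Fraction(1, 2)]  # hypothetical bounds

dist = {0: Fraction(1)}  # empty trace: zero time with probability 1
for p_hit in bounds:
    dist = convolve(dist, access_pmf(p_hit, H, M))

for time in sorted(dist):
    print(time, dist[time])
```

Because each bound under-estimates the hit probability, the resulting distribution places at least as much weight on long execution times as the true one, which is what makes it usable as a pWCET estimate.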
3.3 Discussion: relevance of the model
The SPTA techniques described apply whether the contents of the memory block are instruction(s), data or both. While address computation (Huynh et al. 2011) may not be able to pinpoint the exact target of an access, e.g. for data-dependent requests, relational analysis (Hahn and Grund 2012), introduced in the context of deterministic systems, can be used to identify accesses which map to the same or different sets, and access the same or different block. Two accesses which obey the same block relation can then be replaced by accesses to the same unique element, hence improving the precision of the analysis.
The methods assume that there are no inter-task cache conflicts due to preemption, i.e. a run-to-completion semantics with non-preemptable program execution. Concurrent cache accesses are also precluded, i.e. we assume a private cache or appropriate isolation (Chiou et al. 2000).
In practice, detailed analysis could potentially distinguish between different latencies for each access, beyond \(\mathcal {M}\) and \(\mathcal {H}\), but such precise estimation of the miss latency requires additional analysis steps, e.g. analysis of the main memory (Bourgade et al. 2008). Further, to reduce the pessimism inherent in using a simple bound, particularly for the miss latency, events such as memory refresh can be accounted for as part of higher level schedulability analyses (Atanassov and Puschner 2001; Bhat and Mueller 2011).
4 Comparing cache contents
The execution time distribution of a trace in our model depends solely on the behaviour of the cache. The contribution of a cache state to the execution time of a trace thus solely depends on its initial contents. The characterisation of the relation between the initial contents of different caches allows for a comparison of their temporal contribution to the same trace. This section introduces properties and conditions that allow this comparison. They are used in later techniques to improve the selection of cache contents on path convergence, and identify paths with the worst impact on execution time.
An N-tuple represents the concrete contents of an N-way cache, such that each element corresponds to the block held by a single line. The symbol \(\_\) is used to denote an empty line. For each such concrete cache s, there is a corresponding abstract cache contents C which holds the exact same set of blocks. C might also capture uncertainty regarding the contents of some lines.
Given cache state \(s = \langle l_1,\ldots ,l_N \rangle \),^{3} \(s[l_i=b]\) represents the replacement of memory block or line \(l_i\) in cache by memory block b. Note that b can only be present once in the cache, \(b \in s \Rightarrow s[l_i = b] = s\). \(s[l_i]\) is a shorthand for \(s[l_i=\_]\) and identifies the eviction of memory block \(l_i\) from the cache. \(s[l_i=b][l_j=e]\) denotes a sequence of replacements where b first replaces \(l_i\) in s, then e replaces \(l_j\). Two cache states s and \(s'\), although not strictly identical, may exhibit the same behaviour if they hold the exact same contents, e.g. \(\langle a,\_\rangle = \langle \_,a \rangle \) are represented using the same abstract contents \(\{a\}\). Under the evict-on-miss random replacement policy, there is no correlation between the physical and logical position of a block with respect to the eviction policy.
Theorem 1
The eviction of a block from any input cache state s cannot decrease the execution time distribution of any trace t, \(\mathcal {D}(t,s) \le \mathcal {D}(t,s[e])\).
Proof
See Appendix.\(\square \)
Corollary 1
In the context of evict-on-miss randomised caches, for any trace, the empty state is the worst initial state over any other input cache state s, \(\mathcal {D}(t,s) \le \mathcal {D}(t,\emptyset )\).
The eviction of a block might trigger additional misses, resulting in a distribution that is no less than the one where the cache contents are left untouched. This provides evidence that the assumption, upon a non-relevant access, that a block in cache is evicted, as per the update function in (3, 4, 5), is sound. Similarly, the replacement of a block in the cache might trigger additional misses but might also result in additional hits instead upon reuse of the replacing block. The impact of such a behaviour is however bounded.
Theorem 2
The replacement of a random block in cache triggers at most one additional hit.
Proof
See Appendix.\(\square \)
The block selected for eviction impacts the likelihood of those additional latencies suffered during the execution of the subsequent trace. Intuitively, the closer the evicted block is to reuse, the worse the impact of the eviction. We use the forward reuse distance of blocks at the beginning of trace t, \(\textit{frd}(b,t)\) as defined in (14), to identify the blocks which are closer to reuse than others.
Theorem 3
The replacement of a block in input cache state s by one which is reused later in trace t cannot result in a decreased execution time distribution: \(\textit{frd}(b,t) \le \textit{frd}(e,t) \le \infty \wedge b \in s \wedge e \notin s \Rightarrow \mathcal {D}(t,s) \le \mathcal {D}(t,s[b=e]) \)
Proof
See Appendix.\(\square \)
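To make the role of the forward reuse distance concrete, the sketch below orders blocks by how soon they are reused. It is a sketch under a simplifying assumption: definition (14) is not reproduced in this section, so `frd` is taken here to be the number of accesses in the trace that precede the next access to the block, and infinity if the block is never accessed. The function names are hypothetical.

```python
import math


def frd(block, trace):
    """Simplified forward reuse distance of `block` at the start of `trace`.

    Assumption: the distance is the number of accesses preceding the next
    access to `block`, or infinity if `block` is never accessed again.
    """
    for position, accessed in enumerate(trace):
        if accessed == block:
            return position
    return math.inf


def closest_to_reuse(blocks, trace):
    """Order blocks from the soonest reused to the latest reused."""
    return sorted(blocks, key=lambda b: frd(b, trace))


trace = ["c", "a", "b", "a"]
order = closest_to_reuse(["d", "b", "a"], trace)
print(order)
```

Under Theorem 3, replacing a block early in this order by one later in the order cannot decrease the execution time distribution of the trace.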
5 Application of SPTA to multipath programs
In this section, we improve upon the state-of-the-art SPTA techniques for traces (Altmeyer and Davis 2014) recapitulated in Sect. 2 and present methods for multipath programs, that is, complete control-flow graphs. A naive approach would be to compute all possible traces \(\mathcal {T}\) of a task, analyse each independently and combine their distributions. However, there are two significant problems with such an approach.
Our analysis on control-flow graphs resolves these problems. It relies on the collecting and the contention approaches for relevant and non-relevant blocks respectively, as per the cache collecting approach on traces given by Altmeyer and Davis (2014). First, the loops in the control-flow graph are unrolled. This allows the following steps, namely the computation of cache contention, the identification of relevant blocks and the cache collection, to be performed as simple forward traversals of the control-flow graph. Approximation of the possible incoming states on path convergence keeps the analysis tractable. Finally, the contention and collecting distributions are combined using convolution.
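The final combination step can be illustrated with the standard convolution of two independent discrete distributions. This is a minimal sketch: the operator \(\otimes \) of the paper is defined in (20), which is not reproduced here, and the latencies (1 for a hit, 10 for a miss) as well as the two input distributions are made-up values for illustration only.

```python
from collections import defaultdict
from fractions import Fraction


def convolve(d1, d2):
    """Distribution of the sum of two independent discrete variables.

    Each distribution maps an execution time to its probability.
    """
    out = defaultdict(Fraction)
    for t1, p1 in d1.items():
        for t2, p2 in d2.items():
            out[t1 + t2] += p1 * p2
    return dict(out)


# Hypothetical per-access distributions (hit latency 1, miss latency 10).
collecting = {1: Fraction(3, 4), 10: Fraction(1, 4)}
contention = {1: Fraction(1, 2), 10: Fraction(1, 2)}
combined = convolve(collecting, contention)
print(combined)
```

The result carries one entry per reachable total latency, and its probabilities still sum to one.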
5.1 Program representation
We say that a node \(v_d\) dominates \(v_n\) in the control-flow graph G if every path from the start node \(v_s\) to \(v_n\) goes through \(v_d\), \(v_s \rightarrow ^* v_n = v_s \rightarrow ^* v_d \rightarrow ^* v_n\), where \(v_s \rightarrow ^* v_d \rightarrow ^* v_n\) is the set of paths from \(v_s\) to \(v_n\) through \(v_d\). Similarly, a node \(v_p\) post-dominates \(v_n\) if every path from \(v_n\) to the end node \(v_e\) goes through \(v_p\), \(v_n \rightarrow ^* v_e = v_n \rightarrow ^* v_p \rightarrow ^* v_e\). We refer to the set of dominators and post-dominators of node \(v_n\) as \(\textit{dom}(v_n)\) and \(\textit{post-dom}(v_n)\) respectively.
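Dominator sets can be computed with the classical iterative data-flow formulation. The sketch below is one possible way to do so, not the implementation used in the paper; it assumes that the graph is given as a successor map listing every node as a key, that every node is reachable from the start node, and that a node is counted among its own dominators. Post-dominators are obtained by running the same computation on the reversed graph from the end node.

```python
def dominators(succ, start):
    """Dominator sets of all nodes, by iteration to a fixpoint."""
    nodes = set(succ)
    pred = {v: set() for v in nodes}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)
    # Start from the full set everywhere and refine.
    dom = {v: set(nodes) for v in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for v in nodes - {start}:
            incoming = [dom[p] for p in pred[v]]
            common = set.intersection(*incoming) if incoming else set()
            new = {v} | common
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom


def reverse(succ):
    """Successor map of the graph with all edges reversed."""
    rev = {v: [] for v in succ}
    for u, targets in succ.items():
        for v in targets:
            rev[v].append(u)
    return rev


# Diamond-shaped example: s branches to a or b, both rejoin at n, then e.
cfg = {"s": ["a", "b"], "a": ["n"], "b": ["n"], "n": ["e"], "e": []}
dom = dominators(cfg, "s")
post_dom = dominators(reverse(cfg), "e")
print(dom["n"], post_dom["s"])
```

In the example, neither branch node dominates the join node n, while n post-dominates the start node.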
We assume that the program always terminates. Bounded recursion and loop iterations are requirements to ensure this termination property of the analysed application. The additional restrictions described below are for the most part tied to the WCET analysis framework (Wilhelm et al. 2008) and not exclusive to the new method. These are reasonable assumptions for the software in critical realtime systems.
Calls are also subject to a small set of restrictions to guarantee the termination of the program. Recursion is assumed to be bounded, that is cycles or repetitions in the call graph of the analysed application must have a maximum number of iterations, similarly for loops in the control flow. Function pointers can be represented as multiple targets attached to a single call. Here, the set of target functions must be exact or an overestimate of the actual ones, so as to avoid unsound estimates which do not take all valid paths into account.
5.2 Complete loop unrolling
In the first analysis step, we conceptually transform the control-flow graph into a directed acyclic graph by loop unrolling and function inlining (Muchnick 1997). In contrast to the naive approach of enumerating all possible traces, analysis through complete loop unrolling has linear rather than exponential complexity in the number of loop iterations.
Loop unrolling and function inlining are well-known techniques to improve the precision of data-flow analyses. A complete physical unrolling that removes all back-edges significantly increases the size of the control-flow graph. A virtual unrolling and inlining is instead performed during analysis such that calls and iterations are processed as required by the control flow. The analysis then distinguishes the different call and iteration contexts of a vertex. In either case, the size of the graph explored during analysis and its complexity scale with the number of accesses in the program under consideration.
Unrolling simplifies the analysis and significantly improves the precision. As opposed to state-of-the-art analyses for deterministic replacement policies (Alt et al. 1996), the analysis of random caches through cache state enumeration does not rely on the computation of a fixpoint. The abstract domain for the analysis is by nature growing with every access since it includes the estimated distribution of misses. Successive iterations increase the probability of blocks in the loop’s working set being in the cache, and in turn increase the likelihood of hits in the next iteration. The exhaustive analysis, if not supplemented by other methods, must process all accesses in the program.
We assume in the following that unrolling is performed on all analysed programs. Section 7.4.2 discusses preliminary work to bypass this restriction. The analysis of large loops, with many predicted iterations, can be broken down into the analysis of a single iteration or groups thereof, provided a sound upper-bound of the input state is used. The contributions of different segments are then combined to compute that of the complete loop or program. Such an upper-bound input can be derived, for example, using cache state compression (Griffin et al. 2014a) to remove low value information. The definition of techniques to exploit the resulting trade-off between precision and analysis complexity is left as future work.
5.3 Reuse distance/cache contention on CFG
We then traverse the unrolled control-flow graph in reverse postorder, compute the distributions with the contention-based approach, and use the maximum distribution on path convergence, with the maximum operator \(\odot \) as the join operator.
5.4 Selection of relevant blocks
The selection of relevant blocks in Altmeyer and Davis (2014) also needs to be modified to accommodate a control-flow graph. Cache state enumeration is only performed for relevant accesses, ensuring more precise analysis results for the selected accesses. Earlier work (Altmeyer and Davis 2014) relied on an absolute set of R relevant blocks for the whole trace. Instead, we restrict ourselves to at most R relevant blocks at any point in the program. Given a position in the control flow, the heuristic tracks the R blocks with the shortest lifespan, i.e. the shortest distance between their last and next access. Such accesses are among the most likely to be kept in the cache and benefit from a precise estimate of their hit probability through state enumeration. Note that this heuristic relies on a lower bound on the lifespan of blocks instead of an upper bound.
5.5 Approximation of cache states
We assume no information about the probability of taking one path or another, hence the join operator must combine cache states in such a way that the resulting state is an overapproximation of all incoming paths, i.e. it contains the same or degraded information. To capture this property, we introduce the partial ordering \(\sqsubseteq \) between a cache state and a set thereof such that \(s \sqsubseteq S_b\) implies that \(S_b\) holds more pessimistic information than s, resulting in more pessimistic timing estimates. We overload this operator to relate sets of cache states where \(S_a \sqsubseteq S_b\) implies that \(S_b\) holds more pessimistic information than \(S_a\). More formally, the \(\sqsubseteq \) notation (Peleska and Löding 2008) identifies \(S_b\) as an upperbound of \(S_a\) in \(2^{\mathbb {CS}}\).
Consider a simple cache state \(s = (\{a,b\},0.5,\mathcal {D})\). Intuitively, the information represented by \(s_{a} = (\{a\},0.5,\mathcal {D})\) is more pessimistic than that captured by s, \(s\sqsubseteq {s_a}\). Conversely, \(s_c = (\{a,c\},0.5,\mathcal {D})\) holds less pessimistic information regarding c, so \(s \not \sqsubseteq s_c\). The set \(S = \{ (\{a\},0.25,\mathcal {D}), (\{b\},0.25,\mathcal {D}) \}\) also approximates s, \(s \sqsubseteq S\); the knowledge that a and b are both present in the cache (s) is reduced to guarantees only about the presence of either a or b in S. As a consequence, the sequence of accesses abab will trigger more misses starting from states in S, than from state s. Assuming \(\mathcal {D} < \mathcal {D}'\), then \(s' = (\{a,b\},0.5,\mathcal {D}')\) holds more pessimistic information than s, \(s \sqsubseteq s'\).
The definition of over-approximations and their contribution to the execution time distribution of a trace relies on the merge \(\uplus \) and convolution \(\otimes \) operators defined respectively in (6)-(8) and (20). Both offer properties used in the evaluation of the contribution of their operands. The convolution operator preserves the relative ordering between its inputs, and the merge operation adds the contribution of its operands.
Lemma 1
Proof
See Appendix.\(\square \)
Lemma 2
Proof
See Appendix.\(\square \)
Theorem 4
Proof
The \(\sqsubseteq \) relation defines a partial ordering between two sets of cache states \(S_a\) and \(S_b\). Namely, \(S_a \sqsubseteq S_b\) implies that \(S_b\) holds more pessimistic information than \(S_a\). In other words, the execution of any trace from \(S_b\) results in a larger execution time distribution than the execution of the same trace from \(S_a\). This provides sufficient ground for the definition of a sound join operation, one that upperbounds the upcoming contribution of cache states coming from different paths.
5.6 Join operation for cache collecting
We traverse the (directed acyclic) graph in reverse postorder and compute the set of cache states at each program point. The join operator \(\bigsqcup \) describes the combination of two dataflow states from two different sub paths.
| \(S_a\) | \(S_b\) |
| --- | --- |
| \((\{a,b,c\},24/64,\mathcal {D})\) | \((\{a,b,c,d\},6/64,\mathcal {D})\) |
| | \((\{a,b,d\},12/64,\mathcal {D})\) |
| | \((\{a,c,d\},18/64,\mathcal {D})\) |
| | \((\{b,c,d\},6/64,\mathcal {D})\) |
| \((\{a,c\},12/64,\mathcal {D})\) | \((\{a,d\},12/64,\mathcal {D})\) |
| \((\{b,c\},24/64,\mathcal {D})\) | \((\{b,d\},3/64,\mathcal {D})\) |
| | \((\{c,d\},6/64,\mathcal {D})\) |
| \((\{c\},4/64,\mathcal {D})\) | \((\{d\},1/64,\mathcal {D})\) |
| \(S'_a\) | \(S'_b\) |
| --- | --- |
| \((\{a,b,c\},24/64,\mathcal {D})\) | \((\{a,b,c\},6/64,\mathcal {D})\) |
| | \((\{a,b\},12/64,\mathcal {D})\) |
| \((\{a,c\},12/64,\mathcal {D})\) | \((\{a,c\},18/64,\mathcal {D})\) |
| \((\{b,c\},24/64,\mathcal {D})\) | \((\{b,c\},6/64,\mathcal {D})\) |
| | \((\{a\},12/64,\mathcal {D})\) |
| | \((\{b\},3/64,\mathcal {D})\) |
| \((\{c\},4/64,\mathcal {D})\) | \((\{c\},6/64,\mathcal {D})\) |
| | \((\{\},1/64,\mathcal {D})\) |
The set of common cache states H, with their minimal, guaranteed probability, is defined as \(H = \{(\{a,b,c\},6/64,\mathcal {D}), (\{a,c\},12/64,\mathcal {D}), (\{b,c\},6/64,\mathcal {D}), (\{c\},4/64,\mathcal {D})\}\).
| \(\hat{C}_a\) | \(\hat{C}_b\) |
| --- | --- |
| \((\{\},18/64,\mathcal {D})\) | \((\{\},12/64,\mathcal {D})\) |
| | \((\{\},6/64,\mathcal {D})\) |
| \((\{\},18/64,\mathcal {D})\) | \((\{\},12/64,\mathcal {D})\) |
| | \((\{\},3/64,\mathcal {D})\) |
| | \((\{\},2/64,\mathcal {D})\) |
| | \((\{\},1/64,\mathcal {D})\) |
| \(S_a \bigsqcup S_b\) |
| --- |
| \((\{a,b,c\},6/64,\mathcal {D})\) |
| \((\{a,c\},12/64,\mathcal {D})\) |
| \((\{b,c\},6/64,\mathcal {D})\) |
| \((\{c\},4/64,\mathcal {D})\) |
| \((\{\},36/64,\mathcal {D})\) |
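The figures of this example can be reproduced with a few lines of code. The sketch below is a simplified reading of the join, under stated assumptions: it starts from the sets \(S'_a\) and \(S'_b\) as given, keeps each common state with its minimum probability, and moves all remaining probability to the empty state. The execution time distributions are ignored since they are all equal to \(\mathcal {D}\) in the example, and the function names are hypothetical.

```python
from fractions import Fraction


def basic_join(states_a, states_b):
    """Join two sets of cache states, given as {contents: probability}.

    Common states keep their minimum, guaranteed probability; whatever
    probability is left is allocated to the empty cache state.
    """
    common = states_a.keys() & states_b.keys()
    joined = {c: min(states_a[c], states_b[c]) for c in common}
    remainder = sum(states_a.values()) - sum(joined.values())
    if remainder:
        empty = frozenset()
        joined[empty] = joined.get(empty, 0) + remainder
    return joined


def state_set(entries):
    """Build {frozenset of blocks: probability in 64ths} from short entries."""
    return {frozenset(blocks): Fraction(n, 64) for blocks, n in entries}


s_a = state_set([("abc", 24), ("ac", 12), ("bc", 24), ("c", 4)])
s_b = state_set([("abc", 6), ("ab", 12), ("ac", 18), ("bc", 6),
                 ("a", 12), ("b", 3), ("c", 6), ("", 1)])
joined = basic_join(s_a, s_b)
for contents, p in joined.items():
    print(sorted(contents), p)
```

The common states account for 28/64 of the probability mass, which leaves 36/64 for the empty state, as in the last table.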
6 Improving on the join operation
The basic join operation introduced in the previous section focuses on the conservation of common cache states. Others, because their contents differ or their occurrence is bounded on alternative paths, are merged into the empty state. This results in a safe estimate of the information gathered from different paths. Yet, the method exhibits some limitations with regard to the information it conserves; the probability of occurrence of some blocks in cache, which we refer to as their capacity, is lost during the join process. We introduce a join function based on conserving this additional capacity of cache states. The function degrades the information about the presence of blocks in a cache to allocate, in a sound manner, its occurrence probability to a more pessimistic state. We first present, in Sect. 6.1, a ranking heuristic used to identify the cache states to which capacity should be allocated in priority. The improved capacity-conserving join is itself presented in Sect. 6.2.
6.1 Ranking cache states
The ordering \(\sqsubseteq \) introduced in Sect. 5.5 allows for the comparison of some cache states to each other irrespective of the upcoming trace of memory accesses. It is however a partial ordering and only compares two states with similar or included cache contents. As illustrated in Theorem 3, ordering the contribution of cache contents which do not include each other requires the consideration of future accesses as captured by their forward reuse distance. The definition of an optimal join operation, through the optimal allocation of capacity to cache states, should ideally minimise the execution time on the worst-case path. However, multiple, incomparable paths would need to be considered, of which the worst case is unknown. We instead rely on a heuristic to prioritise the most beneficial cache states through a ranking system.
Our aim is to improve precision in the pWCET estimate, hence the heuristic aims to preserve capacity for cache blocks that will upon their next access result in a high probability of a cache hit. This happens, at least on some forward paths, for blocks with a small forward reuse distance. Preserving capacity for blocks with a larger forward reuse distance would likely result in a smaller probability of a cache hit and a more pessimistic overall pWCET estimate. (Note the ranking is only a heuristic and we do not claim that it makes optimal choices.)
6.2 Capacity conserving join
The join operator introduced earlier may result in lost capacity if the contents of states on alternative paths do not exactly match. Consider states \(\{a,b,e\}\) and \(\{b,c,e\}\) respectively in \(S'_a\) and \(S'_b\) along with others. They both include states \(\{b,e\}\), \(\{b\}\), \(\{e\}\) and \(\emptyset \) and their capacity could be allocated to whichever is the highest ranking one. \(\{a,e\}\) on the other hand is a valid approximation of \(\{a,b,e\}\) in which it is included, but does not approximate \(\{b,c,e\}\).
| \(S'_a\) | \(S'_b\) |
| --- | --- |
| \((\{a,b,c\},24/64,\mathcal {D})\) | \((\{a,b,c\},6/64,\mathcal {D})\) |
| \((\{a,c\},12/64,\mathcal {D})\) | \((\{a,c\},18/64,\mathcal {D})\) |
| | \((\{a,b\},12/64,\mathcal {D})\) |
| \((\{b,c\},24/64,\mathcal {D})\) | \((\{b,c\},6/64,\mathcal {D})\) |
| | \((\{a\},12/64,\mathcal {D})\) |
| \((\{c\},4/64,\mathcal {D})\) | \((\{c\},6/64,\mathcal {D})\) |
| | \((\{b\},3/64,\mathcal {D})\) |
| | \((\{\},1/64,\mathcal {D})\) |
| \(\textit{contribution}_a\) | \(\textit{contribution}_b\) |
| --- | --- |
| \((\{a,b,c\},18/64,\mathcal {D})\) | |
| \((\{a,c\},12/64,\mathcal {D})\) | \((\{a,c\},18/64,\mathcal {D})\) |
The presence of both a and c, captured by state \(\{a,c\}\), can therefore be guaranteed with probability \(\frac{18}{64}\) on both paths. The capacity of states in \(S'_a\) and \(S'_b\) is decreased accordingly (lines 15–21). Capacity is first picked from the lowest ranking states, such that in our example \(\{a,b,c\}\in S_a\) still has a remaining capacity of \(\frac{12}{64}\): of the \(\frac{18}{64}\) it holds at this point, it only gives up \(\frac{6}{64}\), that is the \(\frac{18}{64}\) allocated to \(\{a,c\}\) minus the \(\frac{12}{64}\) contributed by the lower ranking \(\{a,c\}\) in \(S_a\).
During this step, the execution time distribution obtained through the combination of the contributors’ distributions is also computed in \(d_a\) (see line 18) and \(d_b\) respectively for \(S'_a\) and \(S'_b\). An upper-bound of \(d_a\) and \(d_b\) is used when computing the resulting distribution for the conserved state (line 24). After the second iteration of the algorithm, \(\textit{states}_2 = \textit{states}_1 \cup \{(\{a,c\},18/64,\mathcal {D})\}\).
| \(\textit{states} = S_a \sqcup ^{capa} S_b\) |
| --- |
| \((\{a,b,c\},6/64,\mathcal {D})\) |
| \((\{a,c\},18/64,\mathcal {D})\) |
| \((\{a,b\},12/64,\mathcal {D})\) |
| \((\{b,c\},6/64,\mathcal {D})\) |
| \((\{a\},0/64,\mathcal {D})\) |
| \((\{c\},6/64,\mathcal {D})\) |
| \((\{b\},3/64,\mathcal {D})\) |
| \((\{\},13/64,\mathcal {D})\) |
| \(\textit{states} = S_a \sqcup ^{capa} S_b\) |
| --- |
| \((\{a,b,c\},6/64,\mathcal {D})\) |
| \((\{a,c\},18/64,\mathcal {D})\) |
| \((\{a,b\},12/64,\mathcal {D})\) |
| \((\{b,c\},6/64,\mathcal {D})\) |
| \((\{c\},6/64,\mathcal {D})\) |
| \((\{b\},3/64,\mathcal {D})\) |
| \((\{\},13/64,\mathcal {D})\) |
| \(S_a \bigsqcup S_b\) |
| --- |
| \((\{a,b,c\},6/64,\mathcal {D})\) |
| \((\{a,c\},12/64,\mathcal {D})\) |
| \((\{b,c\},6/64,\mathcal {D})\) |
| \((\{c\},4/64,\mathcal {D})\) |
| \((\{\},36/64,\mathcal {D})\) |
The solution resulting from the application of \(\sqcup ^{capa}\) dominates that of the previously introduced join operation. Indeed, a state C can only accommodate soundly for itself or a state it includes. With the proposed ranking heuristic this corresponds to a lower ranking state which the algorithm explores after C itself. The capacity of C is first used for C in the algorithm. As a consequence, the capacity allocated to a state is at least its minimum capacity in \(S_a\) or \(S_b\), e.g. \(\frac{12}{64}\) for \(\{a,c\}\). This minimum is the capacity that was allocated to the state in the previous join implementation. Different ranking heuristics could potentially lose this dominance relation.
The capacity join further keeps the same timing information as the standard \(\sqcup \) operation. The combined distributions and their weights are the same, but attached as a result of the operation to different, less pessimistic cache states. The same fragment of distribution in the standard operation will account for fewer or the same amount of misses using the capacity join.
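The allocation of capacity in the running example can be replayed with the sketch below. It is an illustration under explicit assumptions rather than the algorithm of the paper: the ranking is passed in as a list (here taken to be the row order of the ranked \(S'_a\)/\(S'_b\) table), every input state is assumed to appear in that ranking, and the execution time distributions are left out since they are all equal to \(\mathcal {D}\) in the example. The function names are hypothetical.

```python
from fractions import Fraction


def capacity_join(states_a, states_b, ranking):
    """Allocate capacity to candidate states in decreasing rank order.

    states_*: {frozenset of blocks: probability}.
    ranking: list of candidate contents, highest ranked first.
    """
    rank = {contents: i for i, contents in enumerate(ranking)}
    left_a, left_b = dict(states_a), dict(states_b)
    result = {}
    for candidate in ranking:
        # A state can soundly contribute to itself or to a state it includes.
        contrib_a = [s for s in left_a if candidate <= s]
        contrib_b = [s for s in left_b if candidate <= s]
        granted = min(sum(left_a[s] for s in contrib_a),
                      sum(left_b[s] for s in contrib_b))
        result[candidate] = granted
        for left, contrib in ((left_a, contrib_a), (left_b, contrib_b)):
            todo = granted
            # Capacity is first picked from the lowest ranking contributors.
            for s in sorted(contrib, key=rank.get, reverse=True):
                taken = min(todo, left[s])
                left[s] -= taken
                todo -= taken
    return result


def state_set(entries):
    return {frozenset(blocks): Fraction(n, 64) for blocks, n in entries}


s_a = state_set([("abc", 24), ("ac", 12), ("bc", 24), ("c", 4)])
s_b = state_set([("abc", 6), ("ac", 18), ("ab", 12), ("bc", 6),
                 ("a", 12), ("c", 6), ("b", 3), ("", 1)])
ranking = [frozenset(b) for b in ("abc", "ac", "ab", "bc", "a", "c", "b", "")]
states = capacity_join(s_a, s_b, ranking)
for contents in ranking:
    print(sorted(contents), states[contents])
```

With these inputs the sketch grants 18/64 to \(\{a,c\}\) and leaves only 13/64 for the empty state, against 12/64 and 36/64 for the basic join.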
7 Worstcase path reduction
Approximations of the cache contention or the contents of abstract cache states occur on control flow convergence, when two paths in the control flow graph meet. This ensures the validity of the bounds computed by SPTA whatever the exercised path at runtime, while keeping the complexity of the analysis under control. The complete set of possible paths need not be made explicit; however, the loss of information that may occur on flow convergence decreases the tightness of the computed pWCET.
In most applications, there exists some redundancy among paths with regard to their contribution to the pWCET. If a path can be guaranteed to always perform worse than another (\(\mathcal {D}(\pi _b)\ge \mathcal {D}(\pi _a)\)), the contribution of the former to the pWCET dominates that of the latter, \(\mathcal {D}(\pi _b) = \mathcal {D}(\pi _b) \odot \mathcal {D}(\pi _a)\). In this case, the latter path can be removed from the set of paths considered by the analysis, hence reducing the complexity of the control flow, while preserving the soundness of the computed upper-bound.
In this section, we define the notion of inclusion between paths and prove that path inclusion is a subcase of path redundancy; the execution time distribution of an including path dominates that of any paths it includes. Based on this principle, we introduce program transformations to safely identify and remove from the controlflow paths that are included in others. This improves the precision of the analysis.
Worstcase execution path (WCEP) reduction includes a set of varied modifications: empty conditions removal, worstcase loop unrolling, and simple path elimination. They apply on the logical level, during analysis, and unlike path upperbounding approaches (Kosmidis et al. 2014) do not require modifications of the object or source code for pWCET computation.
7.1 Path inclusion
A path is said to include another if it contains at least the same sequence of ordered accesses, possibly interleaved with additional ones. As an example, consider paths \(\pi _a = [a,b,c,e]\) and \(\pi _b = [a,b,c,d,a,e]\) where \(\pi _a\) is included in \(\pi _b\). The former path can be split into subpaths \(\pi _S=[a,b,c]\) and \(\pi _E=[e]\), such that \(\pi _a = [\pi _S,\pi _E]\). \(\pi _b\) can then be expressed as the interleaving of \(\pi _S\) and \(\pi _E\) with \(\pi _V = [d,a]\), i.e. \(\pi _b = [\pi _S,\pi _V,\pi _E]\). Similarly, \(\pi _b\) includes [a, c, d, e], but not [b, a, c].
Definition 1
(Including path) Let \(\pi _a\) and \(\pi _b\) be two paths, such that \(\pi _a\) is the concatenation of two subpaths \(\pi _S\) and \(\pi _E\): \(\pi _a = [\pi _S ,\pi _E]\). The inclusion of \(\pi _a\) in \(\pi _b\), denoted \(\pi _a \unlhd \pi _b\), is recursively defined as either \(\pi _b=[\pi _S,\pi _V,\pi _E]\) or \(\pi _b = [\pi _S,\pi _V,\pi _E']\) where \(\pi _E \unlhd \pi _E'\) and \(\pi _E \ne \pi _E'\).
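Read informally, path inclusion amounts to a subsequence test: every access of the included path must appear in the including path, in the same order. The sketch below checks this on the examples given above; the function name is hypothetical and the equivalence with the recursive definition is taken here as a working assumption.

```python
def is_included(pi_a, pi_b):
    """True if pi_a appears in pi_b as an ordered, possibly interleaved,
    sequence of accesses."""
    remaining = iter(pi_b)
    # Each membership test consumes pi_b up to and including the match,
    # so later accesses of pi_a can only match later accesses of pi_b.
    return all(access in remaining for access in pi_a)


pi_b = ["a", "b", "c", "d", "a", "e"]
print(is_included(["a", "b", "c", "e"], pi_b))
print(is_included(["a", "c", "d", "e"], pi_b))
print(is_included(["b", "a", "c"], pi_b))
```

The test runs in time linear in the length of the including path.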
Theorem 5
The execution time distribution of a path \(\pi \) prefixed by an access to block b upperbounds that of path \(\pi \), \(\mathcal {D}(\pi ,s) + \mathcal {H} \le \mathcal {D}([[b],\pi ],s)\).
Proof
For the sake of readability, we omit the cache state s when comparing the execution time distributions of two paths in the following; two paths are always compared using the same input cache state, \(\mathcal {D}(\pi ) \le \mathcal {D}(\pi ') \Leftrightarrow \mathcal {D}(\pi , s) \le \mathcal {D}(\pi ', s)\).
Theorem 6
The execution time distribution of a path \(\pi _a\) prefixed by path \(\pi _s\) upperbounds that of path \(\pi _a\) alone, \(\forall \pi _s, \pi _a, \mathcal {D}(\pi _a) \le \mathcal {D}([\pi _s,\pi _a])\).
Proof
From Theorem 5, we know that \(\mathcal {D}(\pi _a) \le \mathcal {D}([[v_n],\pi _a])\) which can be extended to \(\mathcal {D}(\pi _a) \le \mathcal {D}([[v_1,v_2,\ldots ,v_n],\pi _a])\) since \(\mathcal {D}([[v_2,\ldots ,v_n],\pi _a]) \le \mathcal {D}([[v_1,v_2,\ldots ,v_n],\pi _a])\) and so on. The relation holds for prefixes of arbitrary lengths. \(\square \)
Theorem 7
(Included path ordering) If \(\pi _a\) is included in \(\pi _b\), then the execution time distribution of \(\pi _b\) is greater than or equal to that of \(\pi _a\), \(\pi _a \unlhd \pi _b \Rightarrow \mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\)
Proof
We prove this property by induction.
Base case: We need to prove that if \(\pi _a \unlhd \pi _b\) such that \(\pi _a = [\pi _S,\pi _E]\) and \(\pi _b = [\pi _S,\pi _V,\pi _E]\), then \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\). From Theorem 6, we know that \(\mathcal {D}(\pi _E) \le \mathcal {D}([\pi _V,\pi _E])\).
The execution of \(\pi _S\) cannot be impacted by accesses in either \(\pi _E\) or \(\pi _V\). It is therefore the same on both paths \(\pi _a\) and \(\pi _b\). As proved in Theorem 6, whatever cache state is left by the execution of \(\pi _S\), the execution time distribution of \([\pi _V,\pi _E]\) is greater than or equal to that of \(\pi _E\). Therefore, \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\).
Inductive step: Let us assume \(\pi _a = [\pi _S,\pi _E]\) and \(\pi _E'\) is such that \(\pi _E \unlhd \pi _E'\) and \(\mathcal {D}(\pi _E) \le \mathcal {D}(\pi _E')\). We need to prove that for \(\pi _b = [\pi _S,\pi _V,\pi _E']\), \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\). From Theorem 6, we know that \(\mathcal {D}([\pi _V,\pi _E']) \ge \mathcal {D}(\pi _E')\), and as a consequence \(\mathcal {D}([\pi _V,\pi _E']) \ge \mathcal {D}(\pi _E)\). Further, the execution time distribution of \(\pi _S\) is not impacted by accesses in either \(\pi _V\), \(\pi _E\), or \(\pi _E'\) and is the same in \(\pi _a\) and \(\pi _b\), hence \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\). \(\square \)
We now extend the notion of path inclusion to sets of paths. A set of paths \(\varPi \) is a pathincluded set of \(\varPi ^{\circ }\) if each path in \(\varPi \) is included in a corresponding path in \(\varPi ^{\circ }\), \(\varPi \unlhd \varPi ^{\circ } \Rightarrow \forall \pi \in \varPi , \exists \pi ^{\circ } \in \varPi ^{\circ }, \pi \unlhd \pi ^{\circ }\). As a consequence, for each path \(\pi \in \varPi \), there is a path in \(\varPi ^\circ \) the actual pWCET of which also upperbounds the execution time distribution of \(\pi \). The actual pWCET of \(\varPi ^{\circ }\) is thus an upperbound on the execution time distributions of all paths in \(\varPi \), \(\forall \pi \in \varPi , \mathcal {D}(\varPi ^{\circ }) \ge \mathcal {D}(\pi )\). As the estimated pWCET of a path \(\hat{\mathcal {D}}(\pi )\) is an upperbound on its execution time distribution, \(\mathcal {D}(\pi ') \le \hat{\mathcal {D}}(\pi ')\), it is sufficient to perform the pWCET analysis of a CFG G on a reduced set of paths which pathincludes the set \(\varPi (G)\).
7.2 Empty conditions removal
Simple conditional constructs may induce paths that are included in others. In particular, any path that goes through an empty branch or case is included in any alternative branch which triggers memory accesses. The edges in a CFG which represent such cases can be safely removed to reduce path indeterminism during pWCET analysis, improving the precision of the results.
7.3 Loop unrolling
Natural loop constructs are a source of path redundancy. In particular, paths which do not exercise the maximum number of iterations of a loop they traverse have an including counterpart. An iteration of loop \(l=(v_h,V_l)\) starts with a transition from its header \(v_h\) to any of its nodes \(v_n\in V_l\). Conversely, any iteration, with the exception of the last, ends with a transition back to the header \(v_h\), through a backedge. The set of paths \(\varPi _{\textit{iter}} = [\varPi (V_l{\setminus }\{v_h\}),[v_h]]\) captures the paths followed during a complete iteration through loop l.
A valid path which captures n iterations can be expressed as \([[v_h],\pi _1,\ldots ,\pi _{n-1},\pi _{\textit{last}}]\) with \(\forall i, 1 \le i < n, \pi _i \in \varPi _{\textit{iter}}\), and \(\pi _{\textit{last}}\) as the last iteration of the loop. \(\pi _{\textit{last}}\) is a path in \(\varPi (V_l{\setminus }\{v_h\})\) followed by a node outside the loop. We denote by \(\varPi _{n}\) the set of paths which iterate n times through the loop l. A path in \(\varPi _{n+1}\) can be expressed as \([[v_h],\pi _1,\ldots ,\pi _{n-1},\pi _{n},\pi _{\textit{last}}]\) with \(\pi _{n} \in \varPi _{\textit{iter}}\), i.e. each path in \(\varPi _n\) is included in a path of \(\varPi _{n+1}\). By extension, the set of paths \(\varPi _{\textit{max-iter}(l)}\) path-includes all other sets of paths which iterate over l at least once.
In our model, we only restrict the maximum number of iterations of a loop. Every iteration may be the last; there is no guarantee that a loop always goes through the same number of iterations when it is executed. The loop unrolling algorithm hence operates without knowledge of the exact number of iterations of the loop. Every unrolled iteration is connected to the successors of the loop. As per Theorem 7 and the inclusion property for consecutive loop iterations, it is sufficient for pWCET estimation to only consider paths where each loop, when executed, goes through its current maximum number of iterations. The unrolling of loop l assumes \(\textit{max-iter}(l,\textit{ctx})\) as the exact iteration count of loop l. In effect, when unrolling any iteration of loop l besides the last, edges from nodes in the loop to nodes outside l are discarded. Conversely, unrolling the last iteration implies conserving only the nodes and edges of l which lead to a loop exit.
The same principles hold for call inlining. Recursion is also a source of path redundancy. Recursive calls manifest as repetitions in the call stack of an application. Here, a single source node is attached to the CFG of each procedure, which identifies its start. The source node therefore behaves similarly to the head of a loop, and is a guaranteed entry to each call. The same logic applies to both natural loops and recursive calls. When performing virtual or physical inlining, the analysis forces recursion up to the defined bound.
7.4 Access renaming
Path inclusion relies on the verbatim sequence of accesses to detect redundancy between paths. Even the slightest dissimilarity between alternative sequences throws off the property. Some accesses are known to perform worse than others at a given point in time. Renaming an access in a sequence to a worse performing target one, i.e. changing the target of the access, can smooth the differences between paths such that the renamed path is included in an alternative path of its original counterpart. The renamed path then acts as an intermediate bound between the original one and the including alternative, hence providing an argument for the removal of the original path. We now introduce a set of conditions that ensure the dominance of the execution time distribution of a renamed path over its original counterpart. If all transformations from the original validate these properties, the renamed path dominates the original. The renamed path may further be included in an alternative path. The original is then known to be redundant with this alternative and can be omitted during analysis.
Let \(\pi =[v_1,v_2,\ldots ,v_{k-1},v_k]\) be a sequence of k accesses. \(\pi (e\rightarrow b)\) denotes the renaming of all accesses to memory block e to b in \(\pi \), \(\pi (e\rightarrow b) = [v_1',v_2',\ldots ,v_{k-1}',v_k']\) where \(\forall i\in [1,k], v_i' = v_i\) if \(v_i \ne e\) and \(v_i' = b\) otherwise. By definition, renaming e to b has no impact on \(\pi \) if it does not access e. \(\pi (e\rightarrow b)(c\rightarrow d)\) identifies a rename from e to b followed by a rename from c to d on the resulting sequence. Note that if no destination block is used as a source block, the order of the renames is irrelevant. For instance \(\pi (e\rightarrow b)(c\rightarrow d) = \pi (c\rightarrow d)(e\rightarrow b)\), but \(\pi (e\rightarrow b)(b\rightarrow c) \ne \pi (b\rightarrow c)(e\rightarrow b)\).
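The renaming operation and its sensitivity to ordering are easy to check on a small sequence. The following sketch uses a hypothetical `rename` helper and an arbitrary example path chosen for illustration.

```python
def rename(path, source, target):
    """Return the path with every access to `source` redirected to `target`."""
    return [target if access == source else access for access in path]


pi = ["e", "c", "b", "e"]

# Independent renames commute: no destination block is used as a source.
first = rename(rename(pi, "e", "b"), "c", "d")
second = rename(rename(pi, "c", "d"), "e", "b")

# Chained renames do not: b is the destination of one and source of the other.
chained = rename(rename(pi, "e", "b"), "b", "c")
swapped = rename(rename(pi, "b", "c"), "e", "b")

print(first, second)
print(chained, swapped)
```

In the chained case, applying \(e\rightarrow b\) first lets the second rename capture the former accesses to e as well, which explains the difference.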
No enclosure: There is no access to b over the renamed sequence \(\pi _V\), \(\forall v_i \in \pi _V, v_i \ne b\).
Prefix ordering: b is no more likely to be in the cache than e after \(\pi _S\) (before \(\pi _V\)). This occurs when the closest access to e before \(\pi _V\), that is the last access to e in \(\pi _S\), is posterior to the last access to b in \(\pi _S\), \(\textit{rd}(e,\pi _S) < \textit{rd}(b,\pi _S)\).
Suffix ordering: b is no more likely to trigger a hit than e if present in cache after \(\pi _V\) (before \(\pi _E\)). The first access to e after \(\pi _V\), i.e. in \(\pi _E\), is before the first access to b, \(\textit{frd}(e,\pi _E) < \textit{frd}(b,\pi _E)\).
Theorem 8

there is no access to b in \(\pi _V\);

the reuse distance of e before \(\pi _V\) is smaller than that of b at this point;

the forward reuse distance of e at the end of \(\pi _V\) is smaller than that of b at this point.
Proof
See Appendix.\(\square \)
7.4.1 Simple path elimination
Access renaming allows for a wide range of transformations between paths within a program. We aim at reducing the set of paths that need to be considered during the analysis of an application without increasing its pWCET. An ideal solution would consider each path individually. Each should then be matched against its larger alternatives to check for inclusion using rename operations. This approach is impractical due to the sheer number of paths and the complexity of the matching problem over large sequences of accesses.
The recursive IsRedundant method, outlined in Algorithm 3, focuses on asserting the redundancy of two subpaths of a CFG using renaming. The algorithm progresses access by access; each call to IsRedundant considers the first access in the renamed path \(\pi _v\) and possible matches in \(\pi _r\). It explores the following options: (i) match the address on the two paths (line 8), (ii) attempt renaming the access on path \(\pi _v\) to one on path \(\pi _r\) (line 12), or (iii) skip an access on the longer trace (on line 7, the operation removes the head of path \(\pi _r'\)). If it reaches the end of path \(\pi _v\), that path is identified as redundant with respect to \(\pi _r\); there is a sequence of renames which results in its inclusion in \(\pi _r\). Conversely, if there are not enough accesses left in \(\pi _r'\) to match the ones in \(\pi _v'\), the algorithm returns false. Hence, renames only occur on the shorter path, as it does not hold enough accesses to include the longer one.
7.4.2 Control flow graph segmentation
WCEP reduction methods aim to remove included paths whose contribution to the execution time distribution is no greater than that of some alternative worst-case paths. This reduces the number of accesses to be analysed and, in turn, the complexity of the approach. To reduce this complexity further, we present preliminary work towards the reduction of the analysed program segments through CFG partitioning (Ballabriga and Cassé 2008). This method was first explored by Pasdeloup (2014) through heuristics tailored for SPTA.
For a decomposition into consecutive single-entry single-exit (SESE) regions to be valid, the nodes that delimit the segments have to be executed on all paths in the CFG. Alternative paths stemming from the same branch must be part of the same region. Similarly, all nodes in a loop nest belong to the same region. Such nodes can be captured by the notion of postdominators: a node \(v_p\) postdominates \(v_n\) if every path from \(v_n\) to the end node \(v_e\) goes through \(v_p\). All valid candidate nodes have to be postdominators of the entry node \(v_s\).
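The candidate split points can be illustrated with a small sketch that lists the postdominators of the entry node. It relies on the definition directly: a node postdominates the entry if the end node becomes unreachable from the entry once that node is removed. The dictionary-based CFG encoding and the function names are assumptions for the example, and the additional constraints on branches and loop nests are not checked here.

```python
def reachable(cfg, start, goal, removed):
    """Depth-first search from start to goal that never enters `removed`."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for succ in cfg.get(node, ()):
            if succ != removed and succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return False


def entry_postdominators(cfg, entry, exit_node):
    """Nodes lying on every path from entry to exit_node."""
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    result = {entry, exit_node}
    for v in nodes - result:
        # v postdominates the entry if removing it disconnects the exit.
        if not reachable(cfg, entry, exit_node, removed=v):
            result.add(v)
    return result
```

On a diamond-shaped CFG `{"s": ["a", "b"], "a": ["j"], "b": ["j"], "j": ["e"]}`, only `s`, `j` and `e` qualify: the two branch alternatives `a` and `b` must stay inside the same region. Dedicated dominator algorithms are more efficient on large graphs; this quadratic version favours readability.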
8 Evaluation
In this section, we examine the precision and runtime behaviour of the multipath analysis introduced in this paper. In order to study the behaviour of the analysis with respect to different flow constructs, we provide results for a subset of the PapaBench application (Nemer et al. 2006), Debie (Holsti et al. 2000), and the Mälardalen benchmarks (Gustafsson et al. 2010). We present the results for a subset of benchmarks whose behaviour is representative of that observed across all experiments or which illustrate interesting corner cases. Table 2 includes details for each benchmark on the maximum number of accesses, the distinct number of cache blocks, and the cyclomatic complexity Y of the CFG (without and with WCEP reduction), which lower-bounds the number of paths. Also given are the analysis runtimes with 4 and 8 relevant blocks.
The control-flow graph and address extraction were performed using the Heptane (Colin and Puaut 2001) analyser, from the compiled MIPS R2000/R3000 executable obtained using GCC v4.5.2 without optimisations. We used the different methods to evaluate the contribution of a 16-way fully-associative instruction cache with 32B lines.
The miss distribution for different benchmarks was computed using either the contention-based approach, the collecting approach with different numbers of relevant blocks R, or the reuse distance-based path merging method outlined by Davis et al. (2013). To provide a comparison with other methods and replacement policies, a state-of-the-art analysis (Theiling et al. 1999) was used to determine the single, predicted worst-case bound on the number of misses for an LRU cache using the same parameters. We also performed a set of \(10^8\) simulations of the random cache behaviour to use as a baseline, effectively providing a lower bound on the pWCET. Here, the successor to each vertex in the simulated path was picked randomly among all of its valid successors, thus exploring the possible paths.
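The simulation baseline can be pictured with the following sketch, which draws a random path through a CFG and counts misses in an evict-on-miss random replacement cache. All names are hypothetical, the CFG is assumed to be acyclic (loops already unrolled) so that every successor is valid, and a miss is assumed to evict a uniformly chosen line even when empty lines remain.

```python
import random


def random_path(cfg, entry, exit_node, rng):
    """Walk from entry to exit, picking each successor uniformly at random."""
    path, node = [entry], entry
    while node != exit_node:
        node = rng.choice(cfg[node])
        path.append(node)
    return path


def count_misses(trace, ways, rng):
    """Count misses of an evict-on-miss random replacement cache."""
    lines = [None] * ways              # the cache starts empty
    misses = 0
    for block in trace:
        if block not in lines:         # miss: overwrite a random line
            misses += 1
            lines[rng.randrange(ways)] = block
    return misses


def simulate(cfg, blocks, entry, exit_node, ways, runs, seed=0):
    """Histogram of miss counts over `runs` simulated executions."""
    rng = random.Random(seed)
    histogram = {}
    for _ in range(runs):
        path = random_path(cfg, entry, exit_node, rng)
        trace = [b for v in path for b in blocks[v]]
        m = count_misses(trace, ways, rng)
        histogram[m] = histogram.get(m, 0) + 1
    return histogram
```

Here `blocks` maps each CFG vertex to the list of cache blocks it accesses. Normalising the histogram by `runs` gives an empirical miss distribution which, as noted above, can only serve as a lower bound on the pWCET.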
Table 2 Properties of the analysed benchmarks and analysis runtime with R relevant blocks
Benchmark  Longest path (accesses)  Blocks  Runtime (s), \(R = 4\)  Runtime (s), \(R = 8\)  Y, reduction off  Y, reduction on
Mälardalen  
adpcm  35,010  240  556  3747  6281  3069 
bsort100  108,718  20  1545  31,301  9902  101 
bs  42  11  \(< 1\)  \(< 1\)  9  5 
cnt  1576  27  1  1  201  101 
compress  31,382  86  151  1047  3976  493 
crc  27,752  44  478  1023  4173  4169 
edn  67,631  166  549  17340  5  1 
expint  11,314  31  10  111  404  104 
fdct  841  106  \(< 1\)  2  1  1 
fft  18,409  141  78  432  609  587 
fibcall  125  8  \(< 1\)  \(< 1\)  2  1 
fir  992  22  \(< 1\)  2  31  11 
insertsort  769  16  \(< 1\)  1  1  1 
jfdctint  1059  96  \(< 1\)  4  65  1 
lcdnum  233  20  \(< 1\)  1  171  61 
ludcmp  3950  98  1  24  70  8 
matmult  63,839  28  481  5967  801  1 
minmax  26  22  1  \(< 1\)  9  5 
minver  726  167  2  1  7  1 
ndes  21,377  121  47  355  4219  1273 
nsichneu  2944  1377  107  103  1249  1 
ns  4349  20  1  33  2  2 
prime  5768  17  3  21  725  5 
qurt  1526  77  \(< 1\)  4  187  67 
select  1721  60  \(< 1\)  1  177  17 
sqrt  430  26  \(< 1\)  1  59  20 
statemate  1844  275  49  49  1841  1132 
st  67,538  163  127  780  971  221 
ud  2984  75  1  12  82  1 
Papabench  
t1  150  135  \(< 1\)  \(< 1\)  41  17 
t2  57  27  \(< 1\)  \(< 1\)  6  5 
t3  62  57  \(< 1\)  1  20  9 
t4  215  13  \(< 1\)  \(< 1\)  47  24 
t5  62  55  \(< 1\)  \(< 1\)  19  13 
t6  286  272  \(< 1\)  \(< 1\)  103  27 
t7  52  45  \(< 1\)  \(< 1\)  9  8 
t8  11  9  \(< 1\)  \(< 1\)  3  2 
t9  472  324  \(< 1\)  \(< 1\)  89  11 
t10  39,658  1073  2500  4742  16,602  10,513 
t11  11  9  \(< 1\)  \(< 1\)  6  4 
t12  33  34  \(< 1\)  \(< 1\)  18  10 
t13  581  675  \(< 1\)  \(< 1\)  204  26 
fly_by_wire  18,723  229  293  358  4355  1930 
Debie  
acquisition_task  18,664  205  18  490  3829  1273 
hit_trigger_handler  3367  83  4  9  671  471 
tc_execution_task  3131  417  3  13  368  251 
tc_interrupt_handler  77  91  1  1  39  27 
tm_interrupt_handler  24  30  2  2  9  7 
Estimated number of misses with LRU and random replacement caches
Benchmark  LRU  Random (\(10^{-7}\)), \(R = 4\)  Random (\(10^{-7}\)), \(R = 8\)
Mälardalen  
adpcm  1570  13,173  6097 
bsort100  39,518  41,642  25,319 
bs  17  35  32 
cnt  239  674  450 
compress  3564  7808  4058 
crc  248  7138  5693 
edn  5608  29,018  22,546 
expint  320  1253  1107 
fdct  840  842  842 
fft  16,847  15,259  15,050 
fibcall  8  22  22 
fir  33  291  161 
insertsort  16  304  91 
jfdctint  739  800  748 
lcdnum  214  211  209 
ludcmp  836  2310  1990 
matmult  30  17,812  1665 
minmax  24  27  27 
minver  171  427  335 
ndes  5524  13,101  10,882 
nsichneu  2943  2840  2844 
ns  21  1296  145 
prime  17  1591  58 
qurt  1406  1205  1193 
select  856  856  856 
sqrt  392  355  351 
statemate  1802  1749  1775 
st  1740  27,044  26,372 
ud  406  1435  1005 
Papabench  
t1  150  137  137 
t2  31  38  38 
t3  62  59  59 
t4  79  188  158 
t5  62  59  59 
t6  278  268  268 
t7  51  49  49 
t8  11  10  10 
t9  334  343  343 
t10  7421  18,825  14,506 
t11  11  11  11 
t12  33  32  32 
t13  581  559  559 
fly_by_wire  12,840  15,126  13,822 
Debie  
acquisition_task  4033  11,475  10,147 
hit_trigger_handler  1345  2534  2529 
tc_execution_task  262  1432  1060 
tc_interrupt_handler  65  73  73 
tm_interrupt_handler  21  27  27 
The capacity-conserving join heuristic, which allocates capacity to the cache states identified as the most valuable, dominates the standard implementation. When comparing the precision of the different analysis techniques in Sect. 8.1 we therefore rely on the most favourable configuration, i.e. with WCEP reduction active and using the capacity-conserving join. The impact of the different mechanisms, joins and WCEP reduction, is further considered in Sects. 8.2 and 8.3 respectively. Finally, the complexity and runtime for different analysis configurations are evaluated in Sect. 8.4.
8.1 Relative precision of the analysis techniques
In general, the use of the cache collecting method improves the precision of the analysis over the merging or purely contention-based approaches even on complex control flows, as illustrated by PapaBench t4 in Fig. 10c. On simple control flows, the two approaches behave similarly but the contention method still dominates the path merging method (see Fig. 10e). The merged path is as long as the longest path in the application but keeps the worst behaving accesses from shorter paths. When WCEP reduction can mostly extract the worst-case execution path, as with qurt in Fig. 10a, the main difference between the two approaches comes from the more precise estimation of the hit probability of individual accesses using contention methods.
The precision of the collecting methods and the relative performance of LRU and random caches mostly depend on the size of the working set of tasks w.r.t. the cache size or the number of relevant blocks. Similar behaviours were observed whether WCEP reduction successfully resulted in a single path or not. As the number of relevant blocks increases from 4 to 8, the estimates computed by the analysis improve. The gain is important on benchmarks like insertsort (see Fig. 10b) where some nested loops fit in the number of relevant blocks. However, precision is lost in qurt or ud w.r.t. the simulation results (see Fig. 10a, e) as the loops almost fit inside the cache but not within the number of relevant blocks. This also results in decreased performance w.r.t. LRU, which is in this case only subject to cold misses.
Another general observation is that, as expected, none of the distributions derived by analysis underestimates simulation. However, the simulation-based distributions cannot be guaranteed to be precise pWCET estimates. The simulations, lacking representative input data, may not exercise the worst-case paths. At best they provide lower bounds on the pWCET. We note that provision of representative input data is a key problem for measurement-based methods. There is no general conclusion regarding the dominance of the analysis of an LRU cache over simulation or analysis results for a randomised cache. When all iterative structures fit in the cache (see Fig. 10b), the LRU analysis outperforms the analysis of the random cache. As intra-loop conflicts grow, the benefits of the random replacement policy emerge and the new methods can capture such locality, resulting in tighter estimates than the analysis for a deterministic platform (see Fig. 10f). WCEP reduction reduces the reuse distance considered during analyses, whereas the stack distance for the LRU analysis remains the same since Theorem 7 does not apply. The path-merging approach under WCEP reduction may result in tighter estimates than the analysis of a deterministic replacement policy (see Fig. 10d).
The analysis results for the t4 and statemate benchmarks (see Fig. 10c, d) indicate that the cache collecting approach may sometimes compute more pessimistic estimates than the contention method. This behaviour stems from flow divergence in the control flow of both benchmarks. Path indeterminism hinders the relevant block heuristic: different blocks may be deemed relevant on parallel paths. In such cases, upon flow convergence, the join function cannot keep blocks of either alternative. Further, the R relevant blocks are still considered as occupying cache space from the point of view of the non-relevant ones, effectively reducing the cache size. This illustrates the need for more sophisticated heuristics which take into account the behaviour of the analysis on alternative paths, or vary the number of relevant blocks depending on the expected benefits and the computational cost.
In summary, our evaluation results show that the approaches to multipath SPTA derived in this paper dominate and significantly improve upon the state-of-the-art path merging approach, determining less than one third as many misses in some instances. They were also shown to be incomparable with LRU analysis.
8.2 Benefits of the join operations to collecting approaches
The selection of relevant blocks is undoubtedly an important factor in the precision of the cache collecting approach. We compared additional configurations of the analyser, assuming a fixed number of 8 relevant blocks, to examine the impact of the join operations on the precision of the analysis. In particular, the experiments presented in Fig. 11a–f introduce a non-state-conserving approach on path convergence. Using configuration empty (identified by blue pentagons) the cache contents are set to \(\emptyset \) on path convergence and the miss distribution is the maximum distribution of the alternative paths. The capacity configuration (identified by orange triangles) on the other hand corresponds to the use of the improved join operator. The intermediate line identifies the simple join operation we first introduced in Sect. 4.6 (purple squares).
Benchmarks which exhibit locality across branches of their conditionals benefit from the join function, as illustrated by crc, lcdnum, expint and compress in Fig. 11a, b, d and e respectively. The combination of both WCEP reduction and capacity conservation on flow convergence leads to tighter pWCET estimates for these four benchmarks. Reduction cannot remove all branches as they may not fall under the required constraints. The lcdnum benchmark is composed of a switch statement. The later cases share blocks with the conditions of the earlier ones, but add their own blocks. Hence, the resulting cache states differ but include each other. They can be captured by the capacity-conserving heuristic. By construction, the capacity-conserving join results in the tightest estimates and provides important improvements over the standard join on the crc application. The benefits of the capacity-conserving join over the standard one are more marginal on the compress benchmark (see Fig. 11e), which exhibits few branches with reused blocks not captured by the WCEP reduction.
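As a rough illustration of the empty configuration, the sketch below combines the miss distributions of two alternative paths by taking, for every miss count, the larger of the two exceedance probabilities. Reading "maximum distribution" as this upper envelope of exceedance functions is an assumption made for the example, as are the function name and the representation of a distribution as a dictionary from miss count to probability.

```python
def max_distribution(p, q):
    """Upper envelope of two miss distributions (dict: misses -> probability).

    The result upper-bounds both inputs: for every miss count x,
    its P(X >= x) is the larger of the two input values.
    """
    support = sorted(set(p) | set(q))

    def tail_probabilities(pmf):
        # P(X >= x) for each x in the joint support.
        tail, out = 0.0, {}
        for x in reversed(support):
            tail += pmf.get(x, 0.0)
            out[x] = tail
        return out

    tp, tq = tail_probabilities(p), tail_probabilities(q)
    envelope = [max(tp[x], tq[x]) for x in support] + [0.0]
    # Convert the envelope back into a probability mass function.
    result = {}
    for i, x in enumerate(support):
        mass = envelope[i] - envelope[i + 1]
        if mass > 0:
            result[x] = mass
    return result
```

For example, joining `{1: 0.5, 3: 0.5}` with `{2: 1.0}` gives `{2: 0.5, 3: 0.5}`: the result never reports fewer misses than either alternative at any exceedance level. In this configuration the cache contents after the join would simply be the empty state.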
8.3 Impact of WCEP reduction on analysis and simulation
Given a fixed configuration of the analysis (identified by a symbol and a colour), the distribution obtained with WCEP reduction is always smaller than the one obtained without it. In other words, the analysis is more precise when all transformations are active. Because of path redundancy, an increase in the number of relevant blocks can sometimes reduce the precision of the resulting estimate. This phenomenon still occurs when WCEP reduction is applied, but it is less prevalent.
The impact of the different transformations on the precision of the analysis results depends on the characteristics of the application to which they are applied. All transformations can be beneficial to benchmarks for the collecting approach. The contention approach may even benefit from empty path elimination (see Fig. 12a): when a block is accessed only on the non-empty alternative of a conditional, its reuse distance is lowered. For other accesses, such paths impact neither the reuse distance nor the contention as they hold no access. The elimination of redundant paths on the other hand increases the precision of the two methods.
The cnt benchmark, in Fig. 12b, illustrates an interesting scenario. When empty branch elimination is used in combination with WCEP unrolling, collecting methods get slightly less precise than when using WCEP unrolling on its own. This illustrates a limit of the ranking heuristic used by the capacity-conserving join. Empty branches result in a reduced minimum forward reuse distance for some accesses. This in turn impacts the allocation of capacity to cache states on path convergence, resulting in a better allocation without empty branch elimination.
We performed a set of \(10^8\) simulations on the control flow graphs of benchmarks with and without reduction. WCEP reduction results in greater measured execution time distributions. The transformations proposed in this paper eliminate some but not all redundant paths and reduce the set of possible paths to a set more focussed on worst-case scenarios. As for the analysis methods, the impact of each transformation depends on the benchmark to which it is applied. However, the application of WCEP reduction in the general case is not sufficient to guarantee the representative character of the resulting paths. In the case of the expanded compress benchmark, conditionals within loop structures are kept and there is no guarantee as to which alternation of paths results in the worst case. On the other hand, the expanded matmult benchmark consists of a single trace of accesses.
8.4 Execution time
The runtime of the analysis, using a C++ prototype implementation, is presented in Fig. 13 using the WCEP reduction method and 0 to 12 relevant blocks. Measurements were made on an 8-core 64-bit 3.4 GHz CPU using the Ubuntu 12.04 operating system, with 2 instances of the analyser running in parallel. WCEP reduction was used as it increases the precision of the estimated cache states, and also the analysis runtime. We observe a growth in runtime as the number of relevant blocks increases. The runtime of the analysis is also significantly higher for larger benchmarks, edn, compress, and ndes, which contain the largest number of nodes.
The complexity of the analysis is of the order of \(O(S\times m\times \textit{log}(m))\), where m is the number of accesses in the program and S upper-bounds the number of possible cache states. S is the number of combinations of N or fewer elements picked amongst the R relevant blocks; when \(R < N\), \(S = 2^R\). As demonstrated in the previous set of experiments, a limited number of relevant blocks is effective for typical cache associativities.
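The bound on the number of cache states can be checked numerically. The short sketch below sums the binomial coefficients \(\binom{R}{k}\) for k from 0 to \(\min (N, R)\), which is the count of subsets of at most N blocks among R relevant ones; the function name is hypothetical.

```python
from math import comb


def cache_state_bound(ways, relevant):
    """Number of subsets of at most `ways` blocks among `relevant` blocks."""
    return sum(comb(relevant, k) for k in range(min(ways, relevant) + 1))
```

With the 16-way cache used in the experiments, `cache_state_bound(16, 4)` is 16 and `cache_state_bound(16, 8)` is 256, i.e. \(2^R\) in both cases, while 12 relevant blocks already give 4096 states. When the associativity is the limiting factor, as in `cache_state_bound(2, 4)`, the bound drops to 11.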
8.4.1 Reducing the complexity of the approach
The results presented in Fig. 13 focussed on the impact of the number of cache states through its ties to the relevant blocks R. The number of accesses m in each benchmark is fixed. We evaluate the impact of m on the complexity of the analysis in Fig. 14. It presents the runtime of the analysis of a repeated sequence of n accesses while assuming the same 16-way cache as in our previous experiments. The number of blocks in the repeated sequence n, the number of relevant blocks R and the cache associativity N impact the possible number of cache states S and therefore the initial growth of the runtime. Once the set of cache states to consider stabilises, the runtime for the different configurations follows a similar \(m\times \textit{log}(m)\) growth curve.
We defined a simple algorithm to split a program into consecutive SESE regions with at least M non-guaranteed hits on their longest path (Sect. 7.4.2). Segments are analysed independently assuming an empty input cache, and the resulting pWCETs are convolved to compute that of the full program. This approach effectively reduces the set of cache states on region boundaries to the empty state, a safe over-approximation as defined in Sect. 4.5. The resulting analysis runtime for the largest benchmarks is presented in Fig. 15 assuming a segment size M of 1000 misses.
Program partitioning reduces the runtime of our method over the analysis of the program as a single segment (see Fig. 13). As the analysis is applied to same-sized regions in all cases, the runtime of all benchmarks follows a similar growth with the number of relevant blocks. The remaining differences in runtime come from several factors. First, the length of the program impacts the complexity of the final convolution operation of the pWCET of each segment. Second, the consecutive segments on a multipath program may hold more than M misses. Splits can only occur on a restricted set of vertices, namely those which postdominate the entry of the CFG. Further, as shown in Fig. 14, misses and the working set of each segment impact the number of cache states kept during analysis. Finally, flow complexity also increases analysis time as more paths need to be considered in a single segment.
Figures 16, 17 and 18 present the distributions computed by the analyses for a relevant subset of the considered configurations. They present the analysis results for \(R=8\) relevant blocks using a single or multiple segments (filled or hollow triangles respectively). They also include the results for 12 relevant blocks under partitioning (hollow blue pentagons), as the runtime of this configuration is below that of the \(R=8\) single segment one. Simulations and deterministic LRU analysis results are also included (resp. with green squares and a dark purple line). WCEP reduction is active in all cases, except LRU.
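The final composition step can be sketched as a discrete convolution. Each segment is assumed here to be summarised by a miss-count distribution (a dictionary from number of misses to probability) obtained from an empty input cache, so that segments can be treated as independent; the function names are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent miss counts."""
    out = {}
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out


def program_distribution(segments):
    """Convolve the per-segment distributions, in program order."""
    total = {0: 1.0}                  # no misses before the first segment
    for segment in segments:
        total = convolve(total, segment)
    return total
```

For two segments with distributions `{1: 0.5, 2: 0.5}` and `{0: 0.5, 1: 0.5}`, the program-level result is `{1: 0.25, 2: 0.5, 3: 0.25}`. The cost of each convolution grows with the product of the support sizes, which is one reason the length of the program still influences the runtime after partitioning.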
We observed that the fft benchmark (Fig. 18) only marginally benefits from an increase in the number of relevant blocks. The approximations on segment boundaries have almost no impact on the precision of the computed estimates given a fixed number of relevant blocks, \(R=8\). There is little reuse between the identified SESE regions in the program.
9 Conclusion and perspectives
The main contribution of this paper is the introduction of a more effective approach to multipath SPTA for systems that use a cache with an evict-on-miss random replacement policy. The methods presented in this paper build upon existing approaches for analysing single-path programs. We have pointed out where existing techniques for deterministic or probabilistic analyses could be applied to make improvements (Pasdeloup 2014; Maxim et al. 2012; Wegener 2012; Theiling et al. 1999).
We introduced conditions for the computation of valid upper-bounds on the possible cache states on control flow convergence and presented a compliant transfer function to illustrate the requirements. We further refined this join operation to improve the precision of the information kept on control flow convergence. This more sophisticated join operation relies on a heuristic ordering of cache states depending on their expected benefits in the upcoming accesses.
We also defined path redundancy, identifying path inclusion as a subcase of redundancy. Based on these results, we presented worst-case execution path (WCEP) reduction to reduce the set of paths explored by the analysis, improving the tightness of the resulting timing estimates. We identified and proved the validity of sufficient conditions for the application of access renaming. This transformation allows for the identification of redundant paths beyond simple inclusion.
Our evaluations show that the analysis derived is effective at capturing the cache locality exhibited by different applications. The new methods significantly outperform the existing path merging approaches, predicting less than a third as many misses in one of the benchmarks. More precise results can be attained at the cost of an increased, user-controlled, complexity. They are also incomparable to estimates for deterministic LRU caches. The program transformations introduced proved effective at improving the precision of all SPTA configurations; of the 48 analysed benchmarks, 17 show the same or better estimated performance with a random replacement cache while 31 perform better with an LRU cache.
9.1 Perspectives
This research can be extended in many ways. First, the transfer functions on control flow convergence compute valid bounds with regard to the ordering of cache states, but they exhibit pessimism: different but more complex ranking heuristics could spread the capacity of cache states over more appropriate candidates. Second, the complexity of operations on the abstract domain contributes to the increasing runtime of the analysis as it traverses deep flow graphs. Future work could look at the interaction between existing methods to balance the complexity and the precision of the analysis. Another avenue for improvement is the heuristic for the selection of relevant cache blocks. More advanced approaches might improve the tightness of the results, or even introduce a varying number of relevant blocks across the application to focus the analysis effort on a specified area of the code.
Our approach integrates both cache behaviour and worst-case path estimation. Flow facts regarding loop iterations can be taken care of during unrolling. We nevertheless intend to take more flow facts into account to increase the applicability of the approach and further improve the effect of WCEP reduction on path complexity. We also intend to investigate the use of static methods to improve the representative character of the considered paths, and as a consequence ensure the soundness and improve the precision of the measurement-based approaches. Finally, the application of static probabilistic timing analysis to more complex cache configurations, including multiple levels of cache, remains an open problem (Lesage et al. 2013).
Footnotes
1. This suits evict-on-miss caches which do not prioritize empty lines when filling the cache.
2. Note the precise execution time distribution is effectively that which would be observed by executing the trace an infinite number of times.
3. We assume a fully-associative cache, but this restriction can be lifted to set-associative caches through the independent analysis of each set.
References
 AlZoubi H, Milenkovic A, Milenkovic M (2004) Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite. In: Proceedings of the 42nd annual Southeast regional conference. ACM, New York, pp 267–272Google Scholar
 Alt M, Ferdinand C, Martin F, Wilhelm R (1996) Cache behavior prediction by abstract interpretation. In: Science of computer programming. Springer, Heidelberg, pp 52–66Google Scholar
 Altmeyer S, Davis RI (2014) On the correctness, optimality and precision of static probabilistic timing analysis. In: 17th Conference on Design, Automation and Test in Europe (DATE)Google Scholar
 Altmeyer S, CucuGrosjean L, Davis RI (2015) Static probabilistic timing analysis for realtime systems using random replacement caches. Real Time Syst 51:77–123CrossRefzbMATHGoogle Scholar
 Atanassov P, Puschner P (2001) Impact of DRAM refresh on the execution time of realtime tasks. In: Proceedings of IEEE international workshop on application of reliable computing and communication, pp 29–34Google Scholar
 Ballabriga C, Cassé H (2008) Improving the WCET computation time by IPET using control flow graph partitioning. In: 8th International workshop on worstcase execution time analysis (WCET)Google Scholar
 Bernat G, Burns A, Newby M (2005) Probabilistic timing analysis: an approach using copulas. J Embed Comput 1(2):179–184Google Scholar
 Bernat G, Colin A, Petters S (2002) WCET analysis of probabilistic hard realtime systems. In: 23rd IEEE realtime systems symposium (RTSS), pp 279–288Google Scholar
 Bernat G, Colin A, Petters S (2003) pWCET: a tool for probabilistic worstcase execution time analysis of realtime systems. Tech. Report YCS3532003, Department of Computer Science, The University of YorkGoogle Scholar
 Bhat B, Mueller F (2011) Making DRAM refresh predictable. Real Time Syst 47:430–453CrossRefGoogle Scholar
 Bourgade R, Ballabriga C, Cassé H, Rochange C, Sainrat P (2008) Accurate analysis of memory latencies for WCET estimation. In: 16th Conference on realtime and network systems (RTNS)Google Scholar
 Burns A, Edgar S (2000) Predicting computation time for advanced processor architectures. In: Proceedings of the 12th Euromicro conference on realtime systems (EuromicroRTS’00)Google Scholar
 Cazorla F, Quiñones E, Vardanega T, Cucu L, Triquet B, Bernat G, Berger E, Abella J, Wartel F, Houston M, Santinelli L, Kosmidis L, Lo C, Maxim D (2013) Proartis: probabilistically analysable realtime systems. ACM Trans Embed Comput Syst 1(2s):1–26CrossRefGoogle Scholar
 Chiou D, Chiouy D, Rudolph L, Rudolphy L, Devadas S, Devadasy S, Ang BS, Angz BS (2000) Dynamic cache partitioning via columnization. In: Proceedings of design automation conferenceGoogle Scholar
 Colin A, Puaut I (2001) A modular and retargetable framework for treebased WCET analysis. In: 13th Euromicro conference on realtime systems (ECRTS), pp 37–44Google Scholar
 CortexR4 and CortexR4F Technical Reference Manual (2010) http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.cortexr/index.html
 CucuGrosjean L (2013) Independence—a misunderstood property of and for probabilistic realtime systems. In: Alan Burns 60th anniversary, YorkGoogle Scholar
 CucuGrosjean L, Santinelli L, Houston M, Lo C, Vardanega T, Kosmidis L, Abella J, Mezzetti E, Quiones E, Cazorla FJ (2012) Measurementbased probabilistic timing analysis for multipath programs. In: 24th Euromicro conference on realtime systems (ECRTS), pp 91–101Google Scholar
 David L, Puaut I (2004) Static determination of probabilistic execution times. In: 16th Euromicro conference on realtime systems (ECRTS), pp 223–230, June 2004Google Scholar
 Davis RI, Santinelli L, Altmeyer S, Maiza C, CucuGrosjean L (2013) Analysis of probabilistic cache related preemption delays. In: 25th Euromicro conference on realtime systems (ECRTS)Google Scholar
 de Dinechin BD, van Amstel D, Poulhiès M, Lager G (2014) Timecritical computing on a singlechip massively parallel processor. In: Conference on Design, Automation & Test in Europe (DATE)Google Scholar
 Edgar S, Burns A (2001) Statistical analysis of WCET for scheduling. In: 22nd IEEE realtime systems symposium (RTSS ’01)Google Scholar
 Griffin D, Burns A (2010) Realism in statistical analysis of worst case execution times. In: 10th International workshop on worstcase execution time analysis (WCET’10), July 2010Google Scholar
 Griffin D, Lesage B, Burns A, Davis R (2014a) Lossy compression for static probabilistic timing analysis of random replacement caches. In: 22st International conference on realtime networks and systems (RTNS ’14)Google Scholar
 Griffin D, Lesage B, Burns A, Davis RI (2014b) Lossy compression for worstcase execution time analysis of PLRU caches. In: Proceedings of the 22nd international conference on realtime networks and systems (RTNS ’14)Google Scholar
 Grund D, Reineke J (2010) Precise and efficient FIFOreplacement analysis based on static phase detection. In: the 22nd Euromicro conference on realtime systems (ECRTS ’10), July 2010Google Scholar
 Grund D, Reineke J (2010) Toward precise PLRU cache analysis. In: 10th International workshop on worstcase execution time analysis (WCET’10), pp 28–39, July 2010Google Scholar
 Gustafsson J, Betts A, Ermedahl A, Lisper B (2010) The Mälardalen WCET benchmarks—past, present and future. In: Proceedings of the 10th international workshop on worstcase execution time analysis (WCET), pp 137–147Google Scholar
 Hahn S, Grund D (2012) Relational cache analysis for static timing analysis. In: 2012 24th Euromicro conference on realtime systems, pp 102–111Google Scholar
 Hahn S, Reineke J, Wilhelm R (2015) Towards compositionality in execution time analysis: definition and challenges. In: SIGBED Review, vol 12. ACM, New York, pp 28–36Google Scholar
 Hennessy JL, Patterson DA (2011) Computer architecture: A quantitative approach, 5th edn. Morgan Kaufmann, BurlingtonGoogle Scholar
 Holsti N, Lngbacka T, Saarinen S (2000) Using a worstcase execution time tool for realtime verification of the DEBIE software. In: Proceedings of the DASIA 2000 (data systems in aerospace) conferenceGoogle Scholar
 Huynh BK, Ju L, Roychoudhury A (2011) Scopeaware data cache analysis for WCET estimation. In: 17th Realtime and embedded technology and applications symposium (RTAS)Google Scholar
 Kosmidis L, Abella J, Quiñones E, Cazorla FJ (2013) A cache design for probabilistically analysable realtime systems. In: 16th conference on Design, Automation and Test in Europe (DATE), pp 513–518Google Scholar
 Kosmidis L, Abella J, Wartel F, Quinones E, Colin A, Cazorla F (2014) PUB: path upperbounding for measurementbased probabilistic timing analysis. In: 26th Euromicro conference on realtime systems (ECRTS)Google Scholar
 Lesage B, Griffin D, Davis R, Altmeyer S (2013) On the application of static probabilistic timing analysis to memory hierarchies. In: Realtime scheduling open problems seminar (RTSOPS)Google Scholar
 Lesage B, Griffin D, Altmeyer S, Davis R (2015a) Static probabilistic timing analysis for multipath programs. In: Realtime systems symposium (RTSS)Google Scholar
 Lesage B, Griffin D, Soboczenski F, Bate I, Davis RI (2015b) A framework for the evaluation of measurementbased timing analyses. In: 23rd International conference on real time and networks systems (RTNS)Google Scholar
 Li YT, Malik S (1997) Performance analysis of embedded software using implicit path enumeration. Trans Comput Aided Des Integr Circuit Syst 16:1477–1487CrossRefGoogle Scholar
 Liang Y, Mitra T (2008) Cache modeling in probabilistic execution time analysis. In: Proceedings of the 45th annual design automation conference (DAC), pp 319–324Google Scholar
 López J, Díaz J, Entrialgo J, García D (2008) Stochastic analysis of real-time systems under preemptive priority-driven scheduling. Real Time Syst 40:180–207
 Maxim D, Houston M, Santinelli L, Bernat G, Davis RI, Cucu-Grosjean L (2012) Resampling for statistical timing analysis of real-time systems. In: 20th International conference on real-time and network systems (RTNS), pp 111–120
 MPC8641D Integrated Host Processor Family Reference Manual (2008) http://www.nxp.com/products/microcontrollersandprocessors/powerarchitectureprocessors/integratedhostprocessors/highperformancedualcoreprocessor:MPC8641D?fpsp=1&tab=Documentation_Tab
 Muchnick SS (1997) Advanced compiler design and implementation. Morgan Kaufmann, San Francisco
 Nemer F, Cassé H, Sainrat P, Bahsoun JP, De Michiel M (2006) PapaBench: a free real-time benchmark. In: 6th International workshop on worst-case execution time analysis (WCET’06), vol 4 of OpenAccess Series in Informatics (OASIcs)
 Pasdeloup B (2014) Static probabilistic timing analysis of worst-case execution time for random replacement caches. Tech. Report, INRIA, Rennes
 Peleska J, Löding H (2008) Static analysis by abstract interpretation. University of Bremen, Centre of Information Technology, Bremen
 Puschner P, Koza C (1989) Calculating the maximum execution time of real-time programs. Real Time Syst 1(2):159–176
 Quiñones E, Berger ED, Bernat G, Cazorla FJ (2009) Using randomized caches in probabilistic real-time systems. In: 21st Euromicro conference on real-time systems (ECRTS), pp 129–138
 Reineke J (2014) Randomized caches considered harmful in hard real-time systems. LITES 1(1):03:1–03:13
 Reineke J, Wachter B, Thesing S, Wilhelm R, Polian I, Eisinger J, Becker B (2006) A definition and classification of timing anomalies. In: 6th International workshop on worst-case execution time (WCET) analysis
 Spreitzer R, Plos T (2013) Cache-access pattern attack on disaligned AES T-tables. In: Proceedings of the 4th international conference on constructive side-channel analysis and secure design (COSADE’13), pp 200–214
 Theiling H, Ferdinand C, Wilhelm R (1999) Fast and precise WCET prediction by separated cache and path analyses. Real Time Syst 18:157–179
 Wang Z, Lee RB (2007) New cache designs for thwarting software cache-based side channel attacks. In: Proceedings of the 34th annual international symposium on computer architecture (ISCA ’07). ACM, New York, pp 494–505
 Wang Z, Lee RB (2008) A novel cache architecture with enhanced performance and security. In: Proceedings of the 41st annual IEEE/ACM international symposium on microarchitecture (MICRO 41), pp 83–93
 Wegener S (2012) Computing same block relations for relational cache analysis. In: 12th International workshop on worst-case execution time analysis, pp 25–37
 Wilhelm R, Engblom J, Ermedahl A, Holsti N, Thesing S, Whalley D, Bernat G, Ferdinand C, Heckmann R, Mitra T, Mueller F, Puaut I, Puschner P, Staschulat J, Stenström P (2008) The worst-case execution-time problem: overview of methods and survey of tools. ACM Trans Embed Comput Syst 7(3):1–53
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.