Extensions

This paper builds upon previous work published in RTSS 2015 (Lesage et al. 2015a) with the following extensions:

  • we introduce and prove additional properties relevant to the comparison of the contribution of different cache states to the probabilistic worst-case execution time of tasks in Sect. 3;

  • an improved join transfer function, used to safely merge states from converging paths, is introduced in Sect. 5 and by construction dominates the simple join introduced in Lesage et al.  (2015a);

  • we present and prove the validity of path renaming in Sect. 6 which allows the definition of additional transformations to reduce the set of paths considered during analysis;

  • our evaluation explores new configurations in terms of both the analysis methods used and the benchmarks considered (see Sect. 7).

1 Introduction

Real-time systems such as those deployed in space, aerospace, automotive and railway applications require guarantees that the probability of the system failing to meet its timing constraints is below an acceptable threshold (e.g. a failure rate of less than \(10^{-9}\) per hour for some aerospace and automotive applications). Advances in hardware technology and the large gap between processor and memory speeds, bridged by the use of cache, make it difficult to provide such guarantees without significant over-provision of hardware resources.

The use of deterministic cache replacement policies means that pathological worst-case behaviours need to be accounted for, even when in practice they may have a vanishingly small probability of actually occurring. The use of a cache with a random replacement policy means that the probability of pathological worst-case behaviours can be upper bounded at quantifiably extremely low levels, for example well below the maximum permissible failure rate (e.g. \(10^{-9}\) per hour) for the system. This allows the extreme worst-case behaviours to be safely ignored, rather than always being included in the estimated worst-case execution times.

The random replacement policy further offers a trade-off between performance and cost, owing to its minimal hardware cost (Al-Zoubi et al. 2004). The policy and its variants have been implemented in a selection of embedded processors (Hennessy and Patterson 2011), such as the ARM Cortex series (2010) and the Freescale MPC8641D (2008). Randomisation further offers some level of protection against side-channel attacks, which exploit the leakage of information regarding the running tasks. While methods relying solely on the random replacement policy may still be circumvented (Spreitzer and Plos 2013), the definition of probabilistic timing analysis is a step towards the analysis of other approaches such as randomised placement policies (Wang and Lee 2007; 2008).

The timing behaviour of programs running on a processor with a cache using a random replacement policy can be determined using static probabilistic timing analysis (SPTA). SPTA computes an upper bound on the probabilistic Worst-Case Execution Time (pWCET) in terms of an exceedance function. This exceedance function gives the probability, as a function of all possible values for an execution time budget x, that the execution time of the program will exceed that budget on any single run. The reader is referred to Davis et al.  (2013) for examples of pWCET distributions, and to Cucu-Grosjean (2013) for a detailed discussion of what is meant by a pWCET distribution.

This paper introduces an effective SPTA for multi-path programs running on hardware that uses an evict-on-miss, random replacement cache. Prior work on SPTA for multi-path programs by Davis et al.  (2013) used a path merging approach to compute cache hit probabilities based on reuse distances. The analysis derived in this paper builds upon more sophisticated SPTA techniques for the analysis of single path programs given by Altmeyer and Davis (2014, 2015). This new analysis provides substantially improved results compared to the path merging approach. To allow the analysis of the behaviour of caches in isolation, we assume the existence of a valid decomposition of the architecture with regards to cache effects with bounded hit and miss latencies (Hahn et al. 2015).

1.1 Related work

We now set the work on SPTA in context with respect to related work on both probabilistic hard real-time systems and cache analysis for deterministic replacement policies. The methods introduced in this paper belong to the realm of analyses that estimate bounds on the execution time of a program. These bounds may be classified as either a worst-case probability distribution (pWCET) or a worst-case value (WCET).

The first class is a more recent research area; the first work providing bounds described by probability distributions was published by Edgar and Burns (2000, 2001). The methods for obtaining such distributions can be categorised into three different families: measurement-based probabilistic timing analyses, static probabilistic timing analyses, and hybrid probabilistic timing analyses.

The second class is a mature area of research and the interested reader may refer to Wilhelm et al.  (2008) for an overview of these methods. A specific overview of cache analysis for deterministic replacement policies together with a comparison between deterministic and random cache replacement policies is provided at the end of this section.

1.1.1 Probabilistic timing analyses

Measurement-based probabilistic timing analyses (Bernat et al. 2002; Cucu-Grosjean et al. 2012) collect observations on the execution time of the task under study on the target hardware. These observations are then combined, e.g. through the use of extreme value theory (Cucu-Grosjean et al. 2012), to produce the desired worst-case probabilistic timing estimate. Extreme value theory may potentially underestimate the pWCET of a program, as shown by Griffin and Burns (2010). The work of Cucu-Grosjean et al. (2012) overcomes this limitation and also introduces the appropriate statistical tests required to treat worst-case execution times as rare events. The soundness of the results produced by such methods is tied to the observed execution times, which should be representative of those at runtime. This places a responsibility on the user, who is expected to provide input data that exercises the worst-case paths, lest the analysis result in unsound estimates (Lesage et al. 2015b). These methods nonetheless exhibit the benefits of time-randomised architectures: the occurrence probability of pathological temporal cases can be bounded and safely ignored provided they meet requirements expressed in terms of failure rates.

Path upper-bounding (Kosmidis et al. 2014) defines a set of program transformations to alleviate the responsibility of the user to provide inputs which cover all execution paths. The alternative paths of conditional constructs are padded with semantic-preserving instructions and memory accesses such that any path followed in the modified program upper-bounds any of the original alternatives. Measurement-based analyses can then be performed on the modified program, as the paths exercised at runtime upper-bound any alternative in the original application. Hence, upper-bounding creates a distinction between the original code and the measured one. It may also result in paths whose costs are the sum of those of the original alternatives.

Hybrid probabilistic timing analyses are methods that apply measurement-based methods at the level of sub-programs or blocks of code and then operations such as convolution to combine these bounds to obtain a pWCET for the entire program. The main principles of hybrid analysis were introduced by Bernat et al. (2002, 2003) with execution time probability distributions estimated at the level of sub-programs. Here, dependencies may exist among the probability distributions of the sub-programs and copulas are used to describe them (Bernat et al. 2005).

By contrast, SPTAs derive the pWCET distribution for a program by analysing the structure of the program and modelling the behaviour of the hardware it runs on. Existing work on SPTA has primarily focussed on randomised architectures containing caches with random replacement policies. Initial results for the evict-on-miss (Quinones et al. 2009) and evict-on-access (Cucu-Grosjean et al. 2012; Cazorla et al. 2013) policies were derived for single-path programs. These methods use the reuse distance of each access to determine its probability of being a cache hit. These results were superseded by later work by Davis et al. (2013), who derived an optimal lower bound on the probability of a cache hit under the evict-on-miss policy, and showed that evict-on-miss dominates evict-on-access. Altmeyer and Davis (2014) proved the correctness of the lower bound derived in Davis et al. (2013), and its optimality with regards to the limited information that it uses (i.e. the reuse distance). They also showed that the probability functions previously given in Kosmidis et al. (2013) and Quinones et al. (2009) are unsound (optimistic) for use in SPTA. In 2013, a simple SPTA for multi-path programs was introduced by Davis et al. (2013), based on path merging. With this method, accesses are represented by their reuse distances. The program is then virtually reduced to a single sequence which upper-bounds all possible paths with regards to the reuse distance of their accesses.

In 2014, more sophisticated SPTA methods for single path programs were derived by Altmeyer and Davis (2014). They introduced the notion of cache contention, which combined with reuse distance enables the computation of a more precise bound on the probability that a given access is a cache hit. Altmeyer and Davis (2014) also introduced a significantly more effective method based on combining exhaustive evaluation of the cache behaviour for a limited number of relevant memory blocks with cache contention. This method provides an effective trade-off between analysis precision and tractability. Griffin et al. (2014a) introduced orthogonal lossy compression methods on top of the cache state enumeration to improve the trade-off between complexity and precision.

Altmeyer and Davis further refined their approach to SPTA for single path programs in 2015 (Altmeyer et al. 2015), bridging the gap between contention and enumeration-based analyses. The method relies on simulation of the behaviour of a random replacement cache. In contrast to exhaustive state analyses, however, focus is set at each step on a single cache state which captures the outcome across all possible states. The resulting approach offers improved precision over contention-based methods, at a lower complexity than exhaustive state analyses.

In this paper, we build upon the state-of-the-art approach (Altmeyer and Davis 2014), extending it to multi-path programs. The techniques introduced in the following notably allow for the identification, on control-flow convergence, of relevant cache contents, i.e. the identification of the possible outcomes in multi-path programs. The approach focuses on the enumeration of possible cache states at each point in the program. To reduce the complexity of such an approach, only a few blocks, identified as the most relevant, are analysed at a given time.

1.1.2 Deterministic architectures and analyses

Static timing analysis for deterministic caches (Wilhelm et al. 2008) relies on a two-step approach with a low-level analysis to classify the cache accesses into hits and misses (Theiling et al. 1999) and a high-level analysis to determine the length of the worst-case path (Li and Malik 2006). The most common deterministic replacement policies are least-recently used (LRU), first-in first-out (FIFO) and pseudo-LRU (PLRU). Due to the high predictability of the LRU policy, academic research typically focusses on LRU caches, with a well-established LRU cache analysis based on abstract interpretation (Alt et al. 1996; Theiling et al. 1999). Only recently have analyses for FIFO (Grund and Reineke 2010) and PLRU (Grund and Reineke 2010; Griffin et al. 2014b) been proposed, both with higher complexity and lower precision than the LRU analysis due to specific features of these replacement policies. Despite the focus on LRU caches and their analysability, FIFO and PLRU are often preferred in processor designs due to the lower implementation costs, which enable higher associativities.

Recently, Reineke  (2014) observed that SPTA based on reuse distances (Davis et al. 2013) results, by construction, in less precise bounds than existing analyses based on stack distance for an equivalent system with a LRU cache (Wilhelm et al. 2008). However, this does not hold for the more sophisticated SPTA based on cache contention and collecting semantics given by Altmeyer and Davis  (2014). Analyses for deterministic LRU caches are incomparable with these analyses for random replacement caches. This is illustrated by our evaluation results. It can also be seen by considering simple examples such as a repeated sequence of accesses to five memory blocks \(\langle a,b,c,d,e,a,b,c,d,e\rangle \) with a four-way associative cache. With LRU, no hits can be predicted. By contrast, with a random replacement cache and SPTA based on cache contention, four out of the last five accesses can be assumed to have a non-zero probability of being a cache hit (as shown in Table 1 of Altmeyer and Davis 2014), hence SPTA for a random replacement cache outperforms analysis of LRU in this case. We note that in spite of recent efforts (de Dinechin et al. 2014) the stateless random replacement policies have lower silicon costs than LRU, and so can potentially provide improved real-time performance at lower hardware cost.

Early work (David and Puaut 2004; Liang and Mitra 2008) in the domain of SPTA for deterministic architectures relied for its correctness on knowledge of the probability that a specific path would be taken or that specific input data would be encountered; however, in general such assumptions may not be available. The analysis given in this paper does not require any assumption about the probability distribution of different paths or inputs. It relies only on the random selection of cache lines for replacement.

1.2 Organisation

In this paper, we introduce a set of methods that are required for the application of SPTA to multi-path programs. Section 2 recaps the assumptions and methods upon which we build. These were used in previous work (Altmeyer and Davis 2014) to upper-bound the pWCET distribution of a trace corresponding to a single path program. We then proceed by defining key properties which allow the ordering of cache states w.r.t. their contribution to the pWCET of a program (Sect. 3). We address the issue of multi-path programs in the context of SPTA in Sect. 4. This includes the definition of conservative (over-approximate) join functions to collect information regarding cache contention, possible cache contents, and the pWCET distribution at each program point, irrespective of the path followed during execution. Further improvements on cache state conservation at control flow convergence are introduced in Sect. 5. Section 6 introduces simple program transformations which improve the precision of the analysis while ensuring that the pWCET distribution of the transformed program remains sound (i.e. upper-bounds that of the original). Multi-path SPTA is applied to a selection of benchmarks in Sect. 7, and the precision and run-time of the different approaches are compared. Section 8 concludes with a summary of the main contributions of the paper and a discussion of future work.

2 Static probabilistic timing analysis

In this section, we recap the state-of-the-art SPTA techniques for single path programs (Altmeyer and Davis 2014). We first give an overview of the system model assumed throughout the paper in Sect. 2.1. We then recap the existing methods (Altmeyer and Davis 2014) to evaluate the pWCET of a single path trace using a collecting approach (Sect. 2.2) supplemented by a contention-based one. The pertinence of the model is discussed at the end of this section. The notation introduced in the present contributions is summarised in Table 1.

We assume an architecture for which a valid decomposition exists with regards to the cache, such that its timing contribution can be analysed in isolation from other components (Hahn et al. 2015). Further, the overall execution time penalties emanating from cache misses and hits are assumed to be bounded by the latencies used in the analysis. Thus a local worst-case, a miss in the context of the cache, can be added to the local worst-case for other components to obtain a bound on the global worst-case (Reineke et al. 2006). This enables analysis of the impact of the cache in isolation from other architectural features.

Table 1 Summary of introduced notations

2.1 Cache model

We assume a single level, private, N-way fully-associative cache with an evict-on-miss random replacement policy. On an access, should the requested memory block be absent from the cache then the contents of a randomly selected cache line are evicted. The requested memory block is then loaded into the selected location. Given that there are N ways, the probability of any given cache line being selected by the replacement policy is \(\frac{1}{N}\). We assume a fixed upper-bound on the hit and miss latencies, denoted by \(\mathcal {H}\) and \(\mathcal {M}\) respectively, such that \(\mathcal {H} < \mathcal {M}\). (We note that the restriction to a fully-associative cache can be easily lifted for a set-associative cache through the analysis of each cache set as an independent fully-associative cache.)
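For concreteness, the assumed cache behaviour can be captured in a few lines of Python. The following is a minimal simulation sketch, not part of the analysis itself; the latency values are hypothetical placeholders:

```python
import random

class RandomReplacementCache:
    """N-way fully-associative cache with evict-on-miss random replacement."""

    def __init__(self, ways, hit_latency=1, miss_latency=10):
        assert hit_latency < miss_latency     # H < M, as assumed above
        self.ways = ways                      # N
        self.lines = [None] * ways            # None denotes an empty line
        self.H, self.M = hit_latency, miss_latency

    def access(self, block):
        """Return the latency incurred by an access to `block`."""
        if block in self.lines:               # hit: contents left unchanged
            return self.H
        victim = random.randrange(self.ways)  # each line evicted with prob. 1/N
        self.lines[victim] = block            # load block into the chosen line
        return self.M
```

Summing the latencies returned by `access` over a trace yields one sample of the execution time distribution that SPTA upper-bounds.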

2.2 Collecting semantics

We now recap on the collecting semantics introduced by Altmeyer and Davis  (2014) as a more precise but more complex alternative to the contention-based method of computing pWCET estimates. This approach performs exhaustive cache state enumeration for a selection of relevant accesses, hence providing tight analysis results for those accesses. To prevent state explosion, at each point in the program no more than R memory blocks are relevant at the same time. The relevant accesses are ones heuristically identified as benefiting the most from a precise analysis.

A trace t is defined as an ordered sequence \([e_1,\ldots ,e_n]\) of n accesses to memory blocks, such that \(e_i = e_j\) if accesses \(e_i\) and \(e_j\) target the same memory block. If access \(e_i\) is relevant, the block it accesses will be considered relevant until the next non-relevant access to the same block. The precise approach is only applied for relevant accesses while the contention-based method outlined in Sect. 2.2.1 is used for the others, identified as \(\bot \) in the trace of relevant blocks. The set of elements in a trace becomes \(\mathbb {E}^{\bot } = \mathbb {E} \cup \{\bot \}\).

The abstract domain of the analysis is a set of cache states. A cache state is a triplet \(CS = (C,P,\mathcal {D})\) with cache contents C, a corresponding probability \(P \in \mathbb {R}, 0 < P \le 1\), and a miss distribution \(\mathcal {D}:\mathbb {N} \rightarrow \mathbb {R}\) when the cache is in state C. C is a set of at most N memory blocks picked from \(\mathbb {E}\). A cache state which holds less than N memory blocks represents partial knowledge about the cache contents, without any distinction between empty lines and unknown contents. The set of all cache states is denoted by \(\mathbb {CS}\). Miss distribution \(\mathcal {D}\) captures, for each possible number of misses n, the probability that n misses occurred from the beginning of the program up to the current point in the program. The method computes all possible behaviours of the random cache with the associated probabilities. It is thus correct by construction, as it simply enumerates all states exhaustively.

The analysis starts from the empty cache state \(\{(\emptyset , 1, \mathcal {D}_{\textit{init}})\}\) where

$$\begin{aligned} \mathcal {D}_{\textit{init}}(x) = \left\{ \begin{array}{ll} 1 &{} \quad \textit{if}\; x = 0\\ 0 &{} \quad \textit{otherwise} \end{array}\right. \end{aligned}$$
(1)

The update function u describes the update of a single cache state upon access to element \(e \in \mathbb {E}^{\bot }\). Upon accessing a relevant element \(e \ne \bot \), if e is present in the cache, its contents are left unchanged. Otherwise, new cache states need to be generated, considering that each element may be evicted with the same probability \(\frac{1}{N}\) (in the \(\textit{evict}\) function). A miss is accounted for in the resulting distributions \(\mathcal {D}'\) only upon a miss on a relevant access. Formally:

$$\begin{aligned}&u:\mathbb {CS} \times \mathbb {E}^{\bot } \rightarrow 2^{\mathbb {CS}} \end{aligned}$$
(2)
$$\begin{aligned} u((C,P,\mathcal {D}),e)= & {} \left\{ \begin{array}{ll} \{(C,P,\mathcal {D})\} &{} \quad \textit{if}\;e \in C \wedge e \ne \bot \\ \textit{evict}((C,P,\mathcal {D}), e) &{} \quad \textit{otherwise} \end{array}\right. \end{aligned}$$
(3)
$$\begin{aligned} \textit{evict}((C,P,\mathcal {D}),e)= & {} \left\{ \begin{array}{ll} \{(C{\setminus } \{e'\} \cup \{e\}, P\cdot \frac{1}{N},\mathcal {D}')\mid e' \in C\} \cup \{(C \cup \{e\}, P\cdot \frac{N - |C|}{N},\mathcal {D}')\} &{} \quad \textit{if} \;e \ne \bot \\ \{(C{\setminus } \{e'\} , P\cdot \frac{1}{N},\mathcal {D}')\mid e' \in C\} \cup \{(C, P\cdot \frac{N - |C|}{N},\mathcal {D}')\} &{} \quad \textit{if} \;e = \bot \end{array}\right. \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {D}'(x)= & {} \left\{ \begin{array}{ll} \mathcal {D}(x) &{} \quad \textit{if}\; e = \bot \\ 0 &{} \quad \textit{if}\; x = 0 \\ \mathcal {D}(x-1) &{} \quad \textit{otherwise} \end{array}\right. \end{aligned}$$
(5)

The \(\textit{evict}(s,e)\) function creates N different cache states, one per possible evicted element, some of which might represent the same cache contents. To reduce the state space, a merge operation \(\biguplus \) combines two cache states if they contain exactly the same memory blocks. If merging occurs, each distribution is weighted by its probability:

$$\begin{aligned}&\biguplus : 2^\mathbb {CS} \rightarrow 2^\mathbb {CS}\end{aligned}$$
(6)
$$\begin{aligned} \biguplus \left( \left\{ \begin{array}{c} (C_0, P_0, \mathcal {D}_0) \\ \vdots \\ (C_n, P_n, \mathcal {D}_n) \end{array} \right\} \right)= & {} \Big \{\textit{Merge}\left( \left\{ (C_i,P_i,\mathcal {D}_i) \vert C_i = C_j\right\} \right) \Big \vert 0 \le j \le n \Big \}\nonumber \\ \end{aligned}$$
(7)
$$\begin{aligned} \textit{Merge}\left( \left\{ \begin{array}{c} (C_0, P_0, \mathcal {D}_0) \\ \vdots \\ (C_n, P_n, \mathcal {D}_n) \end{array} \right\} \right)= & {} \left( C_0, \sum _{i=0}^{n} P_i, \sum _{i=0}^{n}\frac{P_i}{\sum _{k=0}^{n}P_k}\cdot \mathcal {D}_i\right) \end{aligned}$$
(8)

where \(p\cdot \mathcal {D}\) denotes the multiplication of the elements of distribution \(\mathcal {D}\), \((p\cdot \mathcal {D})(x) = p\cdot \mathcal {D}(x)\), and \(\mathcal {D}_1 + \mathcal {D}_2\) is the summation of two distributions, \((\mathcal {D}_1 + \mathcal {D}_2)(x) = \mathcal {D}_1(x) + \mathcal {D}_2(x)\).

The update function can be defined for a set of cache states using the update function u for a single cache state and the \(\uplus \) merge operator as follows:

$$\begin{aligned}&U:2^{\mathbb {CS}} \times \mathbb {E}^{\bot } \rightarrow 2^{\mathbb {CS}} \end{aligned}$$
(9)
$$\begin{aligned}&\quad U(S,e) = \biguplus \{u(\textit{CS},e)\mid \textit{CS} \in S\} \end{aligned}$$
(10)
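The functions (2)–(10) translate almost directly into code. The sketch below is a transcription under the following assumptions: a cache state is a triple of a frozenset of contents, an occurrence probability and a miss distribution (a dict from miss counts to probabilities), with `None` standing for the non-relevant element \(\bot \):

```python
from collections import defaultdict
from fractions import Fraction

def u(state, e, N):
    """Update (3): a hit on a relevant element leaves the state unchanged."""
    C, P, D = state
    if e is not None and e in C:
        return [state]
    return evict(state, e, N)

def evict(state, e, N):
    """Eviction (4): one successor per victim block, plus one state for the
    N - |C| lines with empty or unknown contents; (5) shifts the misses."""
    C, P, D = state
    D2 = {m + 1: p for m, p in D.items()} if e is not None else dict(D)
    new = frozenset() if e is None else frozenset([e])
    succ = [(frozenset(C - {v}) | new, P * Fraction(1, N), D2) for v in C]
    succ.append((C | new, P * Fraction(N - len(C), N), D2))
    return succ

def U(states, e, N):
    """Set-level update (10): apply u, then merge identical contents (6)-(8)."""
    buckets = defaultdict(list)
    for s in states:
        for C, P, D in u(s, e, N):
            buckets[C].append((P, D))
    merged = []
    for C, group in buckets.items():
        Ptot = sum(p for p, _ in group)
        Dtot = defaultdict(Fraction)
        for p, D in group:
            for m, q in D.items():
                Dtot[m] += (p / Ptot) * q   # weighted merge, as in (8)
        merged.append((C, Ptot, dict(Dtot)))
    return merged
```

Folding `U` over a trace, starting from the empty state `[(frozenset(), Fraction(1), {0: Fraction(1)})]` of (1), reproduces the exhaustive enumeration described above.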

Given \(S_{\textit{res}}\) the set of cache states at the end of the execution of a trace t, the miss distribution \(\hat{\mathcal {D}}_{\textit{miss}}\) of the relevant blocks in t is the sum of the individual distributions of each cache state weighted by their probability of occurrence:

$$\begin{aligned} \hat{\mathcal {D}}_{\textit{miss}} = \sum \{P \cdot \mathcal {D} \mid (C,P,\mathcal {D}) \in S_{\textit{res}} \} \end{aligned}$$
(11)

The corresponding execution time distribution, \(\hat{\mathcal {D}}\), can then be derived, for a trace of n accesses, as follows:

$$\begin{aligned} \hat{\mathcal {D}}\left( m \times \mathcal {M} + (n-m)\times \mathcal {H}\right) =\hat{\mathcal {D}}_{\textit{miss}}(m) \end{aligned}$$
(12)
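Equations (11) and (12) amount to a weighted sum followed by a change of variable, as in the following sketch (H and M are placeholder latencies):

```python
def final_miss_distribution(S_res):
    """Weighted sum (11) of the per-state miss distributions."""
    D_miss = {}
    for C, P, D in S_res:
        for m, q in D.items():
            D_miss[m] = D_miss.get(m, 0) + P * q
    return D_miss

def time_distribution(D_miss, n, H=1, M=10):
    """Execution time distribution (12) for a trace of n accesses,
    m of which are misses."""
    return {m * M + (n - m) * H: p for m, p in D_miss.items()}
```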

2.2.1 Non-relevant blocks analysis

One possible naive approach for non-relevant blocks would be to classify them as misses in the cache and add the resulting latency to the previously computed distributions. The collecting approach proposed by Altmeyer and Davis  (2014) relies on the application of the contention methods to estimate the behaviour of the non-relevant blocks in a trace. Each access in a trace has a probability of being a cache hit \(P(e_i^{\textit{hit}})\), and of being a cache miss \(P(e_i^{\textit{miss}})=1-P(e_i^{\textit{hit}})\). These methods rely on different metrics to lower-bound the hit probability of each access such that the derived bound can be soundly convolved.

The reuse distance \(\textit{rd}(e)\) of element e is the maximum number of accesses to consecutively different blocks since the last access to the same block. It captures an upper-bound on the maximum number of possible evictions between two accesses to the same block, similarly to the stack distance for LRU caches. It differs from the stack distance in that accesses to the same intermediate block may be accounted for multiple times, as that block may have been evicted during the access sequence. Should there be no prior access to the same block, the reuse distance is defined as \(\infty \). Given the set of all traces \(\mathbb {T}\) and of all elements \(\mathbb {E}\), the reuse distance is formally defined as:

$$\begin{aligned}&rd:\mathbb {E}\times \mathbb {T} \rightarrow \mathbb {N} \cup \{ \infty \}\nonumber \\&\textit{rd}(e_i,[e_1,\ldots ,e_{i-1}]) = {\left\{ \begin{array}{ll} |\{k | j< k< i \wedge e_k \ne e_{k-1}\}| &{} \quad \textit{if}\; e_i = e_j \wedge \\ &{} \quad \forall k: j< k < i, e_i \ne e_k\\ \infty &{} \quad \textit{otherwise} \end{array}\right. }\nonumber \\ \end{aligned}$$
(13)

Note that this definition of the reuse distance is a variation of the one proposed in earlier work. The revised equation (13) computes the same property, but has to discard successive accesses to the same block. Successive accesses to the same memory block lead to guaranteed cache hits under an evict-on-miss cache replacement policy. Traces are thus collapsed in Altmeyer et al. (2015) to remove all successive accesses to the same memory block. The number of cache misses is not impacted, and cache hits can later be accounted for as an additional contribution to the trace. This last step is not straightforward for multi-path programs, as the number of guaranteed hits varies across different paths.

Conversely, we define the forward reuse distance \(\textit{frd}(e)\) of an element e as the maximum number of possible evictions before the next access to the same block. If its block is not reused before the end of the trace, the forward reuse distance of an access is defined as \(\infty \):

$$\begin{aligned}&\textit{frd}:\mathbb {E}\times \mathbb {T} \rightarrow \mathbb {N} \cup \{ \infty \}\nonumber \\&\textit{frd}(e_i,[e_{i+1},\ldots ,e_m]) = {\left\{ \begin{array}{ll} |\{k \mid i< k< j \wedge e_k \ne e_{k-1}\}| &{} \quad \textit{if}\; e_i = e_j \wedge \\ &{} \quad \forall k: i< k < j, e_i \ne e_k\\ \infty &{} \quad \textit{otherwise} \end{array}\right. }\nonumber \\ \end{aligned}$$
(14)

The probability of \(e_i\) being a hit is set to 0 if, since the last access to the same block, more blocks contend for cache space than the N available lines. This is captured by the cache contention \(\textit{con}(e_i,t)\) (Altmeyer and Davis 2014) of element \(e_i\) in trace t. The definition of \(\hat{P}(e_i^{\textit{hit}})\), which denotes a lower bound on the actual probability \(P(e_i^{\textit{hit}})\) of a cache hit, is as follows:

$$\begin{aligned} \hat{P}(e_i^{\textit{hit}}) = {\left\{ \begin{array}{ll} 0 &{} \quad \textit{con}(e_i,t) \ge N\\ \left( \frac{N-1}{N}\right) ^{rd(e_i,t)}&{} \quad \textit{otherwise} \end{array}\right. } \end{aligned}$$
(15)
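Both metrics and the bound (15) admit a direct implementation; a sketch follows, in which the contention value is taken as an input (its computation is given by (16) and (17) below):

```python
import math

def reuse_distance(trace, i):
    """rd (13): accesses to consecutively different blocks since the last
    access to the block of trace[i], or infinity if there is none."""
    for j in range(i - 1, -1, -1):
        if trace[j] == trace[i]:
            return sum(1 for k in range(j + 1, i) if trace[k] != trace[k - 1])
    return math.inf

def forward_reuse_distance(trace, i):
    """frd (14): the same count, up to the next access to trace[i]'s block."""
    for j in range(i + 1, len(trace)):
        if trace[j] == trace[i]:
            return sum(1 for k in range(i + 1, j) if trace[k] != trace[k - 1])
    return math.inf

def hit_probability_bound(trace, i, N, contention):
    """Lower bound (15) on the hit probability of access trace[i]."""
    rd = reuse_distance(trace, i)
    if rd == math.inf or contention >= N:
        return 0.0
    return ((N - 1) / N) ** rd
```

On the example sequence given below, `reuse_distance` reproduces the reuse distances shown as super-scripts, e.g. 5 for the second access to a.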

The cache contention \(\textit{con}(e)\) (Altmeyer and Davis 2014) of element e captures the number of cache blocks which contend with e for space in the cache. It includes all potential hits and the R relevant blocks, denoted \(\textit{relevant}\_blocks\), since we have to assume they occupy a separate location in the cache. Contention depends on and contributes to the potential hits captured by \(\hat{P}(e_j^{\textit{hit}}), j < i\), and is computed from the first accesses, where \(\textit{rd}(e_i, t) = \infty \), to the last. The contention also accounts for the first miss \(e_r\) which follows the previous access to the same memory block as \(e_i\) and hence contends with \(e_i\). The replacement policy means that \(e_r\) always contends for space. The cache contention is formally defined as:

$$\begin{aligned}&\textit{con} :\mathbb {E} \times \mathbb {T} \rightarrow \mathbb {N} \cup \{\infty \}\nonumber \\&\textit{con}(e_i,t) = {\left\{ \begin{array}{ll} \infty &{} \quad \textit{if}\; \textit{rd}(e_i,t) = \infty \\ \left| \left\{ e_k \mid k \in \textit{con}S(e_i,t) \wedge e_k \notin \textit{relevant}\_\textit{blocks} \right\} \right| + R &{} \quad \textit{otherwise} \end{array}\right. }\nonumber \\ \end{aligned}$$
(16)

with

$$\begin{aligned} \textit{con}S(e_i,t) =&\ \{j \mid e_j \in t \wedge \hat{P}(e_j^{\textit{hit}}) \ne 0 \wedge k< j< i \wedge e_k = e_i \wedge \forall x: k< x< i, e_i \ne e_x \} \nonumber \\&\cup \{r \mid \textit{rd}(e_i,t) \ne 0 \wedge r = \textit{min}(\{x \mid \hat{P}(e_x^{\textit{hit}}) = 0 \wedge k< x< i \wedge e_k = e_i \wedge \forall y: k< y< i, e_i \ne e_y \}) \} \end{aligned}$$
(17)

Example We now illustrate the distinction between cache contention and reuse distance in identifying accesses with a null hit probability in (15). Consider the following sequence of accesses, on a 4 line fully-associative cache, where the reuse distance of each access is given as a super-script:

$$\begin{aligned} a,b,c,b^1,d,f,a^5,b^3,c^5,d^4,f^4 \end{aligned}$$

All second accesses to blocks a, b, c, d, and f have a non-zero chance of being hits when considered in isolation. However, as highlighted in Altmeyer and Davis (2014), these probabilities cannot simply be combined, as the hit probability of a block depends on the behaviour of other blocks; the last 5 accesses of the sequence, each accessing a different block, cannot all hit at the same time on a 4 line cache. The hit probability of an access needs to be set to 0 in (15) if enough blocks have been inserted in the cache since the last access to the same block. Should the reuse distance alone be used to identify whether or not an access is a potential hit, the last occurrences of a, c, d, and f would be considered as misses.

Using cache contention, some accesses are assumed to be potential hits, occupying cache space to the detriment of others. Cache contention captures a specific but possible hit/miss scenario, the occurrence of which is bounded using each access's hit probability in (15). As proven in Altmeyer and Davis (2014), the estimated hit probability of the overall sequence holds. In our example, contention identifies that a, b, and c can be kept in the cache simultaneously. Using the contention as a super-script, we have:

$$\begin{aligned} a,b,c,b^1,d,f,a^2,b^2,c^3,d^4,f^4 \end{aligned}$$

\(c^3\) implies that c may be present in the cache, assuming only three other blocks may have been kept alongside it: a and b as potential cache hits, and d then replaced by f. This assumption regarding d and f is an important difference between contention and the stack distance metric used in LRU cache analysis. Using the stack distance, i.e. the number of different blocks accessed since the last access to c, d and f would be regarded as occupying different lines in the cache, resulting in a guaranteed miss for c. \(d^4\) is classified as a miss: \(a^2\), \(b^2\) and \(c^3\) have been identified as potential hits, and f is a miss resulting in the eviction of the fourth and only cache line where d could be held. \(f^4\) is similarly classified as a miss.

Note that this definition of contention is an improvement on the one proposed in earlier work. Instead of accounting for each access independently, we account for their accessed blocks instead. The reasoning behind this optimisation is that if an accessed block hits more than once, it does not occupy additional lines. In the previous example, b is only accounted for once in the contention of \(a^2\) and \(c^3\). The subtle difference lies in (17) where the blocks \(e_j\) are accounted for instead of each access j individually (\(e_i = e_j\) if they access the same block).

The execution time of an element \(e_i\) can be approximated with the help of the discrete random variable \(\hat{\xi }_{i}\) which has a probability mass function (PMF) defined as:

$$\begin{aligned} \hat{\xi }_{i}(x) = {\left\{ \begin{array}{ll}\hat{P}(e_i^{\textit{hit}}) &{} \text {if } x = \mathcal {H} \\ 1-\hat{P}(e_i^{\textit{hit}}) &{} \text {if } x = \mathcal {M} \\ 0 &{} \textit{otherwise} \end{array}\right. } \end{aligned}$$
(18)

An estimated pWCET (Cucu-Grosjean 2013) distribution \(\hat{\mathcal {D}}\) of a trace is an upper-bound on the execution time distribution \(\mathcal {D}\) induced by the randomised cache for the trace, such that \(\forall v, P(\hat{\mathcal {D}} \ge v) \ge P(\mathcal {D} \ge v)\). In other words, the distribution \(\hat{\mathcal {D}}\) is greater than \(\mathcal {D}\) (López et al. 2008), denoted \(\hat{\mathcal {D}} \ge \mathcal {D}\).

The probability mass functions \(\hat{\xi }_i\) are independent upper-bounds on the behaviour of the corresponding accesses \(e_i\). An estimate for trace t can be derived by combining the probability mass functions \(\hat{\xi }_i\) of each of its composing memory accesses \(e_i\):

$$\begin{aligned} \hat{\mathcal {D}}(t) = \bigotimes _{e_i\in t}\hat{\xi }_i \end{aligned}$$
(19)

where \(\otimes \) represents the convolution of PMFs:

$$\begin{aligned} (\hat{\xi }_i\otimes \hat{\xi }_j)(x) = \sum _{k = -\infty }^{+\infty }\hat{\xi }_i(k) \cdot \hat{\xi }_j(x-k) \end{aligned}$$
(20)
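Under these independent per-access bounds, (18)–(20) reduce to a fold of discrete convolutions. A sketch, with PMFs represented as dicts from latency to probability and placeholder values for \(\mathcal {H}\) and \(\mathcal {M}\):

```python
from functools import reduce

def access_pmf(p_hit, H=1, M=10):
    """PMF (18) of a single access: latency H with probability p_hit, else M."""
    return {H: p_hit, M: 1.0 - p_hit}

def convolve(D1, D2):
    """Convolution (20) of two finite PMFs."""
    out = {}
    for x1, p1 in D1.items():
        for x2, p2 in D2.items():
            out[x1 + x2] = out.get(x1 + x2, 0.0) + p1 * p2
    return out

def contention_estimate(hit_prob_bounds, H=1, M=10):
    """Estimate (19): convolution of the per-access PMFs of a trace."""
    return reduce(convolve, (access_pmf(p, H, M) for p in hit_prob_bounds))
```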

The resulting distribution for non-relevant accesses is independent of the relevant blocks considered in the cache during the collecting analysis step. A worst-case is assumed where the R relevant blocks are always kept in the cache. The distributions resulting from the two analysis steps, collecting and contention, can therefore be soundly convolved: the pWCET of a trace is derived by convolving the execution time distribution produced by the contention approach with that produced by the collecting approach, as derived from \(\hat{\mathcal {D}}_{\textit{miss}}\).

2.3 Discussion: relevance of the model

The SPTA techniques described apply whether the contents of the memory block are instruction(s), data or both. While address computation (Huynh et al. 2011) may not be able to pinpoint the exact target of an access, e.g. for data-dependent requests, relational analysis (Hahn and Grund 2012), introduced in the context of deterministic systems, can be used to identify accesses which map to the same or different sets, and access the same or different block. Two accesses which obey the same block relation can then be replaced by accesses to the same unique element, hence improving the precision of the analysis.

The methods assume that there are no inter-task cache conflicts due to preemption, i.e. a run-to-completion semantics with non-preemptable program execution. Concurrent cache accesses are also precluded, i.e. we assume a private cache or appropriate isolation (Chiou et al. 2000).

In practice, detailed analysis could potentially distinguish between different latencies for each access, beyond \(\mathcal {M}\) and \(\mathcal {H}\), but such precise estimation of the miss latency requires additional analysis steps, e.g. analysis of the main memory (Bourgade et al. 2008). Further, to reduce the pessimism inherent in using a simple bound, particularly for the miss latency, events such as memory refresh can be accounted for as part of higher level schedulability analyses (Atanassov and Puschner 2001; Bhat and Mueller 2011).

3 Comparing cache contents

The execution time distribution of a trace in our model depends solely on the behaviour of the cache. The contribution of a cache state to the execution time of a trace thus depends only on its initial contents. The characterisation of the relation between the initial contents of different caches allows for a comparison of their temporal contribution to the same trace. This section introduces properties and conditions that allow this comparison. They are used in later techniques to improve the selection of cache contents on path convergence, and to identify paths with the worst impact on execution time.

An N-tuple represents the concrete contents of an N-way cache, such that each element corresponds to the block held by a single line. The symbol \(\_\) is used to denote an empty line. For each such concrete cache s, there is a corresponding abstract cache contents C which holds the exact same set of blocks. C might also capture uncertainty regarding the contents of some lines.

Given cache state \(s = \langle l_1,\ldots ,l_N \rangle \), \(s[l_i=b]\) represents the replacement of memory block or line \(l_i\) in the cache by memory block b. Note that b can only be present once in the cache, \(b \in s \Rightarrow s[l_i = b] = s\). \(s[-l_i]\) is a shorthand for \(s[l_i=\_]\) and identifies the eviction of memory block \(l_i\) from the cache. \(s[l_i=b][l_j=e]\) denotes a sequence of replacements where b first replaces \(l_i\) in s, then e replaces \(l_j\). Two cache states s and \(s'\), although not strictly identical, may exhibit the same behaviour if they hold the exact same contents, e.g. \(\langle a,\_\rangle \) and \(\langle \_,a \rangle \) are represented using the same abstract contents \(\{a\}\). Under the evict-on-miss random replacement policy, there is no correlation between the physical and logical position of a block with respect to the eviction policy.

We distinguish the execution time distribution of trace t using input cache state s with the notation \(\mathcal {D}(t,s)\). The execution time distribution of the sequence [[b], t], the concatenation of access [b] to trace t, can be expressed as follows:

$$\begin{aligned} \mathcal {D}([[b],t], s=\langle l_1,\ldots ,l_N \rangle ) = {\left\{ \begin{array}{ll} \mathcal {H} + \mathcal {D}(t,s) &{} \quad \text {if}\;b\in s \\ \mathcal {M} + \sum \limits _{i\in [1,N]} \frac{1}{N} \cdot \mathcal {D}(t,s[l_i=b]) &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(21)

where the sum of distributions and the product of a distribution with \(\frac{1}{N}\) are defined as per (6)–(8), and \(\mathcal {L} + \mathcal {D}\) denotes the shift of distribution \(\mathcal {D}\) by latency \(\mathcal {L}\), \((\mathcal {L} + \mathcal {D})(x) = \mathcal {D}(x-\mathcal {L})\). Upon a hit, the input cache state s is left unchanged, while evictions occur to make room for the accessed block upon a miss.
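For intuition, (21) can be evaluated exactly by recursion over the trace. The sketch below represents a concrete cache state as a tuple of N lines (None for an empty line); its cost is exponential in the length of the trace, so it serves as a specification rather than an analysis:

```python
def pET(trace, s, H=1, M=10):
    """Exact execution time distribution D(t, s) of (21), as a dict
    from execution time to probability."""
    if not trace:
        return {0: 1.0}
    b, rest = trace[0], trace[1:]
    N = len(s)
    if b in s:                                  # hit: latency H, s unchanged
        return {x + H: p for x, p in pET(rest, s, H, M).items()}
    dist = {}
    for i in range(N):                          # miss: victim i with prob. 1/N
        tail = pET(rest, s[:i] + (b,) + s[i + 1:], H, M)
        for x, p in tail.items():
            dist[x + M] = dist.get(x + M, 0.0) + p / N
    return dist
```

For instance, `pET(('a', 'b', 'a'), (None, None))` enumerates both possible placements of a and b on a 2-way cache and weights the resulting latencies accordingly.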

The extension of this definition to the concatenation of traces requires the identification of the outcomes of an execution, i.e. the cache state C corresponding to each possible sequence of events, along with its occurrence probability P and execution time distribution \(\mathcal {D}\):

$$\begin{aligned} \mathcal {D}([t_p,t_s], s) = \sum _{( C,P,\mathcal {D} ) \in \textit{outcomes}(t_p,s)} P \cdot \left( \mathcal {D} \otimes \mathcal {D}(t_s, C)\right) \end{aligned}$$
(22)

where \(\textit{outcomes}(t_p,s)\) is the set of cache states produced by the execution of \(t_p\) from input cache state s and \(\otimes \) is the convolution of distributions.

Theorem 1

The eviction of a block from any input cache state s cannot decrease the execution time distribution of any trace t, \(\mathcal {D}(t,s) \le \mathcal {D}(t,s[-e])\).

Proof

See Appendix.\(\square \)

Corollary 1

In the context of evict-on-miss randomised caches, for any trace, the empty state is the worst initial state over any other input cache state s, \(\mathcal {D}(t,s) \le \mathcal {D}(t,\emptyset )\).

The eviction of a block might trigger additional misses, resulting in a distribution that is no less than the one where the cache contents are left untouched. This provides evidence that the assumption, upon a non-relevant access, that a block in the cache is evicted, as per the update function in (3)–(5), is sound. Similarly, the replacement of a block in the cache might trigger additional misses, but might also result in additional hits upon reuse of the replacing block. The impact of such behaviour is however bounded.

Theorem 2

The replacement of a random block in cache triggers at most one additional hit.

The distribution for any trace t from any cache state s is upper-bounded by the distribution for trace t after the replacement of a random block in s and assuming a single hit turns into a miss.

$$\begin{aligned} \mathcal {H} + \mathcal {D}(t,s) \le \mathcal {M} + \sum _{i \in [1,N]} \frac{1}{N} \cdot \mathcal {D}(t,s[l_i=e]) \end{aligned}$$
(23)

Proof

See Appendix.\(\square \)

The block selected for eviction impacts the likelihood of additional latencies being suffered during the execution of the subsequent trace. Intuitively, the closer the evicted block is to reuse, the worse the impact of the eviction. We use the forward reuse distance of blocks at the beginning of trace t, \(\textit{frd}(b,t)\) as defined in (14), to identify the blocks which are closer to reuse than others.

Theorem 3

The replacement of a block in input cache state s by one which is reused later in trace t cannot result in a decreased execution time distribution: \(\textit{frd}(b,t) \le \textit{frd}(e,t) \le \infty \wedge b \in s \wedge e \notin s \Rightarrow \mathcal {D}(t,s) \le \mathcal {D}(t,s[b=e]) \)

Proof

See Appendix.\(\square \)

4 Application of SPTA to multi-path programs

In this section, we improve upon the state-of-the-art SPTA techniques for traces (Altmeyer and Davis 2014) recapitulated in Sect. 2 and present methods for multi-path programs, that is complete control-flow graphs. A naive approach would be to compute all possible traces \(\mathcal {T}\) of a task, analyse each independently and combine their distributions. However, there are two significant problems with such an approach.

Firstly, while the merge operation (6)–(8) could be used to provide a weighted combination given the probability of each path being taken at runtime, such assumptions about path probabilities do not hold in general. This issue can however be resolved by taking the maximum of the resulting execution-time distributions across traces:

$$\begin{aligned} \bigodot _{t \in \mathcal {T}} \mathcal {D}(t) \end{aligned}$$
(24)

where we define the \(\odot \) operation as follows

$$\begin{aligned}&\odot :\left( \left( \mathbb {N} \rightarrow \mathbb {R} \right) \times \left( \mathbb {N} \rightarrow \mathbb {R} \right) \right) \rightarrow \left( \mathbb {N} \rightarrow \mathbb {R} \right) \end{aligned}$$
(25)
$$\begin{aligned}&\quad \mathcal {D}_a \odot \mathcal {D}_b := \mathcal {D}^H \end{aligned}$$
(26)

with

$$\begin{aligned} \mathcal {D}^H(x) = \max \left( \sum _{y \ge x}\mathcal {D}_a(y) - \sum _{y> x}\mathcal {D}^H(y), \sum _{y \ge x}\mathcal {D}_b(y) - \sum _{y > x}\mathcal {D}^H(y), 0\right) \end{aligned}$$
(27)

The \(\odot \) operator computes the least upper-bound of the complementary cumulative distributions (1-CDFs) of all its operands (similar to the upper-bound depicted in Fig. 1), a maximum of distributions which is valid irrespective of the path executed at runtime. By construction, the following properties hold:

$$\begin{aligned}&\mathcal {D}_a \odot \mathcal {D}_b \ge \mathcal {D}_a \wedge \mathcal {D}_a \odot \mathcal {D}_b \ge \mathcal {D}_b \end{aligned}$$
(28)
$$\begin{aligned}&\quad \mathcal {D}_a \le \mathcal {D}_b \Rightarrow \mathcal {D}_a \odot \mathcal {D}_b = \mathcal {D}_b \end{aligned}$$
(29)
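A sketch of \(\odot \) for finite distributions follows (27) literally, working down from the largest value in the joint support:

```python
def F(D, x):
    """1-CDF of a finite PMF D at x: P(D >= x)."""
    return sum(p for v, p in D.items() if v >= x)

def odot(Da, Db):
    """Maximum of distributions (25)-(27): the PMF whose 1-CDF is the
    least upper-bound of the 1-CDFs of Da and Db."""
    H, tail = {}, 0.0                  # tail accumulates sum_{y > x} H(y)
    for x in sorted(set(Da) | set(Db), reverse=True):
        H[x] = max(F(Da, x) - tail, F(Db, x) - tail, 0.0)
        tail += H[x]
    return {x: p for x, p in H.items() if p > 0}
```

At every support point, `F(odot(Da, Db), x)` equals `max(F(Da, x), F(Db, x))`, from which the properties (28) and (29) follow directly.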

Secondly, the number of distinct traces is exponential in the number of control flow divergences, conditional constructs and loop iterations, which means that this naive approach is computationally intractable. A standard data-flow analysis is also problematic, since it is not possible to assign to each instruction a corresponding contribution to the execution time distribution.

Our analysis on control-flow graphs resolves these problems. It relies on the collecting and the contention approaches for relevant and non-relevant blocks respectively, as per the cache collecting approach on traces given by Altmeyer and Davis (2014). First, the loops in the control-flow graph are unrolled. This allows the subsequent steps, namely the computation of cache contention, the identification of relevant blocks, and cache collection, to be performed as simple forward traversals of the control-flow graph. Approximation of the possible incoming states on path convergence keeps the analysis tractable. Finally, the contention and collecting distributions are combined using convolution.

4.1 Program representation

We represent the possible paths in a program using a control-flow graph (CFG), that is a directed graph \(G = (V,L,v_s,v_e)\) with a finite set V of nodes, a set \(L \subseteq V \times V\) of edges, a start node \(v_s \in V\) and an end node \(v_e \in V\). Each node v corresponds to an element in \(\mathbb {E}\) accessed at node v. A path \(\pi \) from node \(v_{1}\) to node \(v_{k}\) is a sequence of nodes \(\pi = [v_1,v_2, \dots , v_{k-1}, v_{k}]\) where \(\forall i:(v_{i}, v_{i+1}) \in L \) and defines a corresponding trace. By extension, \([\pi , \pi ']\) denotes the path composed of path \(\pi \) followed by path \(\pi '\). Given a set of nodes \(V'\), the symbol \(\varPi (V')\) denotes the set of all paths with nodes that are included exclusively in \(V'\), and \(\varPi (G) \subseteq \varPi (V)\) the set of all paths of CFG G from \(v_s\) to \(v_e\). Similarly to traces, the pWCET \(\hat{\mathcal {D}}(G)\) of a program is the least upper-bound on the execution time distributions (pET) of all possible paths. Hence, \(\forall \pi \in \varPi (G), \hat{\mathcal {D}}(G) \ge \mathcal {D}(\pi )\). Figure 1 illustrates this relation using the 1-CDF \((F(x) = P(\mathcal {D} \ge x))\) of different execution time distributions and a valid pWCET.

Fig. 1 Relation between the execution time distribution of different paths (pET) and the pWCET of a program

We say that a node \(v_d\) dominates \(v_n\) in the control-flow graph G if every path from the start node \(v_s\) to \(v_n\) goes through \(v_d\), \(v_s \rightarrow ^* v_n = v_s \rightarrow ^* v_d \rightarrow ^* v_n\), where \(v_s \rightarrow ^* v_d \rightarrow ^* v_n\) is the set of paths from \(v_s\) to \(v_n\) through \(v_d\). Similarly, a node \(v_p\) post-dominates \(v_n\) if every path from \(v_n\) to the end node \(v_e\) goes through \(v_p\), \(v_n \rightarrow ^* v_e = v_n \rightarrow ^* v_p \rightarrow ^* v_e\). We refer to the set of dominators and post-dominators of node \(v_n\) as \(\textit{dom}(v_n)\) and \(post\text {-}{} \textit{dom}(v_n)\) respectively.

We assume that the program always terminates. Bounded recursion and loop iterations are requirements to ensure this termination property of the analysed application. The additional restrictions described below are for the most part tied to the WCET analysis framework (Wilhelm et al. 2008) and not exclusive to the new method. These are reasonable assumptions for the software in critical real-time systems.

Any cycle in the CFG must be part of a natural loop. We define a natural loop \(l=(v_h, V_l)\) in G with a header \(v_h \in V\) and a finite set of nodes \(V_l \subseteq V\). Considering the example in Fig. 2, b is the head of the loop composed of accesses \(V_l = \{b,d,c,e\}\). The header is the single entry-point of the loop, \(\forall v_n \in V_l, v_h \in \textit{dom}(v_n)\). Conversely, a natural loop may exhibit multiple exits, e.g. as a result of break constructs. Loop l contains at least one back edge to \(v_h\), an edge whose end is a dominator of its source, \(\exists v_b \in V_l, (v_b,v_h)\in L\). All nodes in the loop can reach one of its back edges without going through the header \(v_h\). The transition from the header \(v_h\) of loop l to one of its nodes \(v_n \in V_l\) begins an iteration of the loop. The maximum number of consecutive iterations of each loop, iterations which are not separated by the traversal of a node outside \(V_l\), is assumed to be upper-bounded by \(\textit{max-iter}(l,\textit{ctx})\). The value of \(\textit{max-iter}(l,\textit{ctx})\) might change depending on the context \(\textit{ctx}\), call stack and loop iteration, of loop l, e.g. to capture triangular loops. This guarantees a finite number of paths in the program.

Fig. 2 Simple do-while loop structure with an embedded conditional. b is the loop head, with its body comprising \(\{b,c,d,e\}\) and the e to b edge as the back-edge. e and c are both valid exits

Calls are also subject to a small set of restrictions to guarantee the termination of the program. Recursion is assumed to be bounded, that is, cycles or repetitions in the call graph of the analysed application must have a maximum number of iterations, as must loops in the control flow. Function pointers can be represented as multiple targets attached to a single call. Here, the set of target functions must be exact or an over-estimate of the actual ones, so as to avoid unsound estimates which do not take all valid paths into account.

4.2 Complete loop unrolling

In the first analysis step, we conceptually transform the control-flow graph into a directed acyclic graph by loop unrolling and function inlining (Muchnick 1997). In contrast to the naive approach of enumerating all possible traces, analysis through complete loop unrolling has linear rather than exponential complexity in the number of loop iterations.

Loop unrolling and function inlining are well-known techniques to improve the precision of data-flow analyses. A complete physical unrolling that removes all back-edges significantly increases the size of the control-flow graph. A virtual unrolling and inlining is instead performed during analysis, such that calls and iterations are processed as required by the control flow. The analysis then distinguishes the different call and iteration contexts of a vertex. In either case, the size of the graph explored during analysis, and hence the complexity, scale with the number of accesses in the program under consideration.

Unrolling simplifies the analysis and significantly improves its precision. As opposed to state-of-the-art analyses for deterministic replacement policies (Alt et al. 1996), the analysis of random caches through cache state enumeration does not rely on the computation of a fixpoint. The abstract domain for the analysis grows by nature with every access, since it includes the estimated distribution of misses. Successive iterations increase the probability of blocks in the loop's working set being in the cache, and in turn increase the likelihood of hits in the next iteration. The exhaustive analysis, if not supplemented by other methods, must process all accesses in the program.

We assume in the following that unrolling is performed on all analysed programs. Section 6.4.2 discusses preliminary work to bypass this restriction. The analysis of large loops, with many predicted iterations, can be broken down into the analysis of a single iteration or groups thereof, provided a sound upper-bound of the input state is used. The contributions of the different segments are then combined to compute that of the complete loop or program. Such an upper-bound input can be derived, for example, using cache state compression (Griffin et al. 2014a) to remove low-value information. The definition of techniques to exploit the resulting trade-off between precision and analysis complexity is left as future work.

4.3 Reuse distance/cache contention on CFG

To extend the concept of reuse distance to control-flow graphs, we lift the definition from a single trace to all traces and take the maximal reuse distance of all possible traces ending in the node v:

$$\begin{aligned}&rd^G:\mathbb {V} \rightarrow \mathbb {N} \cup \{ \infty \} \end{aligned}$$
(30)
$$\begin{aligned}&\quad \textit{rd}^G(v) = \max _{\pi = [v_s,\ldots ,v]}\left( rd(v,\pi )\right) \end{aligned}$$
(31)

The cache contention is extended accordingly:

$$\begin{aligned}&\textit{con}^G:\mathbb {V} \rightarrow \mathbb {N} \end{aligned}$$
(32)
$$\begin{aligned}&\quad \textit{con}^G(v) = \max _{\pi = [v_s,\ldots ,v]}\left( con(v,\pi )\right) \end{aligned}$$
(33)

An upper-bound on both metrics for each access can be computed through a forward data-flow analysis. The reuse distance analysis uses the maximum of the possible values on path convergence. Similarly, we lift the definition of the forward reuse distance to control-flow graphs; it can be computed through a backward data-flow analysis. The contention for each block at each point in the program is computed through a forward data-flow analysis. The computation of the contention relies on the estimation of the set of contending cache blocks. Its analysis domain is more complex than that of the reuse distance, as different sets of contending blocks may arise on different paths. The analysis tracks all such sets from incoming paths, as long as they are consistent with a potential cache hit, i.e. all sets are smaller than the associativity of the cache, and not included in each other, i.e. one does not upper-bound the other.
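On the unrolled, acyclic graph, \(\textit{rd}^G\) can be stated directly as a maximisation over paths. The naive enumeration below (reusing `reuse_distance` from the earlier sketch) expresses the definition (31); the analysis proper computes the same bound as a forward data-flow with per-block maxima as the join:

```python
import math

def rd_G(succ, v_s, v, block):
    """rd^G (31) on an acyclic CFG: maximum of rd(v, pi) over all paths
    from v_s to v. succ maps a node to its successors; block maps a node
    to the element it accesses. Exponential; for illustration only."""
    results = []

    def walk(n, path):
        path.append(block[n])
        if n == v:
            results.append(reuse_distance(path, len(path) - 1))
        else:
            for m in succ.get(n, []):
                walk(m, path)
        path.pop()

    walk(v_s, [])
    return max(results) if results else math.inf  # inf if v is unreachable
```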

We then traverse the unrolled control-flow graph in reverse post-order, compute the distributions with the contention-based approach, and use the maximum distribution on path convergence, with the maximum operator \(\odot \) as the join operator.

4.4 Selection of relevant blocks

The selection of relevant blocks in Altmeyer and Davis (2014) also needs to be modified to accommodate a control-flow graph. Cache state enumeration is only performed for relevant accesses, ensuring more precise analysis results for the selected accesses. Earlier work (Altmeyer and Davis 2014) relied on an absolute set of R relevant blocks for the whole trace. Instead, we restrict ourselves to at most R relevant blocks at any point in the program. Given a position in the control flow, the heuristic tracks the R blocks with the shortest lifespan, i.e. the shortest distance between their last and next access. Such accesses are among the most likely to be kept in the cache and benefit from a precise estimate of their hit probability through state enumeration. Note that this heuristic relies on a lower bound on the lifespan of blocks instead of an upper bound.

The R blocks with the smallest lifespan are analysed using the collecting semantics, as they are the most likely to be kept in the cache. For each such block b, the access prior to b must ensure its insertion in the cache during analysis. As such, this access needs to be marked as relevant, included in the \(\textit{relevant}\_\textit{accesses}\) set, and excluded from accesses contributing to contention. The computation of cache contention is modified to account for relevant accesses instead of blocks:

$$\begin{aligned} \textit{con}(e_i,t) = {\left\{ \begin{array}{ll} \infty &{} \textit{if}\; \textit{rd}(e_i,t) = \infty \\ \left| \left\{ e_k \vert k \in \textit{con}S(e_i,t) \wedge k \notin \textit{relevant}\_\textit{accesses} \right\} \right| + R&{} \textit{otherwise} \end{array}\right. } \end{aligned}$$
(34)

4.5 Approximation of cache states

We assume no information about the probability of taking one path or another, hence the join operator must combine cache states in such a way that the resulting state is an over-approximation of all incoming paths, i.e. it contains the same or degraded information. To capture this property, we introduce the partial ordering \(\sqsubseteq \) between a cache state and a set thereof such that \(s \sqsubseteq S_b\) implies that \(S_b\) holds more pessimistic information than s, resulting in more pessimistic timing estimates. We overload this operator to relate sets of cache states where \(S_a \sqsubseteq S_b\) implies that \(S_b\) holds more pessimistic information than \(S_a\). More formally, the \(\sqsubseteq \) notation (Peleska and Löding 2008) identifies \(S_b\) as an upper-bound of \(S_a\) in \(2^{\mathbb {CS}}\).

Consider a simple cache state \(s = (\{a,b\},0.5,\mathcal {D})\). Intuitively, the information represented by \(s_{a} = (\{a\},0.5,\mathcal {D})\) is more pessimistic than that captured by s, \(s\sqsubseteq {s_a}\). Conversely, \(s_c = (\{a,c\},0.5,\mathcal {D})\) holds less pessimistic information regarding c, so \(s \not \sqsubseteq s_c\). The set \(S = \{ (\{a\},0.25,\mathcal {D}), (\{b\},0.25,\mathcal {D}) \}\) also approximates s, \(s \sqsubseteq S\); the knowledge that a and b are both present in the cache (s) is reduced to guarantees only about the presence of either a or b in S. As a consequence, the sequence of accesses abab will trigger more misses starting from states in S than from state s. Assuming \(\mathcal {D} < \mathcal {D}'\), then \(s' = (\{a,b\},0.5,\mathcal {D}')\) holds more pessimistic information than s, \(s \sqsubseteq s'\).

The intuition behind the approximation of a cache state is that the information it captures is further diluted into a single cache state or a set of cache states. The relation \(s \sqsubseteq S\) holds if the set of cache states S approximates cache state \(s = (C,P,\mathcal {D})\). In other words, (i) S is as likely to occur as s, (ii) all blocks known to be in states of S are present in s, and (iii) the contribution of S to the pWCET is greater than or equal to the contribution \(\mathcal {D}\) of s. We formally define \(s \sqsubseteq S\) as follows:

$$\begin{aligned} (C,P,\mathcal {D}) \sqsubseteq S \Rightarrow \left( P = \sum _{(C',P',\mathcal {D}') \in S} P'\right) \wedge \left( \forall (C',P',\mathcal {D}') \in S,\; C \supseteq C' \wedge \mathcal {D} \le \mathcal {D}' \right) \end{aligned}$$
(35)
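Read as a predicate, (35) can be sketched as follows; since the ordering of distributions is left abstract here, the comparison ge is a parameter:

```python
def approximates(s, S, ge, tol=1e-12):
    """Check s ⊑ S as per (35). s is a (contents, probability,
    distribution) triple and S a list of such triples; ge(Dp, D)
    must hold when distribution Dp upper-bounds D."""
    C, P, D = s
    if abs(P - sum(Pp for _, Pp, _ in S)) > tol:
        return False  # (i) S must carry the same probability mass as s
    # (ii) contents in S are subsets of C; (iii) distributions no smaller
    return all(set(Cp) <= set(C) and ge(Dp, D) for Cp, _, Dp in S)
```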

By extension, the over-approximation of a set of cache states is the composition of approximations \(F(s) \in 2^\mathbb {CS}\) of each element s in the set. We formally define the \(\sqsubseteq \) partial ordering between sets of cache states \(S_a \in 2^\mathbb {CS}\) and \(S_b \in 2^\mathbb {CS}\) as follows:

$$\begin{aligned} S_a \sqsubseteq S_b \Rightarrow \exists F:\mathbb {CS} \rightarrow 2^{\mathbb {CS}}, \left( \forall s \in S_a, s \sqsubseteq F(s)\right) \wedge S_b = \biguplus _{s\in S_a} F(s) \end{aligned}$$
(36)

A join function \(\sqcup \) is valid if, given any sets of cache states \(S_a \in 2^{\mathbb {CS}}\) and \(S_b \in 2^{\mathbb {CS}}\), \(S_a \sqsubseteq (S_a \sqcup S_b)\) and \(S_b \sqsubseteq (S_a \sqcup S_b)\). An optimal join function \(\sqcup \) should return the least upper-bound of its parameters, i.e. the smallest set of states which upper-bounds all its inputs. Our definition of the \(\sqsubseteq \) operator is however independent of the executed path: \(S_a\) and \(S_b\) may admit multiple upper-bounds that are incomparable to each other. The definition of an optimal join function would require a more complete ordering, taking into account the upcoming sequence of accesses to order sets of cache states depending on the likelihood that their contents are reused. Even then, optimality would be challenged in multi-path applications where different paths stem from the join.

To prove that over-approximation results in more pessimistic timing estimates, we derive the execution time distribution of a trace t using a set of input cache states S from its definition for a single state and for the concatenation of traces, respectively in (21) and (22):

$$\begin{aligned} \mathcal {D}(t, S) = \sum _{( C',P',\mathcal {D}' ) \in S} P' \cdot \left( \mathcal {D}' \otimes \mathcal {D}(t, C')\right) \end{aligned}$$
(37)

where the sum of distributions and the product of a distribution with a probability P are defined as per (6)–(8), and \(\otimes \) is the convolution of distributions.
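With distributions represented as mappings from execution times to probabilities, (37) can be sketched as below; trace_dist(C) stands for \(\mathcal {D}(t, C)\), the distribution of the trace from initial contents C, which the surrounding analysis must supply:

```python
def convolve(d1, d2):
    """Convolution (the ⊗ operator) of two execution time
    distributions given as dicts mapping time -> probability."""
    out = {}
    for x, px in d1.items():
        for y, py in d2.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out


def contribution(trace_dist, S):
    """D(t, S) as per (37): the probability-weighted sum, over the
    input cache states, of their accumulated distribution convolved
    with the distribution of the trace from their contents."""
    total = {}
    for C, P, D in S:
        for x, p in convolve(D, trace_dist(C)).items():
            total[x] = total.get(x, 0.0) + P * p
    return total
```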

The definition of over-approximations and their contribution to the execution time distribution of a trace relies on the merge \(\uplus \) and convolution \(\otimes \) operators, defined respectively in (6)–(8) and (20). Both offer properties used in the evaluation of the contribution of their operands: the convolution operator preserves the relative ordering between its inputs, and the merge operation adds the contributions of its operands.

Lemma 1

The convolution operation preserves the ordering between execution time distributions:

$$\begin{aligned} \mathcal {D} \le \mathcal {D}' \Rightarrow \mathcal {D} \otimes \mathcal {A} \le \mathcal {D}' \otimes \mathcal {A} \end{aligned}$$

Proof

See Appendix.\(\square \)

Lemma 2

The contribution of merged sets of cache states S and A is the sum of their individual contributions:

$$\begin{aligned} \forall t, \mathcal {D}(t, S) + \mathcal {D}(t, A) = \mathcal {D}(t, S \uplus A) \end{aligned}$$

Proof

See Appendix.\(\square \)

Theorem 4

The over-approximation \(S_b\) of a set of cache states \(S_a\) holds more pessimistic information than \(S_a\),

$$\begin{aligned} \forall t, S_a \sqsubseteq S_b \Rightarrow \mathcal {D}(t, S_a) \le \mathcal {D}(t, S_b) \end{aligned}$$

Proof

The relation between \(S_b\) and \(S_a\), defined in (36), implies the existence of an approximation function F for the cache states in \(S_a\) such that:

$$\begin{aligned} (\forall s \in S_a, s \sqsubseteq F(s)) \wedge S_b = \biguplus _{s \in S_a} F(s) \end{aligned}$$
(38)

From (38) and (35), we know that each cache contents \(C'\) in an element \((C',P',\mathcal {D}')\) of the approximation F(s) is included in the contents C of cache state \(s = (C,P,\mathcal {D})\). \(C'\) can thus be derived by evicting blocks from C. From Theorem 1 we can infer:

$$\begin{aligned} \forall (C, P, \mathcal {D}) \in S_a, \forall (C', P', \mathcal {D}') \in F((C,P,\mathcal {D})),\; \mathcal {D}(t, C) \le \mathcal {D}(t, C') \end{aligned}$$
(39)

From Lemma 1, we can convolve both sides of the inequality with the same distribution \(\mathcal {D}\):

$$\begin{aligned} \forall (C, P, \mathcal {D}) \in S_a, \forall (C', P', \mathcal {D}') \in F((C,P,\mathcal {D})),\; \mathcal {D} \otimes \mathcal {D}(t, C) \le \mathcal {D} \otimes \mathcal {D}(t, C') \end{aligned}$$
(40)

Approximate distributions \(\mathcal {D}'\) in F(s) are also by definition greater than their counterpart \(\mathcal {D}\) in s. We can similarly factor \(\mathcal {D}(t,C)\) into both sides of inequality \(\mathcal {D} \le \mathcal {D}'\):

$$\begin{aligned} \forall (C, P, \mathcal {D}) \in S_a, \forall (C', P', \mathcal {D}') \in F((C,P,\mathcal {D})),\; \mathcal {D} \otimes \mathcal {D}(t, C') \le \mathcal {D}' \otimes \mathcal {D}(t, C') \end{aligned}$$
(41)

By transitivity of the \(\le \) operator, we can compare the contributions to the execution time distribution of \(s = (C,P,\mathcal {D})\) and each of the corresponding approximations in \(F((C,P,\mathcal {D}))\). That is, a comparison between the leftmost term in (40) and the rightmost term in (41) through \(\mathcal {D} \otimes \mathcal {D}(t, C')\):

$$\begin{aligned} \forall (C, P, \mathcal {D}) \in S_a, \forall (C', P', \mathcal {D}') \in F((C,P,\mathcal {D})),\; \mathcal {D} \otimes \mathcal {D}(t, C) \le \mathcal {D}' \otimes \mathcal {D}(t, C') \end{aligned}$$
(42)

We multiply both sides of the inequality by the positive occurrence probability \(P'\):

$$\begin{aligned} \forall (C, P, \mathcal {D}) \in S_a, \forall (C', P', \mathcal {D}') \in F((C,P,\mathcal {D})),\; P' \cdot (\mathcal {D} \otimes \mathcal {D}(t, C)) \le P' \cdot (\mathcal {D}' \otimes \mathcal {D}(t, C')) \end{aligned}$$
(43)

The property holds for each approximation in F(s) and can be extended to their sum:

$$\begin{aligned} \forall (C, P, \mathcal {D}) \in S_a,\; \sum _{(C', P', \mathcal {D}') \in F((C,P,\mathcal {D}))} P' \cdot (\mathcal {D} \otimes \mathcal {D}(t, C)) \le \sum _{(C', P', \mathcal {D}') \in F((C,P,\mathcal {D}))} P' \cdot (\mathcal {D}' \otimes \mathcal {D}(t, C')) \end{aligned}$$
(44)

From (35) and (38), a state \(s \in S_a\) has the same occurrence probability as its approximation F(s):

$$\begin{aligned} \forall (C, P, \mathcal {D})\in S_a,\; P \cdot (\mathcal {D} \otimes \mathcal {D}(t, C)) \le \sum _{(C', P', \mathcal {D}') \in F((C,P,\mathcal {D}))} P' \cdot (\mathcal {D}' \otimes \mathcal {D}(t, C')) \end{aligned}$$
(45)

Both terms of the inequality correspond to the contribution of a set of cache states to the execution time distribution of trace t as per (37):

$$\begin{aligned} \forall (C, P, \mathcal {D})\in S_a,\; P \cdot (\mathcal {D} \otimes \mathcal {D}(t, C)) \le \mathcal {D}(t, F((C,P,\mathcal {D}))) \end{aligned}$$
(46)

The property holds for any cache state \(s \in S_a\) and can be extended to their sum such that:

$$\begin{aligned} \sum _{(C,P,\mathcal {D}) \in S_a} P \cdot (\mathcal {D} \otimes \mathcal {D}(t, C)) \le \sum _{s \in S_a} \mathcal {D}(t, F(s)) \end{aligned}$$
(47)

From Lemma 2, the inequality also holds for the merge \(\uplus \) across \(S_a\) of the approximations F(s):

$$\begin{aligned} \sum _{(C,P,D) \in S_a} P \cdot (\mathcal {D} \otimes \mathcal {D}(t, C)) \le \mathcal {D}\left( t, \biguplus _{s \in S_a} F(s)\right) \end{aligned}$$

By definition of \(S_b\) in (38) and the application of (37) to \(S_a\), we conclude that:

$$\begin{aligned} \forall t \in \mathbb {T}, \mathcal {D}(t, S_a) \le \mathcal {D}(t, S_b) \end{aligned}$$

\(\square \)

The \(\sqsubseteq \) relation defines a partial ordering between two sets of cache states \(S_a\) and \(S_b\): \(S_a \sqsubseteq S_b\) implies that \(S_b\) holds more pessimistic information than \(S_a\), i.e. the execution of any trace from \(S_b\) results in a larger execution time distribution than the execution of the same trace from \(S_a\). This provides sufficient grounds for the definition of a sound join operation, one that upper-bounds the upcoming contribution of cache states coming from different paths.

5.6 Join operation for cache collecting

We traverse the (directed acyclic) graph in reverse post-order and compute the set of cache states at each program point. The join operator \(\bigsqcup \) describes the combination of two data-flow states from two different sub-paths.

Let \(S_a\) and \(S_b\) be the sets of cache states from the two merging paths. We first define the set of common memory blocks \(\mathbb {M}^{S_a \cap S_b}\), and then restrict \(S_a\) and \(S_b\) to this set:

$$\begin{aligned} \mathbb {M}^{S_a \cap S_b} = \left( \bigcup _{(C_a,P_a,\mathcal {D}_a) \in S_a} C_a \right) \cap \left( \bigcup _{(C_b,P_b,\mathcal {D}_b) \in S_b} C_b \right) \end{aligned}$$
(48)
$$\begin{aligned} S'_a = \biguplus \{(C_a \cap \mathbb {M}^{S_a \cap S_b}, P_a, \mathcal {D}_a) \,\vert \, (C_a,P_a,\mathcal {D}_a) \in S_a\} \end{aligned}$$
(49)
$$\begin{aligned} S'_b = \biguplus \{(C_b \cap \mathbb {M}^{S_a \cap S_b}, P_b, \mathcal {D}_b) \,\vert \, (C_b,P_b,\mathcal {D}_b) \in S_b\} \end{aligned}$$
(50)

\(S_a'\) and \(S_b'\) are safe over-approximations of \(S_a\) and \(S_b\) respectively. They only contain memory blocks common to both sets of cache states, which can therefore be included in the joined set of cache states.

The set H contains all cache states common to both sets \(S'_a\) and \(S'_b\), with the minimum of the probabilities \(P_a\) and \(P_b\), and a miss distribution given by the maximum of the individual distributions \(\mathcal {D}_a\) and \(\mathcal {D}_b\):

$$\begin{aligned} H =\{(C, \min (P_a,P_b), \mathcal {D}_a\odot \mathcal {D}_b) \vert (C,P_a,\mathcal {D}_a) \in S'_a \wedge (C,P_b,\mathcal {D}_b) \in S'_b \wedge C \ne \emptyset \} \end{aligned}$$
(51)

We need to collect the remaining cache states that are (i) contained in \(S'_a\) but not in \(S'_b\), or (ii) are common to both sets, but have a higher probability in \(S'_a\) than in \(S'_b\):

$$\begin{aligned} \begin{array}{rl} \hat{H}_a = & \{(\emptyset , P_a, \mathcal {D}_a) \,\vert \, (C,P_a,\mathcal {D}_a) \in S'_a \wedge C \ne \emptyset \wedge \not \exists (P_b, \mathcal {D}_b), (C,P_b,\mathcal {D}_b) \in S'_b \} \\ \uplus & \{(\emptyset , P_a - P_b, \mathcal {D}_a) \,\vert \, (C,P_a,\mathcal {D}_a) \in S'_a \wedge (C,P_b,\mathcal {D}_b) \in S'_b \wedge C \ne \emptyset \wedge P_a > P_b\} \\ \uplus & \{(\emptyset ,P,\mathcal {D}) \,\vert \, (\emptyset ,P,\mathcal {D}) \in S'_a\} \end{array} \end{aligned}$$
(52)
$$\begin{aligned} \begin{array}{rl} \hat{H}_b = & \{(\emptyset , P_b, \mathcal {D}_b) \,\vert \, (C,P_b,\mathcal {D}_b) \in S'_b \wedge C \ne \emptyset \wedge \not \exists (P_a, \mathcal {D}_a), (C,P_a,\mathcal {D}_a) \in S'_a \} \\ \uplus & \{(\emptyset , P_b - P_a, \mathcal {D}_b) \,\vert \, (C,P_b,\mathcal {D}_b) \in S'_b \wedge (C,P_a,\mathcal {D}_a) \in S'_a \wedge C \ne \emptyset \wedge P_b > P_a\} \\ \uplus & \{(\emptyset ,P,\mathcal {D}) \,\vert \, (\emptyset ,P,\mathcal {D}) \in S'_b\} \end{array} \end{aligned}$$
(53)

Once merged with \(\uplus \), \(\hat{H}_a \) and \(\hat{H}_b \) each contain exactly one element, and both elements have the same probability.

$$\begin{aligned} \hat{H} = \{ (\emptyset , P, \mathcal {D}_a \odot \mathcal {D}_b) \vert (\emptyset , P, \mathcal {D}_a) \in \hat{H}_a \wedge (\emptyset , P, \mathcal {D}_b) \in \hat{H}_b \} \end{aligned}$$
(54)

\(H\uplus \hat{H}\) is a safe over-approximation of both \(S'_a\) and \(S'_b\) with regards to the ordering defined in (36). We can define a function \(F_a\), which gives an over-approximation of each element of \(S'_a\) such that \((H\uplus \hat{H}) = \uplus _{s_a\in S'_a} F_a(s_a)\), as follows:

$$\begin{aligned} F_a(C,P_a,\mathcal {D}_a) = {\left\{ \begin{array}{ll} \{(\emptyset , P_a, \mathcal {D}_a)\} & \textit{if}\; C = \emptyset \\ \{(\emptyset , P_a, \mathcal {D}_a)\} & \textit{if}\; \not \exists (C,P_b,\mathcal {D}_b) \in S'_b \\ \{(C, P_b, \mathcal {D}_a\odot \mathcal {D}_b)\} \cup \{(\emptyset , P_a-P_b, \mathcal {D}_a)\} & \textit{if}\; \exists (C,P_b,\mathcal {D}_b) \in S'_b \wedge P_a > P_b\\ \{(C, P_a, \mathcal {D}_a\odot \mathcal {D}_b)\} & \textit{if}\; \exists (C,P_b,\mathcal {D}_b) \in S'_b \wedge P_a \le P_b \end{array}\right. } \end{aligned}$$
(55)

We define the over-approximation function \(F_b\) for elements in \(S'_b\) analogously:

$$\begin{aligned} F_b(C,P_b,\mathcal {D}_b) = {\left\{ \begin{array}{ll} \{(\emptyset , P_b, \mathcal {D}_b)\} & \textit{if}\; C = \emptyset \\ \{(\emptyset , P_b, \mathcal {D}_b)\} & \textit{if}\; \not \exists (C,P_a,\mathcal {D}_a) \in S'_a \\ \{(C, P_a, \mathcal {D}_a\odot \mathcal {D}_b)\} \cup \{(\emptyset , P_b-P_a, \mathcal {D}_b)\} & \textit{if}\; \exists (C,P_a,\mathcal {D}_a) \in S'_a \wedge P_b > P_a\\ \{(C, P_b, \mathcal {D}_a\odot \mathcal {D}_b)\} & \textit{if}\; \exists (C,P_a,\mathcal {D}_a) \in S'_a \wedge P_b \le P_a \end{array}\right. } \end{aligned}$$
(56)

The join operation is defined as follows:

$$\begin{aligned} S_a \bigsqcup S_b = H \uplus \hat{H} \end{aligned}$$
(57)
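Leaving aside the per-state distributions, which (51)–(54) combine with the maximum operator \(\odot \), the join can be sketched as follows; cache states are represented as a mapping from contents to occurrence probability, and both inputs are assumed to carry the same total probability:

```python
from collections import defaultdict

def simple_join(Sa, Sb):
    """Basic join (48)-(57) over sets of cache states given as dicts
    mapping frozenset contents -> probability; distributions omitted."""
    # (48) memory blocks common to both sets of cache states
    common = frozenset.union(*Sa) & frozenset.union(*Sb)

    # (49)-(50) restrict contents to the common blocks, merging
    # states whose restricted contents coincide
    def restrict(S):
        out = defaultdict(float)
        for C, P in S.items():
            out[C & common] += P
        return out

    Sa_, Sb_ = restrict(Sa), restrict(Sb)
    joined = defaultdict(float)
    # (51) contents common to both sides keep their minimum probability
    for C in Sa_.keys() & Sb_.keys():
        if C:
            joined[C] = min(Sa_[C], Sb_[C])
    # (52)-(54) the remaining capacity collapses into the empty state
    joined[frozenset()] += sum(Sa_.values()) - sum(joined.values())
    return dict(joined)
```

On the sets of Example 1 below, this sketch yields the contents and probabilities given in the tables, ending with the empty state at 36/64.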

Example 1 As an illustration, let us consider the state of a 4-way associative cache upon the convergence of two paths \(\pi _a = [a,b,c]\) and \(\pi _b = [b,c,a,d]\). The resulting sets of cache states are denoted by \(S_a\) and \(S_b\) respectively:

\(S_a\): \((\{a,b,c\},24/64,\mathcal {D})\), \((\{a,c\},12/64,\mathcal {D})\), \((\{b,c\},24/64,\mathcal {D})\), \((\{c\},4/64,\mathcal {D})\)

\(S_b\): \((\{a,b,c,d\},6/64,\mathcal {D})\), \((\{a,b,d\},12/64,\mathcal {D})\), \((\{a,c,d\},18/64,\mathcal {D})\), \((\{b,c,d\},6/64,\mathcal {D})\), \((\{a,d\},12/64,\mathcal {D})\), \((\{b,d\},3/64,\mathcal {D})\), \((\{c,d\},6/64,\mathcal {D})\), \((\{d\},1/64,\mathcal {D})\)

The cache states in \(S_a\) and \(S_b\) can be reduced to keep only their common blocks \(\mathbb {M}^{S_a \cap S_b} = \{a,b,c\}\). Common states are merged together:

\(S'_a\): \((\{a,b,c\},24/64,\mathcal {D})\), \((\{a,c\},12/64,\mathcal {D})\), \((\{b,c\},24/64,\mathcal {D})\), \((\{c\},4/64,\mathcal {D})\)

\(S'_b\): \((\{a,b,c\},6/64,\mathcal {D})\), \((\{a,b\},12/64,\mathcal {D})\), \((\{a,c\},18/64,\mathcal {D})\), \((\{b,c\},6/64,\mathcal {D})\), \((\{a\},12/64,\mathcal {D})\), \((\{b\},3/64,\mathcal {D})\), \((\{c\},6/64,\mathcal {D})\), \((\{\},1/64,\mathcal {D})\)

The set of common cache states H, with their minimal, guaranteed probability, is defined as \(H = \{(\{a,b,c\},6/64,\mathcal {D}), (\{a,c\},12/64,\mathcal {D}), (\{b,c\},6/64,\mathcal {D}), (\{c\},4/64,\mathcal {D})\}\).

There is no guarantee about the remaining states in \(S'_a\) and \(S'_b\) or their occurrence probability; they need to be approximated with the empty cache state:

\(\hat{H}_a\) (individual contributions, before merging): \((\{\},18/64,\mathcal {D})\), \((\{\},18/64,\mathcal {D})\)

\(\hat{H}_b\) (individual contributions, before merging): \((\{\},12/64,\mathcal {D})\), \((\{\},6/64,\mathcal {D})\), \((\{\},12/64,\mathcal {D})\), \((\{\},3/64,\mathcal {D})\), \((\{\},2/64,\mathcal {D})\), \((\{\},1/64,\mathcal {D})\)

Merged with \(\uplus \), each set reduces to a single state with probability 36/64, in accordance with (52)–(54).

Hence, the result of the join operation on the convergence of paths \(\pi _a\) and \(\pi _b\) is given by:

\(S_a \bigsqcup S_b \): \((\{a,b,c\},6/64,\mathcal {D})\), \((\{a,c\},12/64,\mathcal {D})\), \((\{b,c\},6/64,\mathcal {D})\), \((\{c\},4/64,\mathcal {D})\), \((\{\},36/64,\mathcal {D})\)

6 Improving on the join operation

The basic join operation introduced in the previous section focuses on the conservation of common cache states. Other states, because their contents differ or their occurrence probability is lower on alternative paths, are merged into the empty state. This results in a safe estimate of the information gathered from different paths. Yet, the method exhibits some limitations with regards to the information it conserves; the probability of occurrence of some blocks in cache, which we refer to as their capacity, is lost during the join process. We introduce a join function based on conserving this additional capacity of cache states. The function degrades the information about the presence of blocks in a cache to allocate, in a sound manner, its occurrence probability to a more pessimistic state. We first present, in Sect. 6.1, a ranking heuristic used to identify the cache states to which capacity should be allocated in priority. The improved capacity-conserving join is itself presented in Sect. 6.2.

6.1 Ranking cache states

The ordering \(\sqsubseteq \) introduced in Sect. 5.5 allows for the comparison of some cache states to each other irrespective of the upcoming trace of memory accesses. It is however a partial ordering and only compares two states with similar or included cache contents. As illustrated in Theorem 3, ordering the contributions of cache contents which do not include each other requires the consideration of future accesses, as captured by their forward reuse distance. The definition of an optimal join operation, through the optimal allocation of capacity to cache states, should ideally minimise the execution time on the worst-case path. However, multiple, incomparable paths would need to be considered, of which the worst case is unknown. We instead rely on a heuristic to prioritise the most beneficial cache states through a ranking system.

The proposed ranking is based on a two-sieve approach: (i) the numbers of useful blocks in cache are first compared, with more blocks ranking higher; (ii) cache states are then compared based on their expected hit probability. As a result, we can compare the ranks of two cache states:

$$\begin{aligned} C \le _{\textit{rank}} C' = {\left\{ \begin{array}{ll} \texttt {true} & \left| \{e \,\vert \, e \in C \wedge \textit{frd}^G_{\textit{min}}(e) \ne \infty \}\right| < \left| \{e \,\vert \, e \in C' \wedge \textit{frd}^G_{\textit{min}}(e) \ne \infty \}\right| \\ \texttt {true} & \left| \{e \,\vert \, e \in C \wedge \textit{frd}^G_{\textit{min}}(e) \ne \infty \} \right| = \left| \{e \,\vert \, e \in C' \wedge \textit{frd}^G_{\textit{min}}(e) \ne \infty \}\right| \wedge \sum _{e \in C} \hat{P}(e^{\textit{hit}}) \le \sum _{e \in C'} \hat{P}(e^{\textit{hit}}) \\ \texttt {false} & \textit{otherwise} \end{array}\right. } \end{aligned}$$
(58)

The first sieve prioritises cache states whose contents are likely to include others and hence hold more information. As per Theorem 1, the loss of information in a cache state cannot decrease the execution time distribution of an upcoming trace of accesses; implicitly, \(C \subseteq C' \Rightarrow C \le _{rank} C'\). The sum of their blocks' hit probabilities settles the rank of same-sized cache states, with a higher sum resulting in a higher rank. Each cache state is reduced to the minimum forward reuse distances of the blocks it holds. Those are used to estimate the corresponding hit probabilities of upcoming accesses by adapting the formula proposed in earlier approaches:

$$\begin{aligned} \hat{P}(e^{\textit{hit}}) = \left\{ \begin{array}{ll} 0 & \quad \textit{frd}^G_{\textit{min}}(e) \ge N\\ \left( \frac{N-1}{N}\right) ^{\textit{frd}^G_{\textit{min}}(e)} & \quad \textit{otherwise} \end{array}\right. \end{aligned}$$
(59)
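A sketch of the ranking, using (59) to score blocks and producing a sort key that implements the two sieves of (58); frd_min, mapping each block to its minimum forward reuse distance in the CFG, is assumed to come from the earlier data flow analysis:

```python
import math

def hit_prob(frd, N):
    """(59): estimated hit probability of a block given its minimum
    forward reuse distance, for an N-way random replacement cache."""
    return 0.0 if frd >= N else ((N - 1) / N) ** frd


def rank_key(C, frd_min, N):
    """Sort key for the ranking (58): first the number of useful
    blocks (finite forward reuse distance), then the sum of their
    estimated hit probabilities."""
    useful = [e for e in C if not math.isinf(frd_min[e])]
    return (len(useful), sum(hit_prob(frd_min[e], N) for e in useful))
```

Sorting candidate contents with this key in decreasing order yields the exploration order used by the capacity-conserving join below.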

It would seem intuitive to rely solely on the reuse distances of the blocks held in a cache to define its rank. Yet, a block in the cache may have a low minimum forward reuse distance, increasing its rank, but be reused solely on a path where no other block is reused. To reduce the complexity of the heuristic, it does not distinguish between the different subsequent paths; it captures and compares only the best possible reuse patterns, even though in some cases these may be optimistic. The forward reuse distances of blocks in two states might interleave, like the upcoming accesses to these blocks, or vary depending on the considered subsequent path. Theorem 3, on the impact of a replacement on execution time, cannot be used in such a context. Some cache states may be beneficial on a specific path, but be outranked on others. This prevents the direct comparison of cache states in the general case.

Our aim is to improve precision in the pWCET estimate, hence the heuristic aims to preserve capacity for cache blocks whose next access has a high probability of resulting in a cache hit. This happens, at least on some forward paths, for blocks with a small forward reuse distance. Preserving capacity for blocks with a larger forward reuse distance would likely result in a smaller probability of a cache hit and a more pessimistic overall pWCET estimate. (Note the ranking is only a heuristic and we do not claim that it makes optimal choices.)

6.2 Capacity conserving join

The join operator introduced earlier may result in lost capacity if the contents of states on alternative paths do not exactly match. Consider states \(\{a,b,e\}\) and \(\{b,c,e\}\), respectively in \(S'_a\) and \(S'_b\) along with others. They both include the contents \(\{b,e\}\), \(\{b\}\), \(\{e\}\) and \(\emptyset \), and their capacity could be allocated to whichever is the highest ranking one. \(\{a,e\}\), on the other hand, is a valid approximation of \(\{a,b,e\}\), in which it is included, but does not approximate \(\{b,c,e\}\).

To reduce this waste, the capacity-conserving join considers the cache states included in states from either incoming path, \(S'_a\) and \(S'_b\), by decreasing rank. Each considered cache state C is allocated as much of the remaining capacity of the states \(C_a\) (respectively \(C_b\)) it approximates in \(S'_a\) (resp. \(S'_b\)) as possible. The capacity that can be allocated to C is bounded by the minimum cumulative capacity of the states it approximates in \(S'_a\) and \(S'_b\). We also ensure that the overall contribution of a state \(C_a\) or \(C_b\) to state C does not exceed its capacity. This is a requirement for the resulting C to be a valid approximation as per the \(\sqsubseteq \) operator defined in Sect. 5.5. Algorithm 1 outlines the join process, and we further illustrate it with a simple example.

[Algorithm 1: capacity-conserving join \(\sqcup ^{capa}\) of two sets of cache states \(S_a\) and \(S_b\); the line numbers cited in Example 2 refer to this algorithm]
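The sketch below follows the same allocation strategy; it is not a transcription of Algorithm 1 (in particular, the line numbers referenced in Example 2 do not apply to it), it omits the combination of distributions, and its enumeration of candidate contents is only practical for small cache contents:

```python
from itertools import chain, combinations

def subsets(C):
    """All subsets of a cache contents, as frozensets."""
    C = list(C)
    return (frozenset(s) for r in range(len(C) + 1)
            for s in combinations(C, r))


def capacity_join(Sa, Sb, rank):
    """Capacity-conserving join of two sets of cache states given as
    dicts mapping frozenset contents -> probability, already restricted
    to their common blocks; rank is a sort key, higher ranks first."""
    candidates = set()
    for C in chain(Sa, Sb):
        candidates.update(subsets(C))
    cap_a, cap_b = dict(Sa), dict(Sb)
    result = {}
    for C in sorted(candidates, key=rank, reverse=True):
        # states whose contents include C and still have spare capacity
        avail_a = [(Cs, p) for Cs, p in cap_a.items() if Cs >= C and p > 0]
        avail_b = [(Cs, p) for Cs, p in cap_b.items() if Cs >= C and p > 0]
        alloc = min(sum(p for _, p in avail_a),
                    sum(p for _, p in avail_b))
        if alloc <= 0:
            continue
        result[C] = alloc
        # consume capacity from the lowest ranking contributors first,
        # so higher ranking contents keep theirs for later candidates
        for cap, avail in ((cap_a, avail_a), (cap_b, avail_b)):
            left = alloc
            for Cs, p in sorted(avail, key=lambda cp: rank(cp[0])):
                take = min(p, left)
                cap[Cs] -= take
                left -= take
                if left == 0:
                    break
    # leftover capacity, if any, is gathered into the empty state
    leftover = sum(cap_a.values())
    if leftover > 1e-12:
        result[frozenset()] = result.get(frozenset(), 0.0) + leftover
    return result
```

With a rank built from (58) and the upcoming sequence [a,c,b], this sketch reproduces the allocations of Example 2 below, including \((\{a,c\},18/64)\) and the final empty state at 13/64.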

Example 2 Consider the previous example (Example 1) after the cache states have been reduced to only their common blocks in lines 1–3. All cache contents included in those of \(S'_a\) (\(\{a,b,c\}\), \(\{a,b\}\), \(\{b\}\), etc.) are also included in those of \(S'_b\) (line 4). Assuming the upcoming sequence of accesses is [a,c,b], the considered states ordered by decreasing rank are:

\(S'_a\): \((\{a,b,c\},24/64,\mathcal {D})\), \((\{a,c\},12/64,\mathcal {D})\), \((\{b,c\},24/64,\mathcal {D})\), \((\{c\},4/64,\mathcal {D})\)

\(S'_b\): \((\{a,b,c\},6/64,\mathcal {D})\), \((\{a,c\},18/64,\mathcal {D})\), \((\{a,b\},12/64,\mathcal {D})\), \((\{b,c\},6/64,\mathcal {D})\), \((\{a\},12/64,\mathcal {D})\), \((\{c\},6/64,\mathcal {D})\), \((\{b\},3/64,\mathcal {D})\), \((\{\},1/64,\mathcal {D})\)

The first iteration of the capacity-conserving join focuses on \(\{a,b,c\}\), to which no other state can provide capacity. As a consequence, after the first iteration, \(\textit{states}_1 = \{(\{a,b,c\},6/64,\mathcal {D})\}\). In \(S'_a\), the state has a remaining capacity of \(\frac{18}{64}\) which could be used to accommodate any of the contents of size 2 or less (i.e. \(\{a,c\}\), \(\{a,b\}\), \(\{b,c\}\), etc.). In particular, during the second loop iteration, when contributions to the capacity of \(\{a,c\}\) are gathered from \(S'_a\) and \(S'_b\) (lines 7–10), we have:

\(\textit{contribution}_a\): \((\{a,b,c\},18/64,\mathcal {D})\), \((\{a,c\},12/64,\mathcal {D})\)

\(\textit{contribution}_b\): \((\{a,c\},18/64,\mathcal {D})\)

The presence of both a and c, captured by state \(\{a,c\}\), can therefore be guaranteed with probability \(\frac{18}{64}\) on both paths. The capacity of states in \(S'_a\) and \(S'_b\) is decreased accordingly (lines 15–21). Capacity is first picked from the lowest ranking states, such that in our example \(\{a,b,c\}\in S'_a\) still has a remaining capacity of \(\frac{12}{64}\): of the \(\frac{18}{64}\) allocated to \(\{a,c\}\), \(\frac{12}{64}\) comes from the lower ranking \(\{a,c\}\) in \(S'_a\) and only \(\frac{6}{64}\) from \(\{a,b,c\}\).

During this step, the execution time distribution obtained through the combination of the contributors' distributions is also computed, in \(d_a\) (see line 18) and \(d_b\), respectively for \(S'_a\) and \(S'_b\). An upper-bound of \(d_a\) and \(d_b\) is used when computing the resulting distribution for the conserved state (line 24). After the second iteration of the algorithm, \(\textit{states}_2 = \textit{states}_1 \cup \{(\{a,c\},18/64,\mathcal {D})\}\).

Once all states have been explored, the remaining capacity is gathered into the empty state (line 26). The conserved contents are:

\(\textit{states} = S_a \sqcup ^{capa} S_b\): \((\{a,b,c\},6/64,\mathcal {D})\), \((\{a,c\},18/64,\mathcal {D})\), \((\{a,b\},12/64,\mathcal {D})\), \((\{b,c\},6/64,\mathcal {D})\), \((\{a\},0/64,\mathcal {D})\), \((\{c\},6/64,\mathcal {D})\), \((\{b\},3/64,\mathcal {D})\), \((\{\},13/64,\mathcal {D})\)

Keeping only states with a non-null occurrence probability, the capacity conserving join results in:

\(\textit{states} = S_a \sqcup ^{capa} S_b\): \((\{a,b,c\},6/64,\mathcal {D})\), \((\{a,c\},18/64,\mathcal {D})\), \((\{a,b\},12/64,\mathcal {D})\), \((\{b,c\},6/64,\mathcal {D})\), \((\{c\},6/64,\mathcal {D})\), \((\{b\},3/64,\mathcal {D})\), \((\{\},13/64,\mathcal {D})\)

Compare the resulting contents to those of the previously introduced join operation, repeated for convenience:

\(S_a \bigsqcup S_b \): \((\{a,b,c\},6/64,\mathcal {D})\), \((\{a,c\},12/64,\mathcal {D})\), \((\{b,c\},6/64,\mathcal {D})\), \((\{c\},4/64,\mathcal {D})\), \((\{\},36/64,\mathcal {D})\)

The solution resulting from the application of \(\sqcup ^{capa}\) dominates that of the previously introduced join operation. Indeed, a state can only soundly provide capacity to itself or to a state its contents include. With the proposed ranking heuristic, the latter is a lower ranking state, which the algorithm explores after C itself; the capacity of C is thus first used for C. As a consequence, the capacity allocated to a state is at least its minimum capacity in \(S_a\) or \(S_b\), e.g. \(\frac{12}{64}\) for \(\{a,c\}\). This minimum is exactly the capacity that was allocated to the state by the previous join implementation. Different ranking heuristics could potentially lose this dominance relation.

The capacity join further keeps the same timing information as the standard \(\sqcup \) operation. The combined distributions and their weights are the same, but, as a result of the operation, they are attached to different, less pessimistic cache states. The same fragment of distribution accounts for fewer or the same number of misses using the capacity join as it does using the standard operation.

7 Worst-case path reduction

Approximations of the cache contention or of the contents of abstract cache states occur on control flow convergence, when two paths in the control flow graph meet. This ensures the validity of the bounds computed by SPTA whatever the path exercised at runtime, while keeping the complexity of the analysis under control. The complete set of possible paths need not be made explicit; however, the loss of information that may occur on flow convergence decreases the tightness of the computed pWCET.

In most applications, there exists some redundancy among paths with regards to their contribution to the pWCET. If a path \(\pi _b\) can be guaranteed to always perform worse than another path \(\pi _a\) (\(\mathcal {D}(\pi _b)\ge \mathcal {D}(\pi _a)\)), the contribution of the former to the pWCET dominates that of the latter, \(\mathcal {D}(\pi _b) = \mathcal {D}(\pi _b) \odot \mathcal {D}(\pi _a)\). In that case, the latter path can be removed from the set of paths considered by the analysis, reducing the complexity of the control flow while preserving the soundness of the computed upper-bound.

In this section, we define the notion of inclusion between paths and prove that path inclusion is a sub-case of path redundancy; the execution time distribution of an including path dominates that of any path it includes. Based on this principle, we introduce program transformations to safely identify and remove from the control flow those paths that are included in others. This improves the precision of the analysis.

Worst-case execution path (WCEP) reduction comprises a set of varied transformations: empty conditions removal, worst-case loop unrolling, and simple path elimination. They apply at the logical level, during analysis, and unlike path upper-bounding approaches (Kosmidis et al. 2014) do not require modifications of the object or source code for pWCET computation.

7.1 Path inclusion

A path is said to include another if it contains at least the same sequence of ordered accesses, possibly interleaved with additional ones. As an example, consider paths \(\pi _a = [a,b,c,e]\) and \(\pi _b = [a,b,c,d,a,e]\) where \(\pi _a\) is included in \(\pi _b\). The former path can be split into sub-paths \(\pi _S=[a,b,c]\) and \(\pi _E=[e]\), such that \(\pi _a = [\pi _S,\pi _E]\). \(\pi _b\) can then be expressed as the interleaving of \(\pi _S\) and \(\pi _E\) with \(\pi _V = [d,a]\), i.e. \(\pi _b = [\pi _S,\pi _V,\pi _E]\). Similarly, \(\pi _b\) includes [a,c,d,e], but not [b,a,c].

Definition 1

(Including path) Let \(\pi _a\) and \(\pi _b\) be two paths, such that \(\pi _a\) is the concatenation of two sub-paths \(\pi _S\) and \(\pi _E\): \(\pi _a = [\pi _S ,\pi _E]\). The inclusion of \(\pi _a\) in \(\pi _b\), denoted \(\pi _a \unlhd \pi _b\), is recursively defined as either \(\pi _b=[\pi _S,\pi _V,\pi _E]\) or \(\pi _b = [\pi _S,\pi _V,\pi _E']\) where \(\pi _E \unlhd \pi _E'\) and \(\pi _E \ne \pi _E'\).

Theorem 5

The execution time distribution of a path \(\pi \) prefixed by an access to block b upper-bounds that of path \(\pi \) combined with a hit latency, \(\mathcal {H} + \mathcal {D}(\pi ,s) \le \mathcal {D}([[b],\pi ],s)\).

Proof

As per (21), the property trivially holds if [b] is a hit, since \(\pi \) then executes starting from the same cache state s in both cases. We focus on the case where [b] is a miss from s. This results in N possible cache states \(s[l_i=b]\) such that, thanks to Theorem 2:

$$\begin{aligned} \mathcal {H} + \mathcal {D}(\pi ,s) \le \mathcal {M} + \sum _{i \in [1,N]} \frac{1}{N} \cdot \mathcal {D}(\pi ,s[l_i=b]) \end{aligned}$$
(60)
$$\begin{aligned} \mathcal {H} + \mathcal {D}(\pi ,s) \le \mathcal {D}([[b],\pi ],s) \end{aligned}$$
(61)

\(\square \)

For the sake of readability, in the following we omit the cache state s when comparing the execution time distributions of two paths; two paths are always compared using the same input cache state, \(\mathcal {D}(\pi ) \le \mathcal {D}(\pi ') \Leftrightarrow \mathcal {D}(\pi , s) \le \mathcal {D}(\pi ', s)\).

Theorem 6

The execution time distribution of a path \(\pi _a\) prefixed by path \(\pi _s\) upper-bounds that of path \(\pi _a\) alone, \(\forall \pi _s, \pi _a, \mathcal {D}(\pi _a) \le \mathcal {D}([\pi _s,\pi _a])\).

Proof

From Theorem 5, we know that \(\mathcal {D}(\pi _a) \le \mathcal {D}([[v_n],\pi _a])\), which can be extended to \(\mathcal {D}(\pi _a) \le \mathcal {D}([[v_1,v_2,\ldots ,v_n],\pi _a])\) since \(\mathcal {D}([[v_2,\ldots ,v_n],\pi _a]) \le \mathcal {D}([[v_1,v_2,\ldots ,v_n],\pi _a])\) and so on. The relation holds for prefixes of arbitrary length. \(\square \)

Theorem 7

(Included path ordering) If \(\pi _a\) is included in \(\pi _b\), then the execution time distribution of \(\pi _b\) is greater than or equal to that of \(\pi _a\), \(\pi _a \unlhd \pi _b \Rightarrow \mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\)

Proof

We prove this property by induction.

Base case: We need to prove that if \(\pi _a \unlhd \pi _b\) such that \(\pi _a = [\pi _S,\pi _E]\) and \(\pi _b = [\pi _S,\pi _V,\pi _E]\), then \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\). From Theorem 6, we know that \(\mathcal {D}(\pi _E) \le \mathcal {D}([\pi _V,\pi _E])\).

The execution of \(\pi _S\) cannot be impacted by accesses in either \(\pi _E\) or \(\pi _V\). It is therefore the same on both paths \(\pi _a\) and \(\pi _b\). As proved in Theorem 6, whatever cache state is left by the execution of \(\pi _S\), the execution time distribution of \([\pi _V,\pi _E]\) is either greater than or equal to that of \(\pi _E\). Therefore, \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\).

Inductive step: Let us assume \(\pi _a = [\pi _S,\pi _E]\) and \(\pi _E'\) is such that \(\pi _E \unlhd \pi _E'\) and \(\mathcal {D}(\pi _E) \le \mathcal {D}(\pi _E')\). We need to prove that for \(\pi _b = [\pi _S,\pi _V,\pi _E']\), \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\). From Theorem 5, we know that \(\mathcal {D}([\pi _V,\pi _E']) \ge \mathcal {D}(\pi _E')\), and as a consequence \(\mathcal {D}([\pi _V,\pi _E']) \ge \mathcal {D}(\pi _E)\). Further, the execution time distribution of \(\pi _S\) is not impacted by accesses in either \(\pi _V\), \(\pi _E\), or \(\pi _E'\) and is the same in \(\pi _a\) and \(\pi _b\), hence \(\mathcal {D}(\pi _a) \le \mathcal {D}(\pi _b)\). \(\square \)

We now extend the notion of path inclusion to sets of paths. A set of paths \(\varPi \) is a path-included set of \(\varPi ^{\circ }\) if each path in \(\varPi \) is included in a corresponding path in \(\varPi ^{\circ }\), \(\varPi \unlhd \varPi ^{\circ } \Rightarrow \forall \pi \in \varPi , \exists \pi ^{\circ } \in \varPi ^{\circ }, \pi \unlhd \pi ^{\circ }\). As a consequence, for each path \(\pi \in \varPi \), there is a path in \(\varPi ^\circ \) the actual pWCET of which also upper-bounds the execution time distribution of \(\pi \). The actual pWCET of \(\varPi ^{\circ }\) is thus an upper-bound on the execution time distributions of all paths in \(\varPi \), \(\forall \pi \in \varPi , \mathcal {D}(\varPi ^{\circ }) \ge \mathcal {D}(\pi )\). As the estimated pWCET of a path \(\hat{\mathcal {D}}(\pi )\) is an upper-bound on its execution time distribution, \(\mathcal {D}(\pi ) \le \hat{\mathcal {D}}(\pi )\), it is sufficient to perform the pWCET analysis of a CFG G on a reduced set of paths which path-includes the set \(\varPi (G)\).

7.2 Empty conditions removal

Simple conditional constructs may induce paths that are included in others. In particular, any path that goes through an empty branch or case is included in any alternative branch which triggers memory accesses. The edges in a CFG which represent such cases can be safely removed to reduce path indeterminism during pWCET analysis, improving the precision of the results.

Figure 3 gives an example of this for an \(\textit{if-then}\) construct with an empty else branch. At point fi in the program, the analysis accounts for the eviction, by accesses in \(\textit{then}\), of blocks present at the end of \(\textit{if}\). But if the empty edge is kept, any cache block loaded by the \(\textit{then}\) branch cannot be considered as present by the analysis at fi. This reduces the knowledge of the cache contents, and the precision of the resulting pWCET distribution. By removing the edge corresponding to the empty branch we remove this source of pessimism.

Fig. 3 Simple if-then conditional structure. The edge from \(\textit{if}\) to fi, through the empty else case, can be removed for pWCET estimation

An edge from vertex \(v_p\) to \(v_i\) corresponds to an empty path if there is an alternative exit from \(v_p\) through \(v_j\) which later reaches \(v_i\). The notion of post-dominators, as an example, can be used to simply capture a subset of those empty branches. In Fig. 3, any path to the program exit through \(\textit{if}\) or \(\textit{then}\) will traverse fi, which post-dominates both \(\textit{if}\) and \(\textit{then}\). More formally:

$$\begin{aligned} \begin{array}{l} \forall v_p \in V, \forall v_i \in \textit{successors}(v_p){\setminus } \{v_p\}, \\ \quad \left( \exists v_j \in \textit{successors}(v_p){\setminus }\{v_p\},\; v_i \ne v_j \wedge v_i \in \textit{post-dom}(v_j) \right) \\ \quad \Rightarrow \varPi (L) \unlhd \varPi (L{\setminus }\{(v_p,v_i)\}) \end{array} \end{aligned}$$
(62)
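As a sketch, condition (62) can be checked with a straightforward fixed-point computation of post-dominator sets; the CFG is assumed to be given as a successor map with a single end node, every other node having at least one successor:

```python
def post_dominators(succ, end):
    """post_dom[v]: the set of nodes traversed by every path from v
    to the end node, computed as a simple fixed point."""
    nodes = set(succ) | {end}
    pdom = {v: set(nodes) for v in nodes}
    pdom[end] = {end}
    changed = True
    while changed:
        changed = False
        for v in nodes - {end}:
            new = {v} | set.intersection(*(pdom[s] for s in succ[v]))
            if new != pdom[v]:
                pdom[v], changed = new, True
    return pdom


def removable_empty_edges(succ, end):
    """Edges (v_p, v_i) matching (62): a sibling successor v_j of v_p
    is post-dominated by v_i, so the edge models an empty branch."""
    pdom = post_dominators(succ, end)
    return [(vp, vi)
            for vp, succs in succ.items()
            for vi in succs
            if any(vj != vi and vi in pdom[vj] for vj in succs)]
```

On the construct of Fig. 3, with succ = {'if': ['then', 'fi'], 'then': ['fi'], 'fi': []}, the sketch flags the edge ('if', 'fi') for removal.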

The collecting approach integrates worst-case path computation and cache contribution estimation, referred to as high and low level analyses respectively in prior WCET estimation approaches (Puschner and Koza 1989), and ignores most feasibility constraints. This may result in unnecessary pessimism if the infeasible paths are expensive. Reduction of the different scenarios in the CFG, e.g. by expanding the CFG to only model feasible paths, allows the capture of some flow constraints at the cost of an increase in the size of the considered flow.

7.3 Loop unrolling

Natural loop constructs are a source of path redundancy. In particular, paths which do not exercise the maximum number of iterations of a loop they traverse have an including counterpart. An iteration of loop \(l=(v_h,V_l)\) starts with a transition from its header \(v_h\) to any of its nodes \(v_n\in V_l\). Conversely, any iteration, with the exception of the last, ends with a transition back to the header \(v_h\), through a back-edge. The set of paths \(\varPi _{\textit{iter}} = [\varPi (V_l{\setminus }\{v_h\}),[v_h]]\) captures the paths followed during a complete iteration through loop l.

A valid path which captures n iterations can be expressed as \([[v_h],\pi _1,\ldots ,\pi _{n-1},\pi _{\textit{last}}]\) with \(\forall i, 1 \le i < n, \pi _i \in \varPi _{\textit{iter}}\), and \(\pi _{\textit{last}}\) the last iteration of the loop. \(\pi _{\textit{last}}\) is a path in \(\varPi (V_l{\setminus }\{v_h\})\) followed by a node outside the loop. We denote by \(\varPi _{n}\) the set of paths which iterate n times through the loop l. A path in \(\varPi _{n+1}\) can be expressed as \([[v_h],\pi _1,\ldots ,\pi _{n-1},\pi _{n},\pi _{\textit{last}}]\) with \(\pi _{n} \in \varPi _{\textit{iter}}\), i.e. each path in \(\varPi _n\) is included in a path of \(\varPi _{n+1}\). By extension, the set of paths \(\varPi _{\textit{max-iter}(l)}\) path-includes all other sets of paths which iterate over l at least once.

As an example, consider the loop \(l=(b,\{b,c,d,e\})\) in Fig. 4. The path \(\pi _1 = [a,b,d,e,f]\) iterates a single time through l, with \(\pi _{\textit{last}} = [d,e]\). The valid iteration sequences in this example are [d,e,b] and [c,e,b]. By inserting one iteration before the last in \(\pi _1\), we obtain the valid paths \(\left[ [a,b],[d,e,b],[d,e,f]\right] \) and \( \left[ [a,b],[c,e,b],[d,e,f]\right] \) respectively. Both paths do indeed include \(\pi _1\).

Fig. 4 Simple do-while loop structure with an embedded conditional. The set of paths which iterate \(x+1\) times through loop l includes all paths with fewer iterations

In our model, we only restrict the maximum number of iterations of a loop. Every iteration may be the last; there is no guarantee that a loop always goes through the same number of iterations when it is executed. The loop unrolling algorithm hence operates without knowledge of the exact number of iterations of the loop. Every unrolled iteration is connected to the successors of the loop. As per Theorem 7 and the inclusion property for consecutive loop iterations, it is sufficient for pWCET estimation to only consider paths where each loop, when executed, goes through its current maximum number of iterations. The unrolling of loop l assumes \(\textit{max-iter}(l,\textit{ctx})\) as the exact iteration count of loop l. In effect, when unrolling any iteration of loop l besides the last, edges from nodes in the loop to nodes outside l are discarded. Conversely, unrolling the last iteration implies conserving only the nodes and edges of l which lead to a loop exit.

This property holds in natural loops as long as any path taken during an iteration can also be taken during any other iteration. Complex access patterns or flow constraints, e.g. if a path can only be executed once per execution of a loop, are a challenge to this assumption. As discussed in Sect. 7.2, the collecting approach integrates both worst-case path computation and estimation of the cache contribution, ignoring most path feasibility constraints. Expansion of the CFG to capture those constraints in its flow can be applied at the cost of a more complex flow (Fig. 5).

Fig. 5 Simple do-while loop structure (Fig. 4) unrolled assuming \(\textit{max-iter}(l)=3\). The unrolled (dashed) back-edges are only kept when a generic loop unrolling algorithm is used. They are removed when \(\textit{max-iter}(l)\) iterations are enforced

The same principles hold for call inlining. Recursion is also a source of path redundancy; recursive calls manifest as repetitions in the call stack of an application. Here, a single source node is attached to the CFG of each procedure, which identifies its start. The source node therefore behaves similarly to the head of a loop, and is a guaranteed entry on each call. The same logic applies to both natural loops and recursive calls: when performing virtual or physical inlining, the analysis unrolls recursion up to the defined bound.

7.4 Access renaming

Path inclusion relies on the verbatim sequence of accesses to detect redundancy between paths; even the slightest dissimilarity between alternative sequences defeats the property. However, some accesses are known to perform worse than others at a given point in time. Renaming an access in a sequence to a worse performing target, i.e. changing the target of the access, can smooth the differences between paths such that the renamed path is included in an alternative to its original counterpart. The renamed path then acts as an intermediate bound between the original one and the including alternative, providing an argument for the removal of the original path. We now introduce a set of conditions that ensure the dominance of the execution time distribution of a renamed path over its original counterpart. If all transformations from the original satisfy these properties, the renamed path dominates the original. The renamed path may further be included in an alternative path; the original is then known to be redundant with this alternative and can be omitted during analysis.

Let \(\pi =[v_1,v_2,\ldots ,v_{k-1},v_k]\) be a sequence of k accesses. \(\pi (e\rightarrow b)\) denotes the renaming of all accesses to memory block e to b in \(\pi \), \(\pi (e\rightarrow b) = [v_1',v_2',\ldots ,v_{k-1}',v_k']\) where \(\forall i\in [1,k], v_i' = v_i\) if \(v_i \ne e\) and \(v_i' = b\) otherwise. By definition, renaming e to b has no impact on \(\pi \) if it does not access e. \(\pi (e\rightarrow b)(c\rightarrow d)\) identifies a rename from e to b followed by a rename from c to d on the resulting sequence. Note that if no destination block is used as a source block, the order of the renames is irrelevant. For instance \(\pi (e\rightarrow b)(c\rightarrow d) = \pi (c\rightarrow d)(e\rightarrow b)\), but \(\pi (e\rightarrow b)(b\rightarrow c) \ne \pi (b\rightarrow c)(e\rightarrow b)\).
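For illustration, the renaming itself is a one-line transformation of the access sequence:

```python
def rename(path, e, b):
    """pi(e -> b): rename every access to block e into an access to b.
    rename(['a', 'e', 'c', 'e'], 'e', 'b') yields ['a', 'b', 'c', 'b']."""
    return [b if v == e else v for v in path]
```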

We identified three conditions to ensure the dominance of the pWCET of a renamed path \(\pi '=[\pi _S,\pi _V(e\rightarrow b),\pi _E]\) over its original \(\pi =[\pi _S,\pi _V,\pi _E]\), where \(\pi _V = [e]\) or \(\pi _V = [e,v_1,\ldots v_j,e]\), and further prove their impact:

  • No enclosure There is no access to b over the renamed sequence \(\pi _V\), \(\forall v_i \in \pi _V, v_i \ne b\).

  • Prefix ordering b is no more likely to be in the cache than e after \(\pi _S\) (before \(\pi _V\)). This occurs when the closest access to e before \(\pi _V\), that is the last access to e in \(\pi _S\), occurs after the last access to b in \(\pi _S\), \(\textit{rd}(e,\pi _S) < \textit{rd}(b,\pi _S)\).

  • Suffix ordering b is no more likely to trigger a hit than e if present in cache after \(\pi _V\) (before \(\pi _E\)). The first access to e after \(\pi _V\), i.e. in \(\pi _E\), is before the first access to b, \(\textit{frd}(e,\pi _E) < \textit{frd}(b,\pi _E)\).

For some inputs, the analysis may produce lower estimated execution time distributions for the renamed path than for the original one, because of the reduced pessimism in its analysis. Nevertheless, the computed pWCET for the renamed path, irrespective of the actual input, upper-bounds the exact pWCET of the original path.

Theorem 8

(Renamed path ordering) Let \(\pi \) be a path divided into three sub-paths \(\pi =[\pi _S,\pi _V,\pi _E]\), where \(\pi _V=[e,v_1,\ldots ,v_k,e]\). The pWCET of \(\pi \) is smaller than or equal to that of the renamed sequence \(\pi _r=[\pi _S,\pi _V(e\rightarrow b),\pi _E]\), \(\mathcal {D}(\pi ) \le \mathcal {D}(\pi _r)\), if:

  • there is no access to b in \(\pi _V\);

  • the reuse distance of e before \(\pi _V\) is smaller than that of b at this point;

  • the forward reuse distance of e at the end of \(\pi _V\) is smaller than that of b at this point.

Proof

See Appendix.\(\square \)

7.4.1 Simple path elimination

Access renaming allows for a wide range of transformations between paths within a program. We aim to reduce the set of paths that need to be considered during the analysis of an application without increasing its pWCET. An ideal solution would consider each path individually; each would then be matched against its larger alternatives to check for inclusion using rename operations. This approach is impractical due to the sheer number of paths and the complexity of the matching problem over large sequences of accesses.

Our initial approach instead matches and eliminates simple paths in conditionals, that is, branches which do not exhibit control flow divergence. Consider the example in Fig. 6: the branches of conditional d are simple paths, while only the lower branch of the first enclosing conditional b is a simple path, as the upper one diverges at d. Focusing on simple paths reduces the exploration space, both in terms of the considered paths and their relative size. The considered paths are likely to be similar and to match using few rename operations. This simplifies path elimination in the CFG. Figure 6 however illustrates the restrictions of this approach: the topmost branch [d,f,b] is redundant with the bottom one [e,d,f,b], but is only compared with [d,e,a,b], which it does not match.

Fig. 6 Embedded conditional structures. The maximum reuse distance of accesses is given as superscript. [f,b] and [e,a,b] qualify as simple paths from d to \(b^3\) and are matched against each other. [d,f,b] and [e,d,f,b] are not, since the control flow may diverge at d

We use a simple method, outlined in Algorithms 2 and 3, to test inclusion and perform renaming at the same time. The first algorithm illustrates the traversal of a CFG and the identification of suitable candidates for simple path elimination through renaming. The successors of each conditional vertex are considered pairwise (lines 5–14). Should there be a simple path from each vertex in the pair to the same node (dom), the redundant one, if any, is removed (line 10). Paths may converge inside a simple path, as the simple path definition only restricts flow divergence. The removal of nodes subsequent to such a convergence may result in the removal of other, non-redundant paths. The \(\textit{RemovePath}\) method removes at least the edge from \(\textit{vertex}\) to \(\textit{renamed}\) but may need to conserve other nodes. Using the flow depicted in Fig. 7 as an example, the middle path [a,b,f,b] is identified as redundant with the top one [e,b,f,b], but not with [b,e,f,b]. Only \([a^0]\) and \([b^1]\) can effectively be removed; removing vertices [f] and \([b^2]\) on the lowest branch would remove [b,e,f,b] from the set of possible paths, which is unsafe.

Fig. 7 Embedded conditional structures. The maximum reuse distance of accesses is given as superscript. All three paths [e,b,f,b], [a,b,f,b] and [b,e,f,b] qualify as simple paths and are matched against each other. [a,b,f,b] is captured as redundant with respect to [e,b,f,b], but not [b,e,f,b]

The recursive IsRedundant method, outlined in Algorithm 3, focuses on asserting the redundancy of two sub-paths of a CFG using renaming. The algorithm progresses access by access; each call to IsRedundant considers the first access in the renamed path \(\pi _v\) and possible matches in \(\pi _r\). It explores the following options: (i) match the address on the two paths (line 8), (ii) attempt renaming the access on path \(\pi _v\) to one on path \(\pi _r\) (line 12), or (iii) skip an access on the longest trace (on line 7, the operation removes the head of path \(\pi _r'\)). If it reaches the end of path \(\pi _v\), that path is identified as redundant with respect to \(\pi _r\); there is a sequence of renames which results in its inclusion in \(\pi _r\). Conversely, if there are not enough accesses left in \(\pi _r'\) to match the ones in \(\pi _v'\), the algorithm returns false. Hence, renames only occur on the shortest path, as it does not hold enough accesses to include the longer one.

The two sub-paths compared in the IsRedundant method may be reached through multiple paths in the CFG and lead to the execution of different suffixes. To rename block e to b, the operation must be valid for all prefixes and suffixes of the considered path \(\pi '_v\). Any access to b prior to the renamed segment must always be followed by an access to e before \(\pi _V\) later in the CFG (prefix ordering condition). Conversely, an access to e must precede the next access to b on all subsequent paths where b is accessed (suffix ordering condition). Using the minimum forward and backward reuse distances of accesses in the CFG does not yield the required guarantee, only a necessary condition: b may be accessed on a path where e is not accessed and still have higher minimum reuse and forward reuse distances. However, the reuse distances can be used to speed up the validation process. Similarly, the first access met to either block reduces the search space, as it validates the property for the current branch (on e) or proves it does not hold (on b).

[Algorithm 2: traversal of the CFG and pairwise matching of simple paths for elimination]
[Algorithm 3: the recursive IsRedundant test of path inclusion under renaming]
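A backtracking sketch in the spirit of IsRedundant follows; it is not a transcription of Algorithm 3 (the line numbers above do not apply to it), and the caller-supplied predicate can_rename must check the no-enclosure and prefix/suffix ordering conditions of Theorem 8 against the CFG:

```python
def is_redundant(pi_v, pi_r, can_rename, renames=None):
    """Test whether pi_v, possibly after renames, is included in pi_r.
    pi_v, pi_r: lists of accessed blocks; renames: the renames
    committed so far, applied to the remainder of pi_v."""
    renames = renames or {}
    if not pi_v:
        return True               # pi_v fully matched: it is redundant
    if len(pi_r) < len(pi_v):
        return False              # not enough accesses left in pi_r
    head = renames.get(pi_v[0], pi_v[0])
    # (i) the next accesses match verbatim
    if head == pi_r[0] and is_redundant(pi_v[1:], pi_r[1:],
                                        can_rename, renames):
        return True
    # (ii) attempt renaming the next access of pi_v to that of pi_r
    if pi_v[0] not in renames and head != pi_r[0] \
            and can_rename(head, pi_r[0]):
        extended = {**renames, pi_v[0]: pi_r[0]}
        if is_redundant(pi_v[1:], pi_r[1:], can_rename, extended):
            return True
    # (iii) skip one access on the longer path pi_r
    return is_redundant(pi_v, pi_r[1:], can_rename, renames)
```

On the simple paths of Fig. 6, is_redundant(['f', 'b'], ['e', 'a', 'b'], can_rename) succeeds whenever can_rename('f', 'e') holds: f is renamed to e, a is skipped, and b matches verbatim.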

7.4.2 Control flow graph segmentation

WCEP reduction methods aim to remove included paths whose contribution to the execution time distribution is no greater than that of some alternative worst-case path. This reduces the number of accesses to be analysed and hence the complexity of the approach. To reduce this complexity further, we present preliminary work towards the reduction of the analysed program segments through CFG partitioning (Ballabriga and Cassé 2008). This method was first explored by Pasdeloup (2014) through heuristics tailored for SPTA.

Conceptually, the cache is flushed at defined points in the program, on partition boundaries, to reduce the number of in-flight states. Flushing is in that case an abstraction in the analysis; the system is not expected to enforce this behaviour at runtime. Partitions divide the CFG into non-overlapping sections of consecutive nodes. We select flush points such that a minimum number M of misses occurs between two flushes. This allows control over the complexity and precision trade-off of the analysis. The process is sound, as the loss of information regarding cache contents cannot decrease the execution time distribution of a trace as per Theorem 1. The flush operation, defined in (63) and (64), relies on the merge operator \(\uplus \):

$$\begin{aligned} \textit{Flush} : 2^\mathbb {CS} \rightarrow 2^\mathbb {CS} \end{aligned}$$
(63)
$$\begin{aligned} \textit{Flush}\left( \left\{ (C_0, P_0, \mathcal {D}_0), \ldots , (C_n, P_n, \mathcal {D}_n) \right\} \right) = \biguplus \left( \left\{ (\emptyset , P_0, \mathcal {D}_0), \ldots , (\emptyset , P_n, \mathcal {D}_n) \right\} \right) \end{aligned}$$
(64)
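A sketch of Flush over the same representation as the previous examples; we assume, consistently with (6)–(8), that the merge combines the distributions of the merged states weighted by their occurrence probabilities:

```python
def flush(states):
    """(63)-(64): drop all cache contents at a partition boundary and
    gather the capacity into a single empty state. states: list of
    (contents, probability, distribution) triples, with distributions
    given as dicts mapping execution time -> probability."""
    total = sum(P for _, P, _ in states)
    dist = {}
    for _, P, D in states:
        for x, p in D.items():
            dist[x] = dist.get(x, 0.0) + (P / total) * p
    return [(frozenset(), total, dist)]
```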

Our partition of the CFG focuses on consecutive single-entry single-exit (SESE) regions (Ballabriga and Cassé 2008). All control flow enters a SESE region through its single entry and leaves through its single exit. Examples of valid SESE regions in a CFG are highlighted in Fig. 8a. Consecutive SESE regions are connected to at most one predecessor and one successor SESE region, such that all control flow in the application is captured by a single path through the sequence of regions. As an example, consider the decomposition in Fig. 8b corresponding to the CFG in Fig. 8a. There is no cache-related dependency between the analyses of consecutive SESE regions; each is analysed assuming an empty initial cache state. Segments can thus be analysed independently with regards to the cache, and their estimated pWCETs convolved to compute that of the complete CFG.

Fig. 8 Decomposition of a CFG into single-entry single-exit regions. a Example of single-entry single-exit regions. b Example decomposition into consecutive regions

For a decomposition into consecutive SESE regions to be valid, the nodes that delimit the segments have to be executed on all paths in the CFG. Alternative paths stemming from the same branch must be part of the same region. Similarly, all nodes in a loop nest belong to the same region. Such nodes can be captured by the notion of post-dominators: a node \(v_p\) post-dominates \(v_n\) if every path from \(v_n\) to the end node \(v_e\) goes through \(v_p\). All valid candidate nodes have to be post-dominators of the entry node \(v_s\).

Algorithm 4 outlines the general process of selecting the flush points. The common path of the CFG G, the post-dominators of its entry \(v_s\), is traversed in control flow order from the entry to the end of the graph. A new flush point is set if more than M misses can occur between the current post-dominator and the last selected flush point. The number of potential misses between two nodes, \(\textit{CountPotentialMisses}(v_n, v_e)\), is computed similarly to the maximum reuse distance, accounting for all accesses that are not guaranteed hits on paths between \(v_n\) and \(v_e\).

[Algorithm 4: selection of flush points among the post-dominators of the CFG entry]
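The selection loop itself is simple; a sketch follows, where the post-dominators of the entry (in control flow order) and the CountPotentialMisses bound are assumed to be provided by the surrounding analysis:

```python
def select_flush_points(post_doms_in_order, count_potential_misses, M):
    """Set a new flush point whenever more than M misses may occur
    since the last selected flush point (initially, the entry node)."""
    flush_points = []
    last = post_doms_in_order[0]
    for v in post_doms_in_order[1:]:
        if count_potential_misses(last, v) > M:
            flush_points.append(v)
            last = v
    return flush_points
```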

8 Evaluation

In this section, we examine the precision and runtime behaviour of the multi-path analysis introduced in this paper. In order to study the behaviour of the analysis with respect to different flow constructs, we provide results for a subset of the PapaBench application (Nemer et al. 2006), Debie (Holsti et al. 2000), and the Mälardalen benchmarks (Gustafsson et al. 2010). We present the results for a subset of benchmarks whose behaviour is representative of that observed across all experiments or which illustrate interesting corner cases. Table 2 includes, for each benchmark, the maximum number of accesses, the number of distinct cache blocks, and the cyclomatic complexity Y of the CFG (without and with WCEP reduction), which lower-bounds the number of paths. Also given are the analysis runtimes with 4 and 8 relevant blocks.

The control-flow graph and address extraction were performed using the Heptane (Colin and Puaut 2001) analyser, on the compiled MIPS R2000/R3000 executables obtained using GCC v4.5.2 without optimisations. We used the different methods to evaluate the contribution of a 16-way fully-associative instruction cache with 32B lines.

The miss distribution for the different benchmarks was computed using the contention-based approach, the collecting approach with different numbers of relevant blocks R, or the reuse distance-based path merging method outlined by Davis et al.  (2013). To provide a comparison across methods and replacement policies, a state-of-the-art analysis (Theiling et al. 1999) was used to determine the single, predicted worst-case bound on the number of misses for an LRU cache using the same parameters. We also performed a set of \(10^8\) simulations of the random cache behaviour to use as a baseline, effectively providing a lower bound on the pWCET. Here, the successor of each vertex in the simulated path was picked randomly among all of its valid successors, thus exploring the possible paths.

All of the WCEP reduction techniques described in Sect. 6 were used for the analysis of the random replacement cache; LRU caches do not exhibit the properties required by Theorem 7. The pWCET estimates obtained for each configuration of the analysis, estimation method and number of relevant blocks, were always tighter with WCEP reduction. Regarding simulation, WCEP reduction reduces the set of paths to one more representative of the worst-case scenarios, in some cases resulting in a single worst possible execution path. Yet there is no guarantee that these transformations are sufficient; the simulation results are hence only an indicative means of assessing the pessimism in our approach. Table 2 covers the impact of WCEP reduction on the cyclomatic complexity Y of the analysed benchmarks, an indicative lower bound on the number of paths in a CFG. Table 3 and Fig. 9 present the estimated number of misses for the analysis of LRU and random replacement caches with a cutoff probability of \(10^{-7}\), i.e. the number of predicted misses exceeded with a probability no greater than \(10^{-7}\) at runtime. Of the 48 analysed benchmarks, 17, highlighted in Table 3, show the same or better estimated performance with a random replacement cache, while 31 perform better with an LRU cache. Improvements over the LRU analysis tend to be limited. However, as further illustrated for Papabench t4 in Fig. 10c, there is a potential margin for improvement in the analysis of the random replacement policy to further tighten its results over the LRU replacement policy.
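The reported cutoff values can be read directly off a miss distribution. A minimal sketch, reusing the illustrative dictionary representation introduced earlier:

    def misses_at_cutoff(dist, cutoff=1e-7):
        # Smallest m such that P(misses > m) <= cutoff, i.e. the number
        # of misses exceeded at runtime with probability at most cutoff.
        exceedance = 1.0
        for m in sorted(dist):
            exceedance -= dist[m]
            if exceedance <= cutoff:
                return m
        return max(dist)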

Table 2 Properties of the analysed benchmarks and analysis runtime with R relevant blocks
Table 3 Estimated number of misses with LRU and random replacement caches
Fig. 9 Estimated number of misses under LRU and random replacement caches

The capacity-conserving join heuristic, which allocates capacity to the cache states identified as the most valuable, dominates the standard implementation. When comparing the precision of the different analysis techniques in Sect. 7.1, we therefore rely on the most favourable configuration, i.e. with WCEP reduction active and using the capacity-conserving join. The impact of the different mechanisms, joins and WCEP reduction, is further considered in Sects. 7.2 and 7.3 respectively. Finally, the complexity and runtime of different analysis configurations are evaluated in Sect. 7.4.

7.1 Relative precision of the analysis techniques

We first compare the results for different configurations in Fig. 10a–f. The figures show the complementary cumulative miss distributions (1-CDF) for a representative subset of benchmarks and configurations. The contention and path merging approaches are identified by red circles and blue crosses respectively. The number of relevant blocks R for the collecting approach is restricted to values of either 4 or 8 (identified by orange triangles), which is sufficient to capture most of the locality in the considered applications. The distribution obtained through simulation (identified by green squares) is also presented. The number of misses predicted by analysis for the deterministic LRU configuration is identified by a dark purple vertical line.

Fig. 10 Estimated miss distribution of the different analysis methods under LRU or random replacement policies and different analysis configurations (Color figure online)

In general, the use of the cache collecting method improves the precision of the analysis over the merging or purely contention-based approaches, even on complex control flows, as illustrated by papabench t4 in Fig. 10c. On simple control flows, the two approaches behave similarly, but the contention method still dominates the path merging method (see Fig. 10e). The merged path is as long as the longest path in the application but keeps the worst-behaving accesses from shorter paths. When WCEP reduction can mostly extract the worst-case execution path, as with qurt in Fig. 10a, the main difference between the two approaches comes from the more precise estimation of the hit probability of individual accesses using contention methods.

The precision of the collection methods and the relative performance of LRU and random caches mostly depend on the size of the working set of tasks w.r.t. the cache size or the number of relevant blocks. Similar behaviours were observed whether or not WCEP reduction successfully resulted in a single path. As the number of relevant blocks increases from 4 to 8, the estimates computed by the analysis improve. The gain is substantial on benchmarks like insertsort (see Fig. 10b) where some nested loops fit in the number of relevant blocks. However, precision is lost in qurt or ud w.r.t. the simulation results (see Fig. 10a, e), as the loops almost fit inside the cache but not within the number of relevant blocks. This also results in decreased performance w.r.t. LRU, which is in this case only subject to cold misses.

Another general observation is that, as expected, none of the distributions derived by analysis underestimates simulation. However, the simulation-based distributions cannot be guaranteed to be precise pWCET estimates. The simulations, lacking representative input data, may not exercise the worst-case paths. At best they provide lower bounds on the pWCET. We note that provision of representative input data is a key problem for measurement-based methods. There is no general conclusion regarding the dominance of the analysis of an LRU cache over simulation or analysis results for a randomised cache. When all iterative structures fit in the cache (see Fig. 10b), the LRU analysis outperforms the analysis of the random cache. As intra-loop conflicts grow, the benefits of the random replacement policy emerge and the new methods can capture such locality, resulting in tighter estimates than the analysis for a deterministic platform (see Fig. 10f). WCEP reduction reduces the reuse distance considered during analyses, whereas the stack distance for the LRU analysis remains the same since Theorem 7 does not apply. The path-merging approach under WCEP reduction may hence result in tighter estimates than the analysis of a deterministic replacement policy (see Fig. 10d).

The analysis results for the t4 and statemate benchmarks (see Fig. 10c, d) indicate that the cache collecting approach may sometimes compute more pessimistic estimates than the contention method. This behaviour stems from divergence in the control flow of both benchmarks. Path indeterminism hinders the relevant block heuristic: different blocks may be deemed relevant on parallel paths. In such cases, upon flow convergence, the join function cannot keep blocks of either alternative. Further, the R relevant blocks are still considered as occupying cache space from the point of view of the non-relevant ones, effectively reducing the cache size. This illustrates the need for more sophisticated heuristics which take into account the behaviour of the analysis on alternative paths, or vary the number of relevant blocks depending on the expected benefits and the computational cost.

In summary, our evaluation results show that the approaches to multi-path SPTA derived in this paper dominate and significantly improve upon the state-of-the-art path merging approach, determining less than one third as many misses in some instances. They were also shown to be incomparable with LRU analysis.

7.2 Benefits of the join operations to collecting approaches

The selection of relevant blocks is undoubtedly an important factor in the precision of the cache collecting approach. We compared additional configurations of the analyser, assuming a fixed number of 8 relevant blocks, to examine the impact of the join operations on the precision of the analysis. In particular, the experiments presented in Fig. 11a–f introduce a non-state-conserving approach on path convergence. Using the empty configuration (identified by blue pentagons), the cache contents are set to \(\emptyset \) on path convergence and the miss distribution is the maximum distribution of the alternative paths. The capacity configuration (identified by orange triangles) on the other hand corresponds to the use of the improved join operator. The intermediate line (purple squares) identifies the simple join operation first introduced in Sect. 4.6.
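One plausible formalisation of the maximum distribution used by the empty configuration is the upper envelope of the alternatives' exceedance functions, sketched below with the same illustrative distribution representation as before.

    def exceedance(dist):
        # Map each miss count m to P(misses > m).
        remaining, exc = 1.0, {}
        for m in sorted(dist):
            remaining -= dist[m]
            exc[m] = remaining
        return exc

    def max_distribution(dist_a, dist_b):
        # Pointwise worst case (highest exceedance) of two distributions,
        # converted back to point probabilities.
        exc_a, exc_b = exceedance(dist_a), exceedance(dist_b)
        result, prev, last_a, last_b = {}, 1.0, 1.0, 1.0
        for m in sorted(set(exc_a) | set(exc_b)):
            last_a = exc_a.get(m, last_a)  # carry exceedance forward
            last_b = exc_b.get(m, last_b)
            envelope = max(last_a, last_b)
            result[m] = prev - envelope
            prev = envelope
        return result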

Benchmarks which exhibit locality across branches of their conditionals benefit from the join function, as illustrated by crc, lcdnum, expint and compress in Fig. 11a, b, d and e respectively. The combination of both WCEP reduction and capacity conservation on flow convergence leads to tighter pWCET estimates in the case of crc, lcdnum, expint and compress. WCEP reduction cannot remove all branches, as some may not fall under the required constraints. The lcdnum benchmark is composed of a switch statement. The later cases share blocks with the conditions of the earlier ones, but add their own blocks. Hence, the resulting cache states differ but include one another. They can be captured by the capacity-conserving heuristic. By construction, the capacity-conserving join results in the tightest estimates and provides important improvements over the standard join on the crc application. The benefits of the capacity-conserving join over the standard one are more marginal on the compress benchmark (see Fig. 11e), which exhibits few branches with reused blocks not captured by the WCEP reduction.

Some benchmarks see little benefit from the proposed join function. statemate (see Fig. 11c) is composed of many nested conditional constructs which share few or no blocks. The cache contents diverge with the flow, and the join operation cannot assume any block is present. Locality in the statemate benchmark is instead captured by the empty branch elimination step of WCEP reduction. Some applications like matmult (see Fig. 11f) are reduced to a single path through WCEP reduction. Such scenarios obviously do not benefit from any join operation.

Fig. 11 Estimated miss distribution for the random replacement policy under a fixed number of relevant blocks \(R=8\) and WCEP reduction, and different join operations (Color figure online)

7.3 Impact of WCEP reduction on analysis and simulation

WCEP reduction reduces path redundancy through the elimination of selected paths, such that both the analysis and the simulations are performed on a reduced control flow. We computed the miss distribution of the benchmarks with both families of methods, with and without WCEP reduction. We present the results for a fixed probability of \(10^{-7}\), i.e. recording the number of predicted misses that will be exceeded at runtime with a probability no greater than \(10^{-7}\). For each of the ludcmp, cnt and compress benchmarks in Fig. 12a, b, and c respectively, we present the result using the original CFG, then adding WCEP unrolling (\(+\textit{unroll}\)), empty branch elimination (\(+\textit{branch}\)), and the renaming-based simple branch elimination (\(+\textit{rename}\)).

Fig. 12 Estimated miss counts at a fixed probability of \(10^{-7}\) under random replacement using different reduction configurations and R relevant blocks. a Estimated miss count at \(10^{-7}\) for the ludcmp benchmark. b Estimated miss count at \(10^{-7}\) for the cnt benchmark. c Estimated miss count at \(10^{-7}\) for the compress benchmark (Color figure online)

Given a fixed configuration of the analysis (identified by a symbol and a colour), the distribution obtained with WCEP reduction is always smaller than the one obtained without it. In other words, the analysis is more precise when all transformations are active. Because of path redundancy, an increase in the number of relevant blocks can sometimes reduce the precision of the resulting estimate. This phenomenon still occurs when WCEP reduction is applied, but it is less prevalent.

The impact of the different transformations on the precision of the analysis results depends on the characteristics of the application to which they are applied. All transformations can be beneficial to the collecting approach. The contention approach may even benefit from empty branch elimination (see Fig. 12a): when a block is accessed only on the non-empty alternative of a conditional, its reuse distance is lowered. For other accesses, such paths impact neither the reuse distance nor the contention, as they hold no access. The elimination of redundant paths on the other hand increases the precision of both methods.

The cnt benchmark, in Fig. 12b, illustrates an interesting scenario. When empty branch elimination is used in combination with WCEP unrolling, the collecting methods are slightly less precise than when using WCEP unrolling on its own. This illustrates a limit of the ranking heuristic used by the capacity-conserving join. Empty branches result in a reduced minimum forward reuse distance for some accesses. This in turn impacts the allocation of capacity to cache states on path convergence, resulting in a better allocation without empty branch elimination.

We performed a set of \(10^8\) simulations on the control flow graphs of benchmarks with and without reduction. WCEP reduction results in greater measured execution time distributions. The transformations proposed in this paper eliminate some but not all redundant paths, and reduce the set of possible paths to one more focussed on worst-case scenarios. As for the analysis methods, the impact of each transformation depends on the benchmark to which it is applied. However, the application of WCEP reduction is in the general case not sufficient to guarantee the representativeness of the resulting paths. In the case of the expanded compress benchmark, conditionals within loop structures are kept, and there is no guarantee as to which alternation of paths results in the worst case. On the other hand, the expanded matmult benchmark consists of a single trace of accesses.

7.4 Execution time

The runtime of the analysis, using a C++ prototype implementation, is presented in Fig. 13 using the WCEP reduction method and 0 to 12 relevant blocks. Measurements were made on an 8-core 64-bit 3.4 GHz CPU running the Ubuntu 12.04 operating system, with 2 instances of the analyser running in parallel. WCEP reduction was used as it increases the precision of the estimated cache states, albeit at the cost of an increased analysis runtime. We observe a growth in runtime as the number of relevant blocks increases. The runtime of the analysis is also significantly higher for the larger benchmarks, edn, compress, and ndes, which contain the largest number of nodes.

The abstract cache state representation is partially responsible for the high runtime on the largest benchmarks. The complexity of the update and join operations is tied to both the number of relevant blocks R and the number of potential misses on the longest path. (Fig. 13 combines the impact of both the program length and the number of relevant blocks, whereas Fig. 14 focuses on the number of instructions.) The number of relevant blocks affects the number of different cache contents which are tracked by the analysis at each step. Further, as the number of analysed accesses increases, so does the size of the distributions held in the cache states, and therefore the cost of operations such as the merge.

Fig. 13 Runtime of the analysis for the presented benchmarks

Fig. 14 Runtime of the analysis for repeated accesses to a sequence of n distinct blocks

The complexity of the analysis is of the order of \(O(|S|\times m\times \log (m))\), where m is the number of accesses in the program and |S| upper-bounds the number of possible cache states. |S| is the number of combinations of N or fewer elements picked amongst the R relevant blocks; when \(R < N\), \(|S| = 2^R\). As demonstrated in the previous set of experiments, a limited number of relevant blocks is effective for typical cache associativities.
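The bound on |S| is easily evaluated; for instance, the following check illustrates the \(2^R\) case for the 16-way cache considered in our experiments.

    from math import comb

    def state_bound(R, N):
        # Subsets of at most N elements amongst the R relevant blocks.
        return sum(comb(R, k) for k in range(min(R, N) + 1))

    assert state_bound(8, 16) == 2 ** 8    # R < N, hence |S| = 2^R = 256
    assert state_bound(12, 16) == 2 ** 12  # 4096 states for R = 12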

7.4.1 Reducing the complexity of the approach

The complexity of the introduced approach to SPTA for multi-path programs depends on both the number of relevant blocks R and the number of accesses m in the program. This section further examines the contribution of the program size to the runtime of the analyses and presents preliminary work towards its reduction using CFG partitioning as presented in Sect. 6.4.2.

Fig. 15 Runtime of the analysis under CFG partitioning with segments of 1000 potential misses (Color figure online)

Fig. 16 Estimated miss distribution for matmult under LRU and random replacement with analysed segments of \(M=1000\) potential misses (Color figure online)

The results presented in Fig. 13 focussed on the impact of the number of cache states through its ties to the number of relevant blocks R; the number of accesses m in each benchmark is fixed. We evaluate the impact of m on the complexity of the analysis in Fig. 14. It presents the runtime of the analysis of a repeated sequence of n distinct blocks, assuming the same 16-way cache as in our previous experiments. The number of blocks n in the repeated sequence, the number of relevant blocks R and the cache associativity N impact the possible number of cache states |S| and therefore the initial growth of the runtime. Once the set of cache states to consider stabilises, the runtime for the different configurations follows a similar \(m\times \log (m)\) growth curve.

We defined a simple algorithm to split a program into consecutive SESE regions with at least M non-guaranteed hits on their longest path (Sect. 6.4.2). Segments are analysed independently assuming an empty input cache, and the resulting pWCETs convolved to compute that of the full program. This approach effectively reduces the set of cache states on region boundaries to the empty state, a safe over-approximation as defined in Sect. 4.5. The resulting analysis runtime for the largest benchmarks is presented in Fig. 15, assuming a segment size M of 1000 misses.

Program partitioning reduces the runtime of our method compared to the analysis of the program as a single segment (see Fig. 13). As the analysis is applied to same-sized regions in all cases, the runtime of all benchmarks follows a similar growth with the number of relevant blocks. The remaining differences in runtime come from several factors. First, the length of the program impacts the complexity of the final convolution of the pWCETs of the segments. Second, consecutive segments of a multi-path program may hold more than M misses: splits can only occur on a restricted set of vertices, namely those which post-dominate the entry of the CFG. Further, as shown in Fig. 14, the misses and working set of each segment impact the number of cache states kept during analysis. Finally, flow complexity also increases analysis time, as more paths need to be considered in a single segment.

Figures 16, 17 and 18 present the distributions computed by the analyses for a relevant subset of the considered configurations. They present the analysis results for \(R=8\) relevant blocks using a single or multiple segments (filled or hollow triangles respectively). They also include the results for 12 relevant blocks under partitioning (hollow blue pentagons), as the runtime of this configuration is below that of the \(R=8\) single-segment one. Simulation and deterministic LRU analysis results are also included (green squares and a dark purple line respectively). WCEP reduction is active in all cases, except for LRU.

The approximation of the cache contents on segment boundaries has adverse effects on the precision of the analysis. Indeed, the first few accesses in a segment may be classified as misses while the contents of the cache are being reloaded. This is illustrated for the matmult and edn benchmarks respectively in Figs. 16 and 17. matmult exhibits substantial locality at runtime; the impact of segment boundaries is such that it overshadows the increase in the number of relevant blocks. Yet the segmented analysis with 12 relevant blocks only takes 285 s, against more than 7000 s for the single segment with \(R=8\). The precision gain from the increase in the number of relevant blocks is much more important for edn, while the runtime of the \(R=12\) segmented analysis remains lower than that of the \(R=8\) full-program one (2000 s vs. 13,000 s).

Fig. 17 Estimated miss distribution for edn under LRU and random replacement with analysed segments of \(M=1000\) potential misses (Color figure online)

Fig. 18 Estimated miss distribution for fft under LRU and random replacement policies with analysed segments of \(M=1000\) potential misses (Color figure online)

We observed that the fft benchmark (Fig. 18) only marginally benefits from an increase in the number of relevant blocks. The approximations on segment boundaries have almost no impact on the precision of the computed estimates given a fixed number of relevant blocks, \(R=8\). There is little reuse between the identified SESE regions in the program.

8 Conclusion and perspectives

The main contribution of this paper is the introduction of a more effective approach to multi-path SPTA for systems that use a cache with an evict-on-miss random replacement policy. The methods presented in this paper build upon existing approaches for analysing single-path programs. We have pointed out where existing techniques for deterministic or probabilistic analyses could be applied to make improvements (Pasdeloup 2014; Maxim et al. 2012; Wegener 2012; Theiling et al. 1999).

We introduced conditions for the computation of valid upper bounds on the possible cache states on control flow convergence and presented a compliant transfer function to illustrate the requirements. We further refined this join operation to improve the precision of the information kept on control flow convergence. This more sophisticated join operation relies on a heuristic ordering of cache states depending on their expected benefits for the upcoming accesses.

We also defined path redundancy, identifying path inclusion as a sub-case of redundancy. Based on these results, we presented worst-case execution path (WCEP) reduction to reduce the set of paths explored by the analysis, improving the tightness of the resulting timing estimates. We identified and proved the validity of sufficient conditions for the application of access renaming. This transformation allows for the identification of redundant paths beyond simple inclusion.

Our evaluations show that the derived analysis is effective at capturing the cache locality exhibited by different applications. The new methods significantly outperform the existing path merging approach, predicting less than a third as many misses on one of the benchmarks. More precise results can be attained at the cost of an increased, user-controlled, complexity. The resulting estimates are also incomparable to those for deterministic LRU caches. The program transformations introduced proved effective at improving the precision of all SPTA configurations; of the 48 analysed benchmarks, 17 show the same or better estimated performance with a random replacement cache, while 31 perform better with an LRU cache.

8.1 Perspectives

This research can be extended in many ways. First, the transfer functions on control flow convergence compute valid bounds with regards to the ordering of cache states, but exhibit pessimism; different, more complex ranking heuristics could spread the capacity of cache states over more appropriate candidates. Second, the complexity of operations on the abstract domain contributes to the increasing runtime of the analysis as it traverses deep flow graphs. Future work could look at the interaction between existing methods to balance the complexity and the precision of the analysis. Another avenue for improvement is the heuristic for the selection of relevant cache blocks. More advanced approaches might improve the tightness of the results, or even introduce a varying number of relevant blocks across the application to focus the analysis effort on a specified area of the code.

Our approach integrates both cache behaviour and worst-case path estimation. Flow facts regarding loop iterations can be taken into account during unrolling. We nevertheless intend to take more flow facts into account to increase the applicability of the approach and further improve the effect of WCEP reduction on path complexity. We also intend to investigate the use of static methods to improve the representativeness of the considered paths, and as a consequence ensure the soundness and improve the precision of measurement-based approaches. Finally, the application of static probabilistic timing analysis to more complex cache configurations, including multiple levels of cache, remains an open problem (Lesage et al. 2013).