1 Introduction

Speculative execution attacks exploit micro-architectural design flaws and side channels in modern processors, enabling unprivileged processes to exfiltrate sensitive information across security boundaries. These attacks have seriously undermined fundamental security assumptions made in the design of operating systems and have been in the spotlight since their first public disclosure in early 2018. The most prominent examples are Meltdown [27] and Spectre [23]; later variants, such as Foreshadow [41], Micro-architectural Data Sampling (MDS) [31, 37, 43], and Load Value Injection (LVI) [42], are also well known.

In this paper, we apply the principle of defense-in-depth and propose Specularizer, a software framework for uncovering speculative execution attacks using the hardware performance tracing features available in commodity processors, i.e., hardware performance counters (HPC) and processor trace (PT). Specularizer complements existing defenses against speculative execution attacks by offering the capability of logging both architectural and micro-architectural behaviors of the monitored software to enable forensic analysis and offline attack detection.

In contrast to prior work that identifies speculative execution attacks by detecting their cache side channels [19], an approach easily circumvented by attacks using alternative side channels, Specularizer is inspired by the following key observation: although speculative execution attacks may leverage a variety of micro-architectural side channels (e.g., TLBs, caches) to leak secrets from speculatively executed instructions, the invariant of these attacks is the method by which speculative execution is triggered. In exception-based attacks, speculative execution is triggered by exceptions, which are either handled or suppressed; in misprediction-based attacks, speculative execution is triggered either by control-flow misprediction or by misprediction in memory disambiguation. Therefore, Specularizer uses the inevitable execution patterns of exceptions and mispredictions as attack signatures.

We identify PT packets and HPC events that reveal information crucial for attack detection, such as control-flow transfers for exception handling and TSX aborts, mispredicted branch instructions, and machine clears due to memory order conflicts. While each type of PT or HPC record alone is insufficient for reconstructing all attack activities, collectively they offer deep insight into the micro-architectural behavior of the monitored applications. We therefore develop techniques to combine HPC and PT data into execution traces. With these traces, we build classification models based on the Long Short-Term Memory (LSTM) network to classify attack and benign programs.

Specularizer consists of two components: an online trace collection component, integrated into the operating system of the production machine on which the monitored application runs, and an offline attack detection component that performs HPC and PT record parsing, trace processing, and trace classification, tasks that are time-consuming and hard to finish in real time. In fact, HPC- or PT-based monitoring systems rarely perform real-time analysis [14, 51]. As such, Specularizer is best suited for VM- or container-based cloud systems, where suspicious workloads from untrusted cloud tenants are monitored on cloud servers and forensic analyses are performed on separate servers to detect attack activities. While deferred attack detection does not prevent attacks from happening, it can trigger further investigation to identify their sources and assess their consequences.

We have implemented a prototype of Specularizer and evaluated its effectiveness and efficiency in a lab setting. Specifically, to evaluate Specularizer, we develop parameterized models for each type of speculative execution attack we aim to detect, and then automatically generate thousands of attack variants by tuning the parameters of these models. With the data sets collected from both benign and attack samples, the evaluation of Specularizer suggests promising detection accuracy with reasonable performance overhead. The evaluation results also indicate that Specularizer significantly raises the bar for performing speculative execution attacks even if the attackers understand the detection mechanism.

Contributions. The paper makes the following contributions: ➀ Specularizer is a software tool that detects speculative execution attacks by their triggers of speculative execution rather than by specific covert channels. ➁ Specularizer provides new insights into the combined use of multiple performance tracing hardware features (i.e., PT and HPCs) in the context of offline attack detection. ➂ The paper presents parameterized models of speculative execution attacks and methods to automatically generate attack variants with varying attack success rates. ➃ The paper presents a prototype implementation of Specularizer and empirically evaluates its selection of parameters, its effectiveness, and its performance overhead.

2 Background

Speculative Execution Attacks. A speculative execution attack comprises the following components [3]: a speculation primitive, which triggers speculative execution of instructions; a disclosure gadget, which transmits information through a side channel; and a disclosure primitive, which reads the side-channel information transmitted by the disclosure gadget. As such, a speculative execution attack proceeds in the following steps: ➀ execute the speculation primitive to trigger speculative execution of instructions; ➁ use the speculative instructions (including the speculation primitive itself) to access secrets across the security boundary; ➂ speculatively execute the disclosure gadget to encode the secret value into the cache states; ➃ use the disclosure primitive to decode the secret data from the cache states.

According to their speculation primitives, we classify speculative execution attacks into the following three categories [11]. Misprediction-based attacks leverage branch, Store-To-Load (STL), and memory-order-buffer mispredictions as the speculation primitive and perform the attack before the correct target is resolved. Exception-based attacks and assistance-based attacks use exceptions (e.g., page faults, general protection faults) and microcode assists (e.g., those caused by line-fill buffer, store buffer, and load port conflicts [1]) as the speculation primitive, respectively, and speculatively execute instructions before these events are handled by the processor.

Performance Tracing Hardware. Intel PT is a hardware feature available in Intel processors since Broadwell. It records information about the control-flow transfers of software programs with very low performance overhead; the PT hardware generates packets from which the timestamped control flow of a program can be reconstructed [17, 45]. HPCs are a set of model-specific registers that count user-selected architectural or micro-architectural processor events. Each HPC register can be configured to count a specific event supported by the processor; at runtime, whenever the specified event happens, the corresponding counter is incremented.

HPCs offer software two approaches for collecting event samples. First, when the performance monitor interrupt (PMI) is enabled for a specific counter, a PMI is triggered when the counter overflows, giving software an opportunity to handle the HPC data [7]. However, the large volume of interrupts dramatically increases the performance overhead. Second, to address this performance issue, Intel introduced Precise Event-Based Sampling (PEBS), which stores event records in a buffer (dubbed the Debug Store (DS) area); a single interrupt is triggered only when the buffer is almost full (as determined by a threshold).
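To see why PEBS generates far fewer interrupts than per-event PMIs, consider the following back-of-the-envelope sketch. The record size, threshold, and event count are illustrative assumptions, not values from the paper:

```python
# Back-of-the-envelope comparison of interrupt volume for PMI vs. PEBS
# sampling. Record size, DS-area size, and threshold are assumptions.

def pmi_interrupts(n_events: int, sample_period: int) -> int:
    """One PMI is raised each time the counter overflows, i.e., once
    per `sample_period` occurrences of the event."""
    return n_events // sample_period

def pebs_interrupts(n_events: int, record_bytes: int, ds_bytes: int,
                    threshold: float = 0.9) -> int:
    """With PEBS, records accumulate in the DS area and a single
    interrupt fires each time the buffer passes the threshold."""
    records_per_interrupt = int(ds_bytes * threshold) // record_bytes
    return n_events // records_per_interrupt

events = 10_000_000
n_pmi = pmi_interrupts(events, sample_period=1)   # one PMI per event
n_pebs = pebs_interrupts(events, record_bytes=192, ds_bytes=4 << 20)
# n_pebs is several orders of magnitude smaller than n_pmi
```

The sketch captures the trade-off the text describes: PMI accuracy costs one interrupt per sampled event, while PEBS amortizes one interrupt over thousands of buffered records.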

3 Threat Model and Specularizer Overview

Threat Model. All misprediction-based and exception-based attacks are in scope for this paper. Our method detects these two types of attacks by monitoring their execution of the speculation primitives, which are either a branch instruction that takes time to resolve its target address or a memory load that accesses data across a security boundary. We treat MDS and LVI attacks that are triggered by exceptions, the most common cases in current state-of-the-art attack examples, as exception-based attacks, and hence Specularizer will detect those attacks as well.

Specularizer Architecture. The overall architecture of Specularizer is shown in Fig. 1. It consists of two components: Online Trace Collection and Offline Attack Detection. Online Trace Collection runs as a system program on the production system and produces execution traces collected using PT and HPC. Offline Attack Detection runs offline, possibly on a separate machine, and comprises two parts (i.e., Trace Processing and Attack Detection) that analyze the collected traces to identify speculative execution attacks.

Fig. 1. Architecture of Specularizer.

Online Trace Collection. To detect exception-based speculative execution attacks, Specularizer monitors exceptions using PT. When the attacks use TSX to suppress exceptions, PT packets can record TSX aborts; when the attacks handle the exceptions directly, PT packets can record control-flow transfers that correspond to exception handling.

To detect misprediction-based attacks, Specularizer needs to monitor the pattern of mispredictions, which includes mispredictions in control-flow predictors (branch prediction units such as the BTB, PHT, and RSB) and in data-flow predictors (the memory disambiguator in the load/store buffers). PT alone is insufficient to monitor these micro-architectural events, so HPCs are used instead. The limitation of HPCs is that their records are asynchronous with the execution context, which by itself is insufficient for detecting misprediction-based attacks. To address this problem, Specularizer uses Intel PT to provide the execution contexts.

Offline Attack Detection. During the execution of benign programs, exceptions, TSX transaction aborts, and mispredictions in control flow and data loading are normal. Therefore, we cannot simply detect speculative execution attacks using exception handling/suppression and branch/data misprediction as signatures. Instead, patterns of exceptions and mispredictions must be learned from both benign and attack programs and used to detect attacks in the monitored program.

Fig. 2. Architecture and workflow of trace collection.

4 Trace Collection

The overall workflow of trace collection (shown in Fig. 2) is as follows: ① Specularizer enables PT and HPC to monitor the execution of the target program and specifies the memory buffer that records the execution traces. ② When the memory buffer is full, an interrupt is triggered. ③ After replacing the full buffer with an empty one for the hardware to continue recording data, the interrupt handler sends a signal to the userspace data collector. ④ Upon receiving the signal, the data collector reads data from the full buffer. ⑤ Finally, the data collector saves the collected data into files.
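The buffer hand-off in steps ② to ⑤ can be sketched as follows. This is a minimal illustration of the double-buffering idea, not Specularizer's actual implementation; the class, buffer capacity, and record values are made up:

```python
# Sketch of the hand-off between the interrupt handler and the
# user-space data collector (steps 2-5 above). Illustrative only.
from collections import deque

class TraceCollector:
    def __init__(self, capacity: int = 4):
        self.active = []            # buffer the hardware writes into
        self.full_queue = deque()   # full buffers waiting to be dumped
        self.capacity = capacity
        self.saved = []             # records persisted "to file"

    def hw_write(self, record):
        """Hardware appends a record; a full buffer takes the
        interrupt path (step 2)."""
        self.active.append(record)
        if len(self.active) >= self.capacity:
            self.on_interrupt()

    def on_interrupt(self):
        """Interrupt handler: swap in an empty buffer so the hardware
        can keep recording, then signal the collector (step 3)."""
        self.full_queue.append(self.active)
        self.active = []
        self.signal_collector()

    def signal_collector(self):
        """User-space collector drains the full buffers (steps 4-5)."""
        while self.full_queue:
            self.saved.extend(self.full_queue.popleft())

collector = TraceCollector(capacity=4)
for i in range(10):
    collector.hw_write(i)
# 8 records flushed in two full buffers; 2 remain in the active buffer
```

The key design point mirrored here is that the interrupt handler does no I/O itself; it only swaps buffers and signals, keeping the time spent with recording paused as short as possible.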

Collecting Traces from HPC. To gain visibility into micro-architectural events, Specularizer activates HPCs to monitor branch mispredictions (for both direct and indirect branches) and machine clear events caused by memory order conflicts, by enabling the events BR_MISP_RETIRED.ALL_BRANCHES and MACHINE_CLEARS.MEMORY_ORDERING. Specularizer uses PEBS to monitor the branch misprediction event and a PMI to monitor the memory order conflict event, as the latter is not available in PEBS mode. The two events are monitored simultaneously on different HPCs. When monitoring in PMI mode, the overhead mainly comes from PMI handling. When monitoring in PEBS mode, although the interrupts are significantly reduced, two sources of overhead remain: first, writing each PEBS record into the DS area takes about 200 ns [8]; second, DS-area-overflow interrupts must be generated when the DS area (maximum size 4 MB) is full. Thousands of interrupts are generated during trace collection for one application.

Two performance optimizations were implemented. First, Specularizer implements a ring buffer [51] to cache the data in the DS area. Specifically, Specularizer allocates two buffers for the DS area to reduce the overhead introduced by dumping data inside the interrupt handler. When the data in one of the buffers reaches the threshold, Specularizer switches the buffer used by the DS area upon receiving the interrupt, and a signal is sent to the user-space component of Specularizer to dump data from the full buffer. Second, to reduce the number of generated PEBS records, Specularizer tunes the PEBS sampling rate (\(\rho \)), which indicates the fraction of events (\(1/\rho \)) sampled by PEBS to create PEBS records. \(\rho >1\) means events are sampled less frequently, which lowers the performance overhead at the cost of missing some branch misprediction information. We will evaluate the impact of \(\rho \) on detection accuracy in Sect. 8.

Collecting Traces from PT. To collect control-flow transfer and timestamp information, Specularizer activates Intel PT by setting the following control bits of the MSR IA32_RTIT_CTL: TraceEn (to enable PT), BranchEn (to generate control-flow-related packets, e.g., TNT, TIP & FUP), OS & User (to monitor both user-mode processes and kernel threads), and TSCEn, MTCEn & CYCEn (to generate timestamp-related packets, e.g., TSC, MTC & CYC).

The overhead incurred in generating PT packets is negligible. The main overhead comes from handling the memory buffer that stores the PT packets when it is full. Unlike PEBS’s DS area, which has a fixed size (i.e., 4 MB), the memory buffer used by PT can vary. Specifically, PT uses a Table of Physical Addresses (ToPA) to store all generated packets, which is a linked list that links multiple output regions. Therefore, the total size of the ToPA is flexible, and the number of generated interrupts can be controlled to decrease the runtime performance overhead.

5 Trace Processing

Specularizer processes HPC events and PT packets offline, possibly on a machine different from the host that runs the Specularizer monitors. The exception-based output sequences are generated using PT traces only, while the misprediction-based output sequences are extracted from both PT traces and HPC records.

5.1 Processing Exceptions

Among the three approaches to tackling exceptions in exception-based speculative execution attacks, namely handling exceptions, suppressing exceptions with TSX, and suppressing exceptions with branch misprediction, the first two trigger an indirect control-flow transfer. Therefore, Specularizer extracts exception-triggered control-flow transfers from the collected PT records. The third case is categorized as misprediction-based and discussed later.

Extracting Addresses of Exceptions. When an exception is handled by an exception handler, the control flow transfers from user space to kernel space. From the PT packets, we can extract all kernel traces, i.e., sequences of instructions executed in kernel space. Afterwards, by comparing these traces against the kernel symbol table, we can identify the types of the exceptions.

When the exception is suppressed by a TSX transaction, the exception type is not revealed through kernel traces. Nevertheless, the exception is recorded by the MODE packet, whose TXAbort field is set to 1. The address of the instruction that triggers the TSX abort is recorded by the FUP packet that follows.

Extracting Timestamps of Exceptions. PT can be used to recover the timestamps of exceptions, as PT records the following time-related packets: Timestamp Counter (TSC) packets provide the wall-clock time (wc); Mini Time Counter (MTC) packets are generated periodically based on the core-crystal clock (ccc); a TMA packet is generated immediately after each TSC packet, with a common timestamp copy (ctc) value in its payload; and a Cycle Accurate (CYC) packet is generated immediately preceding each TIP packet, providing the accurate cycle count since the last CYC packet. Because PT generates a TIP packet when an exception is raised, Specularizer extracts exception timestamps by calculating the ccc value for each TIP packet based on the relationship between these time-related packets [7].
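The interplay of these packets can be sketched in simplified form. This sketch ignores counter wrap-around and frequency changes (CBR packets), and the TSC-ticks-per-crystal-clock ratio is an assumed constant, so it is an illustration of the bookkeeping rather than a faithful PT decoder:

```python
# Simplified sketch of timestamp reconstruction from PT timing packets.
# TSC_PER_CTC is an assumed constant ratio; a real decoder derives it
# from CPUID and handles wrap-around and CBR frequency changes.
TSC_PER_CTC = 2

class PtClock:
    def __init__(self):
        self.base_tsc = 0   # wall-clock anchor from the last TSC packet
        self.base_ctc = 0   # crystal-clock value paired with it (TMA)
        self.cur_ctc = 0    # crystal clock from the most recent MTC
        self.cyc = 0        # CPU cycles accumulated from CYC packets

    def on_tsc_tma(self, tsc: int, ctc: int):
        """TSC packet plus the TMA packet that immediately follows."""
        self.base_tsc, self.base_ctc, self.cur_ctc = tsc, ctc, ctc
        self.cyc = 0

    def on_mtc(self, ctc: int):
        """Periodic MTC packet carrying the current crystal clock;
        resets the sub-MTC cycle accumulator."""
        self.cur_ctc = ctc
        self.cyc = 0

    def on_cyc(self, cycles: int):
        """CYC packet: cycles elapsed since the previous CYC packet."""
        self.cyc += cycles

    def timestamp(self) -> int:
        """Approximate TSC value at the current point in the trace,
        e.g., at a TIP packet preceded by a CYC packet."""
        ctc_delta = self.cur_ctc - self.base_ctc
        return self.base_tsc + ctc_delta * TSC_PER_CTC + self.cyc

clock = PtClock()
clock.on_tsc_tma(tsc=1_000_000, ctc=500)
clock.on_mtc(ctc=510)
clock.on_cyc(7)
ts = clock.timestamp()  # base + 10 crystal ticks * 2 + 7 cycles
```

The TSC/TMA pair anchors the wall clock, MTC packets advance it at crystal-clock granularity, and CYC packets refine it down to cycle granularity at the TIP packet of interest.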

Output. Specularizer analyzes each PT trace offline and identifies and records all exceptions, the virtual address of the instruction that triggers each, and the timestamps of the identified exceptions. Two parameters, \(\delta \) and \(\mu \), govern the data output: PT traces are segmented into windows of \(\delta \) CPU cycles, and the attack detection algorithm runs over the traces in each window.

The output of this step is a set of sequences of two-tuples, denoted \(X_{e_k} = [(c_1, t_1), (c_2, t_2), \cdots , (c_n, t_n)]\), where \(e_k\) is the virtual address of the instruction that triggers the exception, \(c_i\) indicates whether an exception occurred at the \(i^{th}\) occurrence of the virtual address \(e_k\), \(t_i\) is the timestamp of its occurrence, and \(\mu \) is the length of each sequence, which is the input to the attack detection model in Sect. 6. For each \(\delta \)-cycle window, one or multiple sequences are gathered: if the total number of exceptions is greater than \(\mu \), a new sequence is created; a sequence shorter than \(\mu \) is padded to length \(\mu \) with (0, 0). We will evaluate the impact of different values of \(\delta \) and \(\mu \) on the effectiveness of the detection algorithm in Sect. 8.
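The segmentation rule can be sketched as follows. Assigning events to windows by integer division of their timestamps is our simplification of the windowing described above; the event values are made up:

```python
# Sketch of the segmentation step: group (c_i, t_i) tuples of one
# virtual address into delta-cycle windows, cut each window into
# sequences of exactly mu tuples, and pad short sequences with (0, 0).
def segment(events, delta, mu):
    """events: list of (c_i, t_i) tuples sorted by timestamp t_i."""
    windows = {}
    for c, t in events:
        windows.setdefault(t // delta, []).append((c, t))
    sequences = []
    for _, evs in sorted(windows.items()):
        for i in range(0, len(evs), mu):          # new sequence past mu
            chunk = evs[i:i + mu]
            chunk += [(0, 0)] * (mu - len(chunk))  # pad up to length mu
            sequences.append(chunk)
    return sequences

events = [(1, 120), (1, 180), (1, 230), (1, 1450)]
seqs = segment(events, delta=1000, mu=2)
# window [0, 1000): [(1,120),(1,180)] and [(1,230),(0,0)];
# window [1000, 2000): [(1,1450),(0,0)]
```

Each returned sequence has fixed length \(\mu \), matching the fixed-size input the LSTM model in Sect. 6 expects.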

5.2 Identifying Branch and Data Misprediction

Specularizer identifies branch and data misprediction from the recorded HPC events. Particularly, Specularizer first extracts the timestamp of each misprediction event from the HPC records, then extracts the timestamp of each branch instruction from the PT traces. Finally, by aligning the timestamp information from the HPC records and PT traces, Specularizer outputs traces of correctly predicted and mispredicted branches for attack detection.

HPC Records Parsing. Specularizer parses the HPC records, identifies the records related to either branch misprediction or data misprediction, and outputs a sequence of two-tuples \([(c_1, t_1), (c_2, t_2), \cdots , (c_n, t_n)]\), where \(c_i\) is the event type (i.e., branch misprediction or data misprediction) of the \(i^{th}\) misprediction and \(t_i\) is its timestamp. The accuracy of the misprediction information depends on the PEBS sampling rate \(\rho \) discussed in Sect. 4.

PT Trace Reconstruction. Specularizer reconstructs the program execution trace from the packets generated by the PT hardware and uses the PT timestamp packets to recover the timestamp of each branch instruction, following the method described in Sect. 5.1. By combining the execution trace with the timestamp information, Specularizer outputs a sequence of two-tuples \([(b_1, t_1), (b_2, t_2), \cdots , (b_n, t_n)]\), where \(b_i\) is the virtual address of the \(i^{th}\) executed branch and \(t_i\) is the timestamp when the branch is executed.

HPC and PT Alignment. Specularizer aligns HPC records with the control-flow transfer information collected from PT to attribute HPC records to a specific branch of the program. The alignment can be performed by matching the timestamp value \(t_i\) in the two sequences. Particularly, for each element \((c_k, t_k)\) in the HPC sequence, we search the PT sequence to find an element with index i that satisfies \(t_{i} \le t_{k} < t_{i+1}\). Then we associate \((c_k, t_k)\) with \(b_i\).
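The alignment step can be sketched with a binary search over the sorted PT timestamps. The addresses and event names below are illustrative:

```python
# Sketch of the timestamp alignment: attribute each HPC misprediction
# record (c_k, t_k) to the PT branch b_i whose timestamp satisfies
# t_i <= t_k < t_{i+1}.
from bisect import bisect_right

def align(hpc_seq, pt_seq):
    """hpc_seq: [(c_k, t_k)]; pt_seq: [(b_i, t_i)] sorted by t_i."""
    pt_times = [t for _, t in pt_seq]
    out = []
    for c, t in hpc_seq:
        i = bisect_right(pt_times, t) - 1   # last index with t_i <= t
        if i >= 0:                          # drop records before the trace
            out.append((pt_seq[i][0], c, t))
    return out

pt = [(0x401000, 100), (0x401050, 200), (0x401000, 300)]
hpc = [("BR_MISP", 150), ("BR_MISP", 320)]
aligned = align(hpc, pt)
# the misprediction at t=150 is attributed to the branch at t=100
```

Binary search makes each lookup \(O(\log n)\) over the PT sequence, which matters for the long traces produced during collection.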

Output. For each \(\delta \)-cycle window and each branch instruction \(b_k\), Specularizer outputs a set of sequences of two-tuples, denoted \(X_{b_k} = [(c_1, t_1), (c_2, t_2), \cdots , (c_n, t_n)]\), where \(t_i\) is the timestamp of the \(i^{th}\) execution of the branch \(b_k\), \(c_i\) indicates whether there is a misprediction and its type (i.e., branch or data) in the \(i^{th}\) execution of this branch, and \(\mu \) is the length of the sequences.

6 Attack Detection

Given the traces produced in the previous section, Specularizer uses an LSTM to extract the temporal information of the traces for attack detection. Specularizer uses four detection models to detect four different attack types: exception-based attacks and misprediction-based attacks exploiting the BTB/PHT, the RSB, and the memory disambiguator, respectively. The four detection models share the same layout: one LSTM layer and one Dense layer. The detection model feeds the traces to the LSTM layer and outputs, from the Dense layer, the likelihood (between 0 and 1) that the trace is an attack.

An end-to-end construction of Specularizer, therefore, works as follows: (1) for every program monitored, both HPC and PT traces are collected and processed; (2) all processed traces for the program are classified by all four models. If one of the models classifies any of the traces as “attack” with a likelihood higher than a threshold \(\alpha \), the program is labeled by Specularizer as performing speculative execution attacks.
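The end-to-end decision rule can be sketched as follows. The stub scoring functions merely stand in for the trained LSTM+Dense models, and the trace values are made up; only the any-model-over-threshold logic reflects the construction described above:

```python
# Sketch of the end-to-end decision rule: a program is flagged if ANY
# of the four per-attack-type models scores ANY of its traces above
# the threshold alpha.
def is_attack(traces_by_type, models, alpha=0.5):
    """traces_by_type: {attack_type: [trace, ...]};
    models: {attack_type: callable(trace) -> likelihood in [0, 1]}."""
    for attack_type, traces in traces_by_type.items():
        model = models[attack_type]
        if any(model(trace) > alpha for trace in traces):
            return True
    return False

# Stub models: score a trace by its fraction of flagged entries
# (NOT the actual learned models, which are LSTM classifiers).
models = {t: (lambda tr: sum(c for c, _ in tr) / max(len(tr), 1))
          for t in ("exception", "btb_pht", "rsb",
                    "memory_disambiguation")}
benign = {"exception": [[(0, 10), (0, 20)]], "btb_pht": [[(0, 5)]],
          "rsb": [], "memory_disambiguation": []}
attack = dict(benign, btb_pht=[[(1, 5), (1, 9)]])
```

Because the four models are disjunctively combined, a single suspicious trace of any type suffices to label the whole program as attacking, which biases the system toward recall.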

7 Attack Variants Generation

To systematically evaluate how accurately Specularizer can detect speculative execution attacks, we produce a data set of attack variants. To do so, we first propose parameterized models for speculative execution attacks and then systematically tune the parameters of these models to generate a set of attack variants.

Fig. 3.
figure 3

Attack models for (a) exception-based attacks and (b) misprediction-based attacks.

7.1 Exception-Based Attack Variants

Modeling Attacks. The attack model of exception-based speculative execution attacks is described in Fig. 3(a), which depicts the timestamps of exceptions that happened at a specific virtual address of the monitored program; each dot on the timeline represents the occurrence of an exception. \(N_a\) is the number of exceptions in a cluster in which any two consecutive exceptions are no more than \(L_a\) CPU cycles apart.

Table 1. Relationship between \(N_a\) and \(p\) in exception-based attacks.

To understand the practical implications of \(N_a\) and \(L_a\), we performed an empirical evaluation of these two parameters using the Proof-of-Concept (PoC) code provided by Canella et al. [11]. We executed each PoC 10,000 times while the system was idle and report the relationship between the minimum \(N_a\) and the success rate (\(p\)) in Table 1. When the system is busy, \(N_a\) increases for the same \(p\); therefore, we only present the data for the idle system in Table 1. As the results show, to achieve \(p \ge 95\%\) with different speculation primitives, \(N_a\) ranges from 1 to 1,300,000.

We also measured the relationship between \(L_a\) and \(p\). When FLUSH+RELOAD is selected as the disclosure primitive, it takes at least 150,000 CPU cycles to finish reloading 255 elements (the minimum for encoding one byte). Therefore, with \(N_a = 100\), we selected \(L_a\) from {150K, 250K, 350K, 450K, 550K, 650K, 750K} cycles. The experimental results suggest that the variation of \(L_a\) has no observable effect on \(p\).

Generating Attack Variants. For each type of speculation primitive (e.g., #PF, #GP, etc.), we generate one attack variant for each of the following 23 value ranges for \(N_a\): {[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 20], [21, 30], [31, 40], [41, 50], [51, 60], [61, 70], [71, 80], [81, 90], [91, 100], [1,000, 10,000], [10,001, 100,000], [100,001, 1,000,000], [1,000,001, 2,000,000]}. In all attack variants, \(L_a\) was chosen from {150K, 250K, 350K, 450K, 550K, 650K, 750K} cycles, and for each variant, \(N_a\) was chosen uniformly at random within the corresponding range.

Therefore, in total \(9\times 7 \times 23=1449\) attack variants were generated. We then created 3 separate data sets from these samples. Specifically, we first selected three thresholds (i.e., \(85\%\), \(90\%\), and \(95\%\)) for the attack success rate \(p\), as attack variants with low \(p\) are of little interest (discussed further in Sect. 8). Second, for each \(p\) and each type of speculation primitive, we determined the minimum \(N_a\) such that attack variants with equal or greater \(N_a\) yield attack success rates larger than the corresponding \(p\) (from Table 1). As a result, the three data sets contain 476, 448, and 399 attack variants, respectively.
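The variant-generation loop for exception-based attacks can be sketched as follows. The primitive names are placeholders (the paper's nine speculation primitives are not enumerated here); the ranges and \(L_a\) values are taken from the text above:

```python
# Sketch of variant generation for exception-based attacks: one
# variant per (primitive, L_a, N_a-range) combination, with N_a drawn
# uniformly from its range. Primitive names are placeholders.
import random

NA_RANGES = ([(i, i) for i in range(1, 11)]            # [1,1]..[10,10]
             + [(i, i + 9) for i in range(11, 92, 10)]  # [11,20]..[91,100]
             + [(1_000, 10_000), (10_001, 100_000),
                (100_001, 1_000_000), (1_000_001, 2_000_000)])
LA_VALUES = [150_000, 250_000, 350_000, 450_000,
             550_000, 650_000, 750_000]
PRIMITIVES = [f"primitive_{i}" for i in range(9)]       # placeholders

def generate_variants(seed=0):
    rng = random.Random(seed)
    variants = []
    for prim in PRIMITIVES:
        for la in LA_VALUES:
            for lo, hi in NA_RANGES:
                variants.append((prim, la, rng.randint(lo, hi)))
    return variants

variants = generate_variants()   # 9 * 7 * 23 = 1449 variants
```

The same loop structure, with different parameter grids, produces the misprediction-based variants of Sects. 7.2 and 7.3.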

Table 2. The success rate of misprediction-based attacks.

7.2 BTB/PHT Misprediction Variants

Modeling Attacks. To perform a successful misprediction-based speculative execution attack against the BTB (e.g., Spectre-BTB) or the PHT (e.g., Spectre-PHT), one needs to train (poison) the prediction unit in a loop multiple times before performing the attack to retrieve one byte of data [23]. This training can be performed either from the same address space or across different address spaces [11]; moreover, the training can be performed either in-place or out-of-place [11]. Our detection target is the process that performs the training, regardless of whether it aims to perform same-address-space/cross-address-space or in-place/out-of-place attacks.

Therefore, the attack model of the different types of misprediction-based speculative execution attacks is described in Fig. 3(b), which depicts the timestamps of branch/data predictions that happened at a specific virtual address of the monitored program; each dot on the timeline represents the occurrence of one prediction. There are four parameters: \(N_a\), the total number of attack attempts; \(L_a\), the time interval (in CPU cycles) between an attack attempt and the next training phase; \(N_t\), the number of training attempts in each training phase; and \(L_t\), the time interval between two consecutive training attempts.

To measure the parameters of the model, we used the PoCs from Kocher et al. [23] and Canella et al. [11]. First, using HPC events, we tested the relationship between the occurrence of branch misprediction when the "attack" is performed and the success rate of the attack. The results show that whenever a branch misprediction occurs, the attack always has a \(100\%\) success rate. This is because the speculation window caused by BTB/PHT misprediction is large enough to load the secret into the micro-architecture [52]. Therefore, an occurrence of a branch misprediction is equivalent to a successful attack.

Next, we evaluated how \(N_t\) and \(L_t\) affect the success rate of triggering branch misprediction. In the experiments described below, \(N_a = 1\) and \(L_a = 150\)K cycles, and the results are shown in the BTB/PHT portion of Table 2. \(L_t\) is measured in CPU cycles (starting from the minimum value of 350) and \(N_t\) enumerates each integer between 1 and 7 (inclusive). Each number in the table is the attack success rate over 10,000 trials. As the table shows, when \(L_t \ge 450\) and \(N_t \ge 2\), \(p\) is greater than \(95\%\).

Finally, we evaluated how \(N_a\) and \(L_a\) affect the attack success rate (\(p\)). In these experiments, we set \(N_t = 1\) and \(L_t = 350\) cycles, because this pair of \(N_t\) and \(L_t\) has the worst \(p\), which is the best scenario for analyzing the effects of \(N_a\) and \(L_a\). The results in Table 2 show that a larger \(N_a\) yields a greater \(p\), while \(L_a\) has very little impact on \(p\): for \(L_a\) between 250K and 550K CPU cycles, \(p\) is greater than \(95\%\) when \(N_a > 30\), and when \(L_a\) is large enough (e.g., \(L_a > 450\)K CPU cycles), \(L_a\) has no observable effect on \(p\).

Generating Attack Variants. For each type of speculation primitive (e.g., BTB sa-ip, PHT ca-ip, etc.), we generate one attack variant for each combination of \(N_t\), \(N_a\), \(L_t\), and \(L_a\). The values of \(N_t\) and \(N_a\) are sampled uniformly at random from the following 14 value ranges: {[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 20], [21, 30], [31, 50], [51, 100]}; the values of \(L_t\) are chosen from {350, 450, 550, 650, 750} CPU cycles; and the values of \(L_a\) are chosen from {150K, 250K, 350K, 450K, 550K} CPU cycles. Therefore, in total \(14 \times 14 \times 5 \times 5 \times 2 \times 4 = 39,200\) attack variants were generated. With a similar approach to that described in Sect. 7.1, we created 3 separate data sets with 37,904, 37,544, and 36,968 attack variants, respectively.

7.3 RSB and STL Misprediction Variants

Modeling Attacks. Spectre-RSB [11] and Spectre-STL [11] exploit the RSB and the memory disambiguator, respectively, to trigger misprediction. In these two attacks, the RSB can be poisoned by push and pop instructions, which are difficult to monitor using HPC and PT, and the memory disambiguator can be triggered simply by load instructions, which requires no training phase. Therefore, we use the model described in Fig. 3(a) to model these attacks.

To measure the impact of the model's parameters on the success rate of the attacks, we used the PoCs released with the published paper [11]. Using HPC events, we tested the relationship of \(p\) with \(N_a\) and \(L_a\), respectively. The tested \(L_a\) and \(N_a\) start from their minimum values, 150K and 1, respectively. The results are shown in Table 2. For Spectre-RSB, \(N_a\) must be greater than 10,000 for \(p\) to exceed \(90\%\); for Spectre-STL, \(N_a\) must be greater than 20,000 to achieve a similar success rate. For both attacks, \(L_a\) does not play a significant role.

Generating Attack Variants. For each of the RSB and the memory disambiguator, we generated one attack variant for each of the following 15 value ranges for \(N_a\): {[1, 10], [11, 100], [101, 1,000], [1,001, 2,000], [2,001, 3,000], [3,001, 4,000], [4,001, 5,000], [5,001, 6,000], [6,001, 7,000], [7,001, 8,000], [8,001, 9,000], [9,001, 10,000], [10,001, 20,000], [20,001, 30,000], [30,001, 50,000]} and each of 5 values for \(L_a\): {150K, 250K, 350K, 450K, 550K}. In each variant, \(N_a\) was chosen uniformly at random at run time within the corresponding value range. Therefore, in total \(15 \times 5 \times 2 = 150\) attack variants were generated. With a similar approach to that described in Sect. 7.1, we created 3 separate data sets with 21, 20, and 14 attack variants, respectively.

8 Evaluation

In this section, we evaluate the detection accuracy and performance of Specularizer. The data sets used in the evaluation were collected as follows: the benign programs are selected from GNU Binutils and the SPEC 2006 benchmark suite, and the attack samples are drawn from the attack variants discussed in Sect. 7. The experiments were conducted on desktops with Intel Core i7-7700 processors and 32 GB of RAM, running 64-bit Ubuntu 16.04.6 LTS with kernel version 5.4.0.

8.1 Evaluation of Specularizer's Parameters

There are a few parameters that can be tuned for Specularizer: ➀ In the collection phase, the PEBS sampling rate (\(\rho \)) determines the accuracy of the branch misprediction records; we collected traces with 4 different \(\rho \) values: 1, 3, 5, and 10. ➁ In the trace processing phase, the window size \(\delta \) and trace length \(\mu \) determine how the collected HPC and PT data are segmented for the LSTM algorithm to work on; we picked two window sizes \(\delta \) (10 million and 100 million CPU cycles) and two trace lengths \(\mu \) (500 and 1000 data points). ➂ The parameter used to select the training data set is the success rate \(p\) of the attack variants, chosen from 85%, 90%, and 95%.

Table 3. Data sets for parameter evaluation.

In this section, we analyze how these parameters affect the detection results. We created 8 data sets, whose parameter configurations are shown in Table 3. Each data set contains four groups of traces; each group is used to evaluate one LSTM model, as specified in Sect. 6. In each group, around 30,000 benign traces and 30,000 attack traces were collected. The traces in each group were then randomly split into a training set (\(80\%\)) and a testing set (\(20\%\)).

The LSTM classifier outputs a class label (“benign” or “attack”) for each trace, together with a likelihood between 0 and 1. We selected a likelihood threshold \(\alpha \) such that Specularizer reports the detection of an “attack” trace when the LSTM classifier outputs “attack” with a likelihood greater than \(\alpha \). Two values were selected for \(\alpha \): 0.5 and 0.75. The accuracy is evaluated using the F1 scores at \(\alpha = 0.5\) and \(\alpha = 0.75\); a high F1 score indicates balanced precision and recall. Here, recall is defined as the percentage of detected attack traces among all attack traces, and precision as the percentage of correctly detected attack traces among all traces flagged as attacks.
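The thresholded detection rule and the precision/recall/F1 definitions above can be expressed directly; a plain-Python sketch (function names are ours):

```python
def detect(likelihoods, alpha=0.5):
    """Flag a trace as 'attack' when the classifier's attack
    likelihood exceeds the threshold alpha."""
    return [p > alpha for p in likelihoods]

def precision_recall_f1(flags, labels):
    """flags: detector output; labels: True for actual attack traces.
    Implements the precision/recall definitions given in the text."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum(l and not f for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, with likelihoods [0.9, 0.6, 0.4, 0.8], labels [attack, attack, benign, benign], and \(\alpha = 0.5\), three traces are flagged, giving precision 2/3, recall 1.0, and F1 0.8.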

PEBS Sampling Rate \(\rho \). We only evaluated \(\rho \) for misprediction-based attacks, because the detection of exception-based attacks does not need HPC. The data sets used in this test are (2), (4), (5), and (6) (as shown in Table 3), where the window size \(\delta \) is 100 million cycles, the trace length \(\mu \) is 1000, and \(p > 95\%\) for the attack variants selected in the training/testing set. The results are shown in Table 4. We see from the table that \(\rho \) only affects the detection of BTB/PHT-based attacks: only when \(\rho \le 3\) do the F1 scores indicate good detection accuracy (greater than \(90\%\)). In contrast, regardless of the \(\rho \) value, the detection accuracy for RSB- and STL-based attacks is high. This is because the branch misprediction records lost at larger \(\rho \) values are more critical to detecting attacks that require training of the branch predictor.

Window Size \(\delta \). To evaluate the effect of \(\delta \), we used data sets (1) and (2), where for both data sets \(\rho =1\), \(p > 95\%\), and \(\mu =1000\). For each window size \(\delta \), we evaluated the F1 score at thresholds 0.5 and 0.75; the results, shown in Table 5, suggest that \(\delta \) does not have a strong impact on the detection accuracy.

Table 4. Impact of PEBS sample rate \(\rho \).
Table 5. Impact of window size \(\delta \), trace length \(\mu \) and threshold of attack success rate \(p\).

Trace Length \(\mu \). This evaluation used data sets (2) and (3), with \(\rho =1\), \(p > 95\%\), and \(\delta =100\)M cycles. The results, presented in Table 5, show that the choice of \(\mu =500\) or \(\mu =1000\) does not dramatically affect the detection accuracy.

Success Rate Threshold \(p\). The data sets used in this evaluation are (2), (7), and (8), with \(\rho =1\), \(\mu =1000\), and \(\delta =100\)M cycles. The results shown in Table 5 suggest that \(p\) does not have much impact on the detection accuracy.

Classification Likelihood Threshold \(\alpha \). The results shown in Table 4 and Table 5 suggest that the choice of \(\alpha = 0.5\) or \(\alpha = 0.75\) does not affect the F1 score. Thus, we chose \(\alpha = 0.5\) for the remaining evaluations.


8.2 Evaluation of Detection Accuracy

We evaluated the detection accuracy of the LSTM models trained using data set (4) in Table 3. With these parameters, the following experiments evaluate the models' capability of detecting various attack variants. Because the precision of the attack classification is close to \(100\%\) in all experiments, the F1 score provides little information beyond recall, i.e., the true positive rate (TPR). Therefore, we use the TPR, defined as the percentage of attack traces that are correctly classified as attacks among all attack traces, as the metric for detection accuracy. The results are presented in Fig. 4, where the blue line is the TPR and the red line is the attack success rate \(p\). In the cases where TPR \(> p\), the probability of detecting the attack is higher than the probability of the secret being leaked.

Exception-Based Attacks. We collected 11,700 traces from all types of exception-based variants we generated and split them into separate groups according to their \(N_a\) values. We then performed classification on each group; the results are shown in Fig. 4(a). In this figure, the X-axis is the value of \(N_a\), the red line is the attack success rate \(p\), and the blue line is the TPR. When \(N_a =4\), TPR \(=99.1\%\); when \(N_a >10\), TPR \(\ge 99.9\%\); but when \(N_a \le 3\), the TPR drops to 0, which means we were not able to detect exception-based attacks with fewer than 4 attempts within a time window of 100 million CPU cycles.

Misprediction-Based Attacks on BTB/PHT. We collected 980,000 traces from the BTB/PHT attack variants. To evaluate Specularizer with varying \(N_t\), \(N_a\), \(L_t\), and \(L_a\) values, we split the traces accordingly. The results are shown in Fig. 4(b), Fig. 4(d), Fig. 4(c), and Fig. 4(e), respectively. As these figures show, Specularizer can detect attack variants with \(N_t \ge 2\), \(N_a \ge 3\), \(350\le \) \(L_t\) \(\le 750\) CPU cycles, and 150K \(\le \) \(L_a\) \(\le 550\)K CPU cycles with TPR \(\ge 90\%\).

Misprediction-Based Attacks on RSB and Memory Disambiguator. We collected 7,500 traces from attack variants exploiting the RSB and the memory disambiguator. To evaluate Specularizer with varying \(L_a\) and \(N_a\) values, we split the traces accordingly. The results are presented in Fig. 4(g), Fig. 4(i), Fig. 4(f), and Fig. 4(h), respectively. As these figures show, Specularizer can detect attack variants with 150K \(\le \) \(L_a\) \(\le 550\)K CPU cycles with TPR \(> 80\%\). The TPR increases almost monotonically with \(N_a\); Specularizer can detect attack variants with \(N_a\) \(>3000\) with TPR \(>80\%\). It is worth noting that whenever TPR \(< 80\%\) for either attack, the success rate of the attack drops below \(30\%\), which suggests that the adversary needs to balance attack efficiency against the risk of detection.

Fig. 4. Accuracy of attack detection (recall). (a) \(N_a\) in exception-based attacks, (b) \(N_t\) in BTB/PHT misprediction attacks, (c) \(L_t\) in BTB/PHT misprediction attacks, (d) \(N_a\) in BTB/PHT misprediction attacks, (e) \(L_a\) in BTB/PHT misprediction attacks, (f) \(N_a\) in RSB misprediction attacks, (g) \(L_a\) in RSB misprediction attacks, (h) \(N_a\) in STL attacks, (i) \(L_a\) in STL attacks. (Color figure online)

8.3 End-to-End Evaluation

In practice, Specularizer monitors the execution of a program and raises an alarm if any of the traces collected from the program is classified as an “attack”. For the end-to-end evaluation, we use the same models trained using data set (4).

The data set used here contains 26 benign programs collected from GNU Binutils and the SPEC CPU 2006 benchmark suite, and 160 randomly selected attack variants from those we generated (i.e., 40 variants for each attack type). Each of the 186 programs was examined using all four LSTM models. Among the 26 benign programs, only one (gobmk) was falsely classified as a BTB/PHT misprediction attack, and four benign programs (i.e., ld, perlbench, soplex, and gobmk) were misclassified as exception-based attacks. However, in all these misclassified cases, fewer than 3 traces (out of over 1000 traces) extracted from each program were actually misclassified, which means these false detections could be prevented if Specularizer raised an alarm only when multiple traces (e.g., > 3) were classified as attacks; this threshold could be another parameter for the user of Specularizer to tune. Meanwhile, all attack variants were successfully detected by their corresponding LSTM classifiers. The BTB/PHT classifier also detected 117 out of the 120 other attack variants, because these attack variants also exhibit this type of branch misprediction.
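The program-level alarm rule suggested above (raise an alarm only when more than a handful of traces are flagged) might be sketched as follows; the threshold name is our invention, since the text leaves it as a user-tunable parameter.

```python
def raise_alarm(per_trace_flags, min_attack_traces=3):
    """Program-level decision: alarm only when strictly more than
    `min_attack_traces` of the program's traces were classified as
    attacks, suppressing the handful of per-trace false positives
    observed on benign programs. Illustrative sketch only."""
    return sum(per_trace_flags) > min_attack_traces
```

Under this rule, a benign program with 2 misclassified traces out of 1000 raises no alarm, while an attack with many flagged traces still does.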

8.4 Performance Analysis

Overhead of Online Trace Collection. In our experiments, Specularizer enabled trace collection of both HPC and PT, with the HPC events and MSR configurations specified in Sect. 4; \(\rho \) was set to 3. Experiments on LMbench [29] show that the runtime overhead on I/O is negligible. The results on the SPEC benchmarks are shown in Fig. 5. The runtime overhead ranged from \(0.038\%\) to \(231.42\%\), with a geometric mean of \(14.36\%\). Some of the benchmark programs (e.g., mcf, gobmk, and sjeng) had high performance overhead because their execution triggers a large number of branch mispredictions. We note that the overhead could be reduced with Intel's new feature that redirects PEBS's sampling output into PT packets [7], as PT packet generation introduces much less overhead [16]. We leave this evaluation to future work.
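For reference, a geometric mean of per-benchmark overheads can be computed over the slowdown factors \(1 + \text{overhead}\), as is conventional for SPEC-style summaries. We do not know the exact averaging the authors used, so this sketch is an assumption.

```python
import math

def geometric_mean_overhead(overheads_percent):
    """Geometric mean of per-benchmark slowdowns: convert each overhead
    percentage to a slowdown factor (1 + o/100), take the geometric mean
    of the factors, and convert back to a percentage. Illustrative only."""
    factors = [1 + o / 100 for o in overheads_percent]
    mean = math.exp(sum(math.log(f) for f in factors) / len(factors))
    return (mean - 1) * 100
```

For example, two benchmarks at 100% and 0% overhead yield a geometric mean overhead of about 41.4%, not the arithmetic 50%.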

Fig. 5. The performance overhead of online trace collection.

Fig. 6. Running time of offline attack detection.

Running Time of Offline Attack Detection. Figure 6 shows the running time of offline attack detection. The number above each bar is the average running time (over 1000 trials) of the offline analysis for each SPEC benchmark (in seconds), which ranges from 3 s to 1709 s, with PT trace files between 13 MB and 13 GB. More specifically, the offline analysis comprises three phases: trace loading, trace processing, and attack detection, which on average account for \(70.01\%\), \(29.85\%\), and \(0.14\%\) of the entire running time, respectively. One reason for the long trace loading/processing time (\(99.86\%\) combined) is that PT generates a large number of packets, which take a long time to parse and analyze; the attack detection phase itself typically takes less than 1 s. Finally, it is worth noting that our offline analyses were performed in a single thread with limited memory, and could be further optimized using multi-threading and larger memory. Moreover, for applications such as forensics, the overhead of offline analysis is not critical.

9 Discussion

Detecting Assistance-Based Attacks. Microcode assists were exploited in some recent works [31, 37, 42, 43]. However, there is no systematic study of these microcode assists yet: it is unclear how many methods can trigger microcode assists and how many of them can be exploited in speculative execution attacks by unprivileged programs. Without such systematic exploration, an ad-hoc detection technique is likely to be bypassed. We leave the detection of assistance-based attacks to future work.

Completeness of the Attack Data Sets. We can hardly claim that our generated attack data sets cover all possible attack variants. However, as the models used for attack-variant generation only specify the patterns of misprediction and exception, they are general enough to model attacks that use different types of disclosure gadgets and disclosure primitives. Moreover, the parameters of the attack models can be tuned to alter specific properties of an attack variant, which in combination can approximate most attack methods one could think of.

Using Simpler Classification Models for Attack Detection. One might think that deep-learning algorithms like LSTM are too heavyweight for our scenario. In fact, we also tested multiple alternatives, such as decision trees, K-means, and random forests. However, we found those models too fragile for practical use. In contrast, LSTM offers automated selection of parameters and thresholds, greatly reducing the subjectivity in the selection of classification models.

Adversarial Machine Learning (AML). Specularizer is vulnerable to AML-based techniques that generate carefully crafted attack variants to evade detection. As shown in Fig. 4, in general, attack code that evades detection is likely to have a lower success rate. In that sense, Specularizer makes speculative execution attacks harder to perform, but may not eliminate the threats. However, we note this arms race is common in all machine-learning-based defense systems [14, 32].

Real-Time Attack Detection. Ideally, attack detection should be performed in real time and for all programs. However, as parsing PT packets and processing the traces are time-consuming (as shown in Fig. 6), doing so is very challenging in practice. Moreover, enabling whole-system monitoring with PT would drastically increase the overhead of trace parsing and analysis. These are common issues for PT/HPC-based monitoring systems [39, 51].

10 Related Work

Detecting Speculative Execution Attacks. Prior works on detecting speculative execution attacks mainly focus on the detection of disclosure primitives, such as Flush+Reload cache side channels [19]. In contrast, Specularizer detects speculative execution attacks by monitoring their root cause: the speculation primitives. Closest to our work are [25, 44], which also leverage HPCs to detect speculative execution attacks. However, as their approaches only use HPCs, they omit the context of program execution in the detection of attacks; as a result, they are less accurate and only applicable to simple proof-of-concept attacks.

Mitigating Speculative Execution Attacks. Software solutions provide temporary mitigations of the threats; they are reactive to known attacks and ad hoc. For instance, page-table isolation (e.g., KPTI in Linux), PTE inversion, and L1d flush [5], as well as compiler-based mitigations [12, 23, 30, 40], provide solutions for exception-based and misprediction-based speculative execution attacks. SPECCFI [24] and ConTExT [36] each mitigate a specific type of speculative execution attack. Furthermore, many works focus on detecting the code gadgets of speculative execution attacks [13, 18, 20, 28, 33, 47].

Proposals from the computer architecture research community mitigate speculative execution attacks with more fundamental revisions at the micro-architectural level [2, 4, 6, 9, 10, 15, 21, 22, 26, 34, 35, 38, 43, 46, 48, 49, 50]. While these approaches may be effective in addressing the targeted problems, it may take a long time before these academic proposals are adopted by industry.

11 Conclusion

In this paper, we presented Specularizer, a software tool for uncovering speculative execution attacks using hardware performance tracing features (PT and HPCs). Specularizer monitors the execution of the inspected applications online, introducing modest runtime performance overhead, and then performs attack detection in an offline analysis using LSTM networks. Our empirical evaluation suggests that the proposed approach achieves high detection accuracy with reasonable overhead, making it particularly suitable for offline forensic analysis.