RV-TEE: secure cryptographic protocol execution based on runtime verification

Analytical security of cryptographic protocols does not immediately translate to operational security, due to incorrect implementation and attacks targeting the execution environment. Code verification and hardware-based trusted execution solutions exist; however, these leave it up to the implementer to assemble the complete solution, imposing a complete re-think of the hardware platforms and the software development process. We instead aim for a comprehensive solution for secure cryptographic protocol execution, which takes the form of a trusted execution environment based on runtime verification and stock hardware security modules. RV-TEE can be deployed on existing platforms and protocol implementations. Runtime verification lends itself well to several conceptual levels of the execution environment, ranging from high-level protocol properties to lower-level checks such as taint inference. The proposed architectural setup, involving two runtime verification modules, is instantiated through a case study using a popular web browser. We successfully monitor high- and low-level properties with promising results with respect to practicality.


Introduction
It is standard cryptographic practice to establish provable security guarantees in a suitable theoretical model, abstracting from implementation details. However, the security of any cryptographic system needs to be holistic: over and above being theoretically secure and implemented in a secure way, the operation of a protocol also needs to be secured. While there exists a lot of research on the theory and general implementation aspects of cryptographic systems, their long-term operational security, albeit heavily studied, is not so well established. Evidence of undesirable consequences stemming from this state of affairs is unfortunately all too frequent, with several high-profile incidents making the information security news in recent years. Insecure execution ranges from improper implementation of specific protocol steps to more generic insecure programming practices. While the notorious Heartbleed OpenSSL vulnerability [57], for example, was caused by a memory corruption bug in its C source code, the timing attacks on OpenSSL's underpinning ciphers [8] are examples of how design security can be broken in implementation. Similarly, the attack on Bluetooth Smart [74] was related to the complexity of implementing elliptic curve cryptography securely. Even once programming hurdles are addressed, issues arising at the platform level are a stark reminder that secure execution of cryptographic protocols is a hard problem. The problem of insufficient physical randomness during certificate generation [32] is amplified when large-scale generation for millions of IoT devices is carried out. Operating system features can be misused by malware campaigns, e.g., TrickBot [36], to inject code into web browsers and steal all their cryptographic secrets. Even when these attack vectors are closed down, secure protocol execution can still be undermined by hardware side-channels, with Meltdown and Spectre [35,40] shaking up the systems security landscape in the last two years.
A runtime monitor [16] can check the actual physical leakage of the system (in a selected model), verify formal conditions on inputs and outputs of primitive algorithms, as well as detect and prevent unusual use of the system (such as too many executions in some time window). The runtime verification approach can thus provide heuristic tools that can strengthen the implementation against existing, but also against as yet unknown (future) attacks of various types. A standalone monitor can also be more easily changed or upgraded than a complex cryptographic protocol. Furthermore, different types of monitors (of varying cost) can be employed according to expected security risks posed by the environment.
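One of the checks mentioned above, detecting too many executions in some time window, can be sketched as a small standalone monitor. The following Python sketch is our own illustration, with hypothetical names and thresholds, and is not taken from any concrete RV tool:

```python
from collections import deque

class RateMonitor:
    """Flags an anomaly when more than max_ops operations are
    observed within any sliding window of `window` seconds."""

    def __init__(self, max_ops, window):
        self.max_ops = max_ops
        self.window = window
        self.events = deque()  # timestamps of observed operations

    def observe(self, timestamp):
        self.events.append(timestamp)
        # Drop events that fell out of the sliding window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        # Verdict: True means the property still holds.
        return len(self.events) <= self.max_ops

# Hypothetical policy: at most 3 key-generation calls per 10 seconds.
m = RateMonitor(max_ops=3, window=10.0)
verdicts = [m.observe(t) for t in [0.0, 1.0, 2.0, 3.0, 20.0]]
```

A real deployment would feed `observe` from instrumentation hooks rather than a fixed list of timestamps; the point is that such a monitor is self-contained and easy to replace or retune.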
In this paper we propose RV-TEE, a comprehensive solution based on runtime verification (RV) at different levels of the implementation: from the low-level bugs and attacks, to data leaks, up to implementation issues at the protocol level. The end result is a Trusted Execution Environment (TEE) that is able to isolate security-critical code from potentially malware-compromised, untrusted, code. We propose that as an alternative to switching to specialized TEE hardware, the same secure execution environment can be provided through the use of hardware security modules (HSM), that extend existing stock hardware. RV's role is two-fold: It firstly provides the all-important runtime service of verifying correct protocol implementation, ensuring that design-level security properties are not broken. Secondly, it fulfills the role of a secure monitor that scrutinizes data flows crossing the TEE's trust boundaries. Overall we make the following contributions: -We show how RV in conjunction with HSM can be used to securely execute cryptographic protocols, both in terms of correct implementation as well as resilience to malware infection. Most importantly our approach only requires extending, rather than replacing, existing stock hardware. -We demonstrate the feasibility of our approach on realworld web browser code, both in terms of monitoring the correct execution of a third party ECDHE protocol implementation, as well as practical execution overheads. -We also present quantitative results regarding the use of RV for taint inference in combination with the SeCube hardware security module. The aim of this experiment to demonstrate RV-TEE's effectiveness in securely executing cryptographic primitives and in detecting data that might be attemptedly being exfiltrated outside the trust boundary. For this purpose we simulate a real-world banking trojan attack. This contribution is novel and has not appeared in our workshop paper [18].
This paper is organized as follows: Section 2 presents existing RV and hardware-based methods to complement models for theoretical protocol security, while Sect. 3 describes RV-TEE, our comprehensive approach for protocol operational security. Sections 4 and 5 present results obtained from a feasibility study on the Firefox web browser; the sections present high level and low level RV setups respectively. Section 6 presents additional results with respect to HSM employment and a banking trojan case study. Section 7 concludes by presenting a way forward as guided by this initial exploration.

Background and related work
Cryptographic protocols are designed to withstand a broad range of adversarial strategies. Standard practice is to rely on formal security models, defined in a dedicated way for a specific cryptographic task at hand (e.g., public-key encryption, pseudo-random generation, signing, 2-party key establishment, etc.), and succinct definitions are given making explicit the exact scenario in which a security proof (or reduction) is meaningful. In the case of key establishment, significant work has been done for over twenty years in the direction of dedicated security models (see [45] for a comprehensive overview).
Subsequent work has focused on specific scenarios (e.g., attribute-based, see [71]) or advanced security goals (e.g. considering malicious insiders [11], aiming at strong security [77], achieving so-called key compromise impersonation resilience [25], etc.). Many of the attack strategies considered in the latter may actually be deployed on the implementation at runtime.
While having formal models to prove security protocols safe is a crucial first step, there are several things which may still go wrong in the implementation at runtime: To start with, the implementation might not be faithful to the proven design. Secondly, the implementation involves details which go beyond the design -these may all pose problems at runtime, ranging from low-level hardware issues, to side-channel attack vulnerabilities, to insecure execution contexts resulting from general-purpose operating system features that are prone to malware abuse.
To reason about the various kinds of security threats and how we deal with them through our proposal, we loosely classify them under four levels:
High level These are logical bugs causing the protocol implementation to deviate from the (typically theoretically verified) design.
Medium level At this level, we include malware attacks: The protocol implementation might seem to follow its design and yet such attacks might nonetheless manage to reach their target, e.g., to exfiltrate data, by attacking the execution runtime rather than the protocol's implementation per se.
Low level We classify under this heading threats originating from programming bugs, e.g., causing secret information to be deducible from the outside, or resulting in undefined behavior such as arithmetic overflows, undefined downcasts, and invalid pointer references.
Hardware level Finally, hardware can pose a threat if the manufacturer cannot be trusted or due to its susceptibility to side-channel attacks.

Runtime verification
Runtime verification (RV) [15,37] involves the observation of a software system -usually through some form of instrumentation -to assert whether the specification is being adhered to. There are several levels at which this can be done: from the hardware level to the highest-level logic, from module-level specifications to system-wide properties, and from point assertions to temporal properties. In all cases, the advantage of applying RV techniques is twofold: On the one hand, monitors are typically automatically synthesized from formal notation to reduce the possibility of introducing bugs, and on the other hand, monitoring concerns are kept separate (at least on a logical level) from the observed system.
The novelty of this paper complements existing work in applying RV to the security domain, specifically by providing a comprehensive solution for implementation security of cryptographic protocols, comprising: i) verification of correct protocol implementation; and ii) an RV-enabled Trusted Execution Environment (TEE) requiring minimal hardware. In what follows we loosely classify existing RV works on security protocols according to the threat level they address.
High level At the highest level of abstraction, a number of approaches [4,68,69,81] check for properties directly derived from the protocol design (which would have been checked through the security model). This approach ensures that even though the protocol would have been theoretically verified, the implementation does not diverge from the intended behavior due to bugs or attacks.
An example of a temporal property in this category, taken from TLS protocol verification [4], is: before any data is sent by the client, the server hash is verified to match the client's version. This can be expressed in several formalisms. The one chosen in this case is LTL [56], a commonly used specification language in the RV community.
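The flavour of such a property can be conveyed as a small monitor automaton. The Python sketch below is our own paraphrase with illustrative event names; the cited work expresses the property in LTL rather than in code:

```python
# Monitor for the (paraphrased) TLS property: "before any data is
# sent by the client, the server hash is verified to match the
# client's version". Event names are illustrative assumptions.
class HashBeforeSendMonitor:
    def __init__(self):
        self.hash_verified = False
        self.state = "ok"

    def step(self, event):
        if self.state == "violation":
            return self.state          # violations are irrecoverable
        if event == "hash_verified":
            self.hash_verified = True
        elif event == "client_send" and not self.hash_verified:
            self.state = "violation"   # data sent before verification
        return self.state

good = HashBeforeSendMonitor()
results = [good.step(e) for e in
           ["handshake_start", "hash_verified", "client_send"]]

bad = HashBeforeSendMonitor()
bad_results = [bad.step(e) for e in ["client_send"]]
```

In an RV setting the automaton would typically be synthesized from the LTL formula rather than hand-written as here.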
A second example (from [81]) is non-temporal, instead focusing on ensuring that data does not leak to unintended recipients: If the operation is of type "Send", then the message receiver ID must be in the set of approved receiver IDs. In this case the property is expressed in Copilot, an established RV framework comprising a stream-based dataflow language.
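A stream-style rendering of this property can be sketched as follows. The field names and the approved set are assumptions of ours; the cited work uses Copilot's dataflow language, not Python:

```python
# For each operation in the stream: if it is of type "Send" then the
# receiver ID must be in the set of approved receiver IDs.
APPROVED = {"alice", "bob"}   # hypothetical approved receiver IDs

def check_stream(ops, approved=APPROVED):
    """One boolean verdict per operation in the input stream."""
    return [op["type"] != "Send" or op["receiver"] in approved
            for op in ops]

ops = [
    {"type": "Send", "receiver": "alice"},    # approved receiver
    {"type": "Recv", "receiver": "mallory"},  # not a Send: holds vacuously
    {"type": "Send", "receiver": "mallory"},  # violation
]
verdicts = check_stream(ops)
```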
Other specification formalisms used are timed regular expressions [68] for dealing with real-time considerations, state machines [69] when modeling the temporal ordering of events suffices, and signal temporal logic when dealing with signals [68].
Low level At a low level, RV techniques based on information flow can be used to check software elements which are not specific to protocol implementations. Rather, such checks would be useful in the context of any application where security is paramount. For example, Signoles et al. [70] provide a platform for C programs, Frama-C, which can automatically check for a wide range of memory corruption vulnerabilities such as arithmetic overflows, undefined downcasts, and invalid pointer references. At this level, we also include Secure Flow [3] (a library within Frama-C) which protects against control-flow based timing attacks by monitoring information flow labels for all values of interest.

Trusted execution environments (TEE) and hardware security modules (HSM)
Besides typical RV use as outlined above (corresponding to the high level concerns), we propose leveraging RV for the provision of a trusted execution environment (TEE) to cover the medium level. The provision of a TEE is the ultimate objective whenever executing security-critical tasks [61], such as cryptographic protocol steps. Trusted computing finds its origin in trusted platform modules (TPM), which comprise tamper-evident hardware security modules (HSM) [72]. However, TPM constitute just one component of a complete TEE solution, as depicted in Fig. 1. In fact, the cornerstone of a TEE lies in the isolated execution of critical code segments in such a way that they become unreachable by malware infections of the non-trusted operating system and application code. TPM are entrusted with booting an operating system (OS) environment that is segmented into non-trusted and trusted domains, ensuring the integrity of the boot process while at the same time protecting the cryptographic keys upon which all integrity guarantees rely. The non-trusted domain corresponds to a typical OS that fundamentally provides security through CPU ring privileges. However, the presence of software and hardware bugs, along with inherently insecure OS features, renders malware infections possible at both the user and kernel levels. The crucial role of the TEE comes into play when, despite an eventual infection, malware is not able to interfere with security-critical code executing inside the trusted domain. Complete isolation is key, encompassing the CPU, physical memory, secondary storage and even expansion buses. Code provisioning to the trusted domain, as well as data flows between the two domains, must be fully controlled in order to fend off malware propagation through trojan updates or software vulnerability exploits. These two requirements can be satisfied through TPM employment and a secure monitor that inspects all data flows crossing the trust domain boundary.
A number of TEE extensions to CPUs (CPU-TEE) have already reached industry-level maturity. Intel's SGX [47] and AMD's SVM [30] technologies are primary examples. These constitute hardware extensions allowing an operating system to fully suspend itself, including interrupt handlers and all the code executing on other cores, in order to execute the trusted domain code within a code enclave. Another widespread example is ARM's TrustZone [55], which provides a CPU-TEE for mobile device platforms. TrustZone implements the trusted domain as a special secure CPU mode which, when entered from normal mode, is completely hidden from the untrusted operating system, therefore allowing particular security functions and cryptographic keys to be accessible only in secure mode. The Android keystore [19] is the most common functionality that makes use of this mode.
Several other ideas also originate from academia, such as the suggestion to leverage existing hardware virtualization extensions to implement a TEE without having to resort to further specialized hardware [46]. Other work focuses on providing practical solutions to port existing applications to a CPU-TEE. For example, Haven [5] makes use of a library that exposes a subset of the Windows API inside an Intel SGX enclave, enabling legacy applications to execute inside a CPU-TEE completely unmodified. While this approach may come across as too bloated for secure enclave execution, recent work [76] showed that such bloating concerns are exaggerated. VC3 [63] offers a secure map-reduce cloud solution, also running on SGX, where the map/reduce code is submitted to the cloud service provider in an encrypted form and only gets decrypted and executed once inside the enclave. Another challenge with cloud computing is assuring that virtual machines (VMs) are not tampered with by malicious cloud service operators or tenants. Solutions such as CloudVisor [80] show that in such cases a TPM suffices to secure the booting process of guest VMs.
Despite all these efforts, it is important to note that CPU-TEEs are not attack-proof, since practical threats targeting all the aforementioned hardware have already been demonstrated [34,62,66,79]. More importantly, when considering the adoption of CPU-TEE platforms for secure AGKE execution, there is the major stumbling block of having to either make use of special hardware, with consequent OS modification requirements, or else execute unmodified OS code on top of a TEE-enabling hypervisor. Moreover, in all cases, the trusted code would have to execute without the support of an underlying operating system, complicating the development process of trusted code.
The common denominator among all existing TEE platforms is the need for cryptographic protocol code to execute on special hardware. In contrast, we propose to achieve a similar level of assurance by combining RV with any hardware security module (HSM) of choice, ranging from high-bandwidth network cards with hardware-accelerated encryption [73] down to smaller on-board micro-controllers and/or smartcards used in resource-constrained devices [10,21]. Ultimately, even a CPU-TEE [30,47,55] can be used if deemed suitable. Compatibility-wise, if the design of the software to be secured already supports HSMs, e.g. via PKCS#11, deployment even comes close to 'plug-and-play'. Finally, the adopted HSM's level of protection with respect to tampering and resistance to side-channel attacks is carried forward to RV-TEE.

Practical binary instrumentation
Binary instrumentation provides the remaining primitive necessary to make RV work alongside the TEE. Specifically, RV monitors must be able to track the process memory content of the protocol execution to be secured. The most suitable type of instrumentation at the binary level is Dynamic Binary Instrumentation (DBI). Overall, DBI is a widely-adopted technique in the domain of software security, including the availability of widely used frameworks (e.g., Frida (https://www.frida.re/) and libdft (https://github.com/vusec/vuzzer/tree/master/support/libdft)) that simplify tool development in a programming language-agnostic manner.
Addressing high level concerns, binary instrumentation is applied at the level of function call tracing. By leveraging various runtime structures that support program execution, e.g. import/export tables, as well as dynamic binary rewriting, various practical applications can be attained while avoiding the overheads associated with continuous stack frame creation and restoration. Such applications include malware sandboxes [78], end-point security monitors [53], cloud security monitoring [28], and patented application sandboxes [24].
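The essence of function call tracing can be illustrated with a userspace analogue: wrapping a function so that every call is reported to a monitor before being forwarded. This Python sketch is illustrative only; real DBI frameworks such as Frida hook compiled code via import tables and binary rewriting rather than wrapping Python functions, and the function name below is a hypothetical stand-in:

```python
import functools

def hook(fn, monitor):
    """Wrap fn so each call is recorded by the monitor and then
    forwarded unchanged -- the call-tracing idea in miniature."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        monitor.append((fn.__name__, args))  # report the event
        return fn(*args, **kwargs)
    return wrapper

trace = []

def ec_validate_public_key(point):
    # Stand-in for a hooked library routine (not a real API).
    return point != "invalid"

ec_validate_public_key = hook(ec_validate_public_key, trace)
result = ec_validate_public_key("04ab...")
```

The monitored program is untouched apart from the indirection at the call site, which is what keeps overheads low for this style of instrumentation.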
Tracking of information flows presents the medium level option, with dynamic taint analysis (DTA) being the predominant technique. DTA determines which data flows are to be considered tainted due to their suspicious provenance, e.g., an input system call, and performs a number of checks on them before they are passed onwards to sensitive sinks, e.g., output system calls or dynamically-created commands such as SQL or shell commands. Applications that rely on this technique are still highly experimental but carry sought-after potential to detect complex memory errors [14], protect from mobile malware [58], enable Advanced Persistent Threat (APT) attack detection and investigation [42], and provide data privacy assurance on the cloud [52], just to name a few.
The main limitation is presented by impractical overheads [29]. At its core, taint analysis requires the computation of a shadow state that identifies which data flows become tainted, propagates taint to other data objects, and determines at which point these objects should become untainted [64]. The shadow state itself presents memory overhead concerns, while its per-statement computations carry execution overheads. Moreover, at the binary level, since the high level semantics of the source code are lost, the situation with runtime overheads reverses as compared to function call tracing. Aggressive optimization techniques, revolving around efficient shadow state look-ups, avoidance of stack frame creation and register spilling, and identification of redundant flows through static analysis and intermittent tracking [13,29,31], have demonstrated the possibility of bringing overheads closer to those of compile-time taint analysis. However, even in this case slowdowns ranging between 1.5× and 3× are still considered prohibitive for on-line scenarios, besides forgoing the programming language independence and intermittent monitoring offered by the binary-level approach. One solution for practical DTA concerns inferring, rather than tracking, taint [67]. Taint inference takes a black-box approach to DTA, trading off accuracy for efficiency. This method only tracks data flows at sources/sinks and then applies approximate matching in order to decide whether tainted data has propagated all the way in-between. With slowdowns averaging only 0.035× for fully-fledged web applications, this approach seems promising. Furthermore, its binary-level implementation can leverage the same aforementioned techniques proposed for function tracing.
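The approximate-matching idea behind taint inference can be sketched in a few lines. The following is our own drastic simplification of the approach in [67], not Sekar's actual algorithm: a sink is deemed tainted if the data observed there shares a sufficiently long substring with data observed at a source.

```python
def longest_common_substring(a, b):
    """Length of the longest common substring, via dynamic
    programming in O(len(a) * len(b)) time."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def sink_is_tainted(source_data, sink_data, min_match=8):
    """Infer taint: the sink likely carries source data if the two
    share a common substring of at least min_match characters.
    The threshold is a hypothetical tuning parameter."""
    return longest_common_substring(source_data, sink_data) >= min_match

# Hypothetical flows: a secret seen at a source later appears,
# embedded, in data written at a sink.
secret = "4111-1111-1111-1111"
outbound = "POST /exfil c=4111-1111-1111-1111&x=1"
benign = "GET /index.html"
tainted = sink_is_tainted(secret, outbound)
clean = sink_is_tainted(secret, benign)
```

Because only source and sink events are inspected, the per-statement shadow-state bookkeeping of full DTA disappears, which is where the efficiency gain comes from, at the cost of possible false matches.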
In certain cases, DBI may have to be complemented with its static alternative: Static Binary Instrumentation (SBI). SBI concerns modifying the executable file directly on disk, before it is loaded into memory for execution. In general, SBI is complicated by the lack of execution context, and therefore of knowledge of the original program, available to the instrumenter. In our case, SBI is planned solely as an additional option for the instrumentation code injection step. Security-conscious applications nowadays implement increased security measures that may prohibit dynamic code injection, forcing instrumentation to occur statically through executable-header data structure manipulation [7].

Information-stealing malware
The kind of malware we consider for the medium threat level gets injected into victim processes and subsequently exfiltrates credentials or any other security-sensitive information. Once injected, malware defeats any kind of cryptography without having to break its mechanisms per se. Rather, since cryptographic schemes assume the secrecy of secret/private keys, information-stealing malware undermines this core assumption through process injection. The injection process itself may leverage overt OS features typically employed by debuggers, e.g., OpenProcess on Windows [33] and ptrace on Linux [20].
More likely, in order to remain undetected by antimalware solutions, lesser known or even undocumented OS features are exploited instead. These are tucked beneath openly available inter-process communication mechanisms. On Windows, the NtQueueApcThread, NtMapViewOfSection and GlobalAddAtom internal system functions have been widely abused [33]. On Linux, tampering with the data structures associated with the implementation of the POSIX signal call has been shown to provide a similar attack vector [20]. Mobile OSes, whilst relying on the more restricted execution environments presented by locked-down devices, are still prone to similar attacks [41].
Threat intelligence reports categorize malware with information-stealing characteristics under the following three headings: i) memory scraping malware; ii) credentials dumping malware; and iii) banking trojans. Primarily found in point-of-sale (PoS) terminals, memory scraping malware aims to steal sensitive data directly from PoS terminal memory, e.g., plaintext card details, through regular expression-based signatures, subsequently harvesting them for card cloning purposes or similar abuse [27,60]. FighterPOS [75] and GlitchPOS [48] are two notorious examples of this type of malware. On the other hand, credentials dumping malware is the PC version of PoS malware, with web browsers presenting common targets [43]. Actually, the target range is much wider, with any process that retains passwords, hashes or credentials of any form, e.g., session tickets, in memory presenting a potential target [51,54]. Notable examples include CStealer [1] and KPOT Stealer [65].
Finally, banking trojans are mass information-stealing malware, typically also doubling as fully-fledged botnets, reacting to commands broadcast over command and control (C2) channels [9,22]. Zeus was one of the earliest banking trojans to rise to notoriety, followed by variants such as Citadel and Gameover Zeus, as well as other separate families including Dridex, Ursnif, Trickbot and Qakbot, which were still infecting machines very recently [44]. They tend to share advanced functionality, namely: client-side web page content injection (webinjects), key-logging, connect-back functionality (stealthy back-dooring), and obfuscated command and control (C2) channels.
Whilst an HSM can help to protect secret cryptographic keys through isolated execution, a complete TEE would be required for comprehensive protection at all threat levels. For example, in case encryption/decryption is delegated to an HSM, any injected malware could still gain access to the plaintext (personal data, credit card data, etc.). Similarly, any injected code could invoke a private key-based operation, e.g. to complete certificate-based authentication, without ever having to actually disclose the HSM-protected key. Figure 2 shows RV-TEE's architecture superimposed on the generic TEE blueprint, as illustrated earlier in Fig. 1. This setup is not tied to specific security hardware, nor does it require any OS modifications. It also mitigates threats related to hardware level issues, including side-channel attacks on ciphers, while keeping runtime overheads to a minimum.

RV-TEE: an RV-centric TEE
The primary components of this design are two RV monitors executing within the untrusted domain and a hardware security module (HSM) providing the trusted domain of the TEE. The chosen example HSM is a USB stick, comprising a micro-controller (MCU), a crypto co-processor providing hardware cipher acceleration and true random number generation (TRNG), as well as flash memory to store long-term keys. In this manner, cryptographic primitive and key management code are kept out of reach of malware that can potentially infect the OS and applications inside the untrusted domain. The co-processor in turn can be chosen to be one that has undergone extensive side-channel security analysis, thus mitigating the remaining hardware-related threats (e.g., [12]). The Crypto OS is executed by the MCU, exposing communication and access control interfaces to be utilized for HSM session negotiation by the protocol executing inside the untrusted domain, after which a cryptographic service interface becomes available (e.g., PKCS#11). In typical TEE fashion, cryptographic keys never leave the HSM. The proposed setup forgoes dealing with the verification of runtime-provisioned code, since the cryptographic services offered by the HSM are expected to remain fixed for long periods.
The RV monitors complete the TEE: They verify correct implementation of protocol steps and inspect all interactions with the hardware module, which happen through the network and external bus OS drivers respectively. Verifying protocol correctness leverages the high-level flavors of RV (in the rest of the paper we refer to this as function call tracing), checking that the network exchanges follow the protocol-defined sequence and that the correct decisions are taken following protocol verification steps (e.g., digital certificate verification). Inspecting interactions with the HSM, on the other hand, treats hooked functions as sources and sinks for information flow tracking, rather than as protocol steps. In both cases the monitors are proposed to operate at the binary (compiled code) level. The binary level provides opportunities to secure third-party protocol implementations, as well as optimized instrumentation applied directly at the machine instruction level. Overall, binary instrumentation is a widely-adopted technique in the domain of software security, including the availability of widely used frameworks (e.g., Frida) that simplify tool development. The higher-level RV monitor is tasked with monitoring protocol steps and as such, instrumentation based on library function hooking suffices. This kind of instrumentation can be deployed with minimal overheads.
The proposed RV aimed at medium level threats adopts a dynamic taint inference approach through a re-purposing of R. Sekar's taint inference algorithm [67], specifically porting it from a web application setup to process memory. When data (sources) originating from suspicious sites (e.g., network input, inter-process communication (IPC) or dynamically generated code) flows into Crypto OS call arguments, the Crypto OS calls represent the sinks. All these scenarios are candidates for malicious interactions with the HSM. In the reverse direction, whenever data flows resulting from Crypto OS call execution end up at the same suspicious sites, the calls present the tainted sources while the suspicious sites present the sinks. These are scenarios of malicious interactions targeting leaks of cryptographic keys/secrets, timing information or outright plaintext data. Whichever the direction of the tainted flows, the same approximate matching operators can be applied between the arguments/return values of the sources/sinks.
Revisiting the threat levels introduced in the previous section, in the proposed RV-TEE: i) High level threats are covered through RV function call tracing (Sect. 4); ii) Medium level threats are covered through taint-inferring RV (Sect. 5); iii) The low level can be covered through complementary frameworks such as Frama-C in an offline manner; iv) The hardware level is covered by allowing the approach to work with any certified device of choice (Sect. 6). Finally, a nonce-based remote attestation protocol, e.g. [2], can optionally close the loop of trust: Executed by the Crypto OS, its purpose is to ascertain the integrity of the RV monitors in cases where they are targeted by advanced malware infections. Table 1 summarizes how RV-TEE can protect against high/medium/low/hardware level threats targeting cryptographic protocols as compared to the individual security controls it brings together. At this point, it becomes clear that RV-TEE's main proposition is to combine the level of protection provided by the individual state-of-the-art components into a comprehensive solution. Component aggregation is based on the blueprint for TEE design [61]. The level of security brought along by the individual components is specific to chosen tool/configuration/hardware. We will delve deeper into specific choices and evaluate their security in Sects. 4-6. The inclusion of an information flow-based RV component is made for completeness' sake. However this is not intended to form part of the comprehensive runtime solution, rather it is intended to be used in an offline manner, e.g. during testing.

RV function call tracing
To test the feasibility of RV-TEE, both in terms of real-world codebase readiness and practical overheads, we choose a key agreement protocol -ECDHE [59] -and apply our approach to it. Although its design has been proven secure from an analytical point of view, its security in practice can be compromised if the protocol is not executed with all required precautions.
Three properties for secure ECDHE implementation are:
P1 Digital certificate verification in order to authenticate public keys sent by peers: If wrong certificates are sent, or else the correct ones fail verification when using a certificate chain that ends at a root certificate authority, the protocol should be aborted.
P2 Both session public keys are regenerated per session in the ephemeral version of the protocol and as such, both
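Property P1 can be captured by a small monitor that insists the session be aborted once certificate verification fails. The sketch below is our own illustration with hypothetical event names, not the Larva specification used in this paper:

```python
class CertVerificationMonitor:
    """Monitor for a P1-style property: after a failed certificate
    verification, the only acceptable next protocol event is an
    abort. Event names are illustrative assumptions."""

    def __init__(self):
        self.must_abort = False
        self.violated = False

    def step(self, event):
        if self.must_abort and event != "abort":
            self.violated = True   # session continued after a bad cert
        if event == "cert_verify_failed":
            self.must_abort = True
        elif event == "abort":
            self.must_abort = False
        return not self.violated

ok_trace = ["cert_verify_ok", "key_exchange", "finished"]
bad_trace = ["cert_verify_failed", "key_exchange"]  # should have aborted

m1 = CertVerificationMonitor()
ok = all(m1.step(e) for e in ok_trace)
m2 = CertVerificationMonitor()
bad = all(m2.step(e) for e in bad_trace)
```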

Applying RV to the context
Larva [17] has been available for a decade, with numerous applications in various areas [16]. The advantage of Larva is that, being automata-based and having a Java-like syntax, it offers a gentle learning curve. Furthermore, it has a number of features which come in handy when applying it to protocol verification. Basic sequence of events: At its simplest, a protocol involves a number of events which should follow a particular order. Each event corresponds to a hooked library function call. In Listing 1, the first two transitions deal with the start of a new session (sslImport and prConnect).
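The automaton idea behind such event-ordering properties can be sketched in plain Python. The event names below mirror the hooked calls from Listing 1 but are illustrative stand-ins, not Larva syntax:

```python
# Minimal sketch of an automaton-based monitor for the basic event
# ordering of a session: sslImport must precede prConnect, and prClose
# ends the session. Any out-of-order event drives the monitor to a
# dedicated violation state.

class SessionMonitor:
    def __init__(self):
        self.state = "start"

    def event(self, name):
        transitions = {
            ("start", "sslImport"): "imported",
            ("imported", "prConnect"): "connected",
            ("connected", "prClose"): "closed",
        }
        next_state = transitions.get((self.state, name))
        # an event with no matching transition is an ordering violation
        self.state = next_state if next_state else "violation"
        return self.state

m = SessionMonitor()
for ev in ["sslImport", "prConnect", "prClose"]:
    m.event(ev)
assert m.state == "closed"

bad = SessionMonitor()
bad.event("prConnect")   # connect before import
assert bad.state == "violation"
```

A generated Larva monitor follows the same shape: a transition table keyed on (state, event), with conditions and actions attached to transitions rather than bare state changes.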

Conditions and actions
The occurrence of an event is not always enough to decide whether it is a valid step of the protocol or not. Larva supports conditions and actions on transitions to perform checks on parameters, return values, etc. In the example (see lines 5-6 in Listing 1), this was necessary to ensure that the call to destroy the private key is a sub-call of close. Sub-patterns: Following software engineering principles of modularity, Larva allows matching to be split into sub-automata which can communicate their conclusions to each other and to their parent. The second property we are checking needs to ensure that whenever a session fails for some reason, it is properly aborted. Listing 2 shows a property describing a session 'abort' pattern; upon matching, the success is communicated (using abort.send in line 10) to other automata for which an abort is relevant. Figure 3 shows the second and third properties in their diagrammatic format. For clarity, we have removed some details which are not needed for the reader to understand the general idea.

Hooked functions
The complete list of hooked functions features in the list of Larva events shown in Listing 3. These events are in turn what trigger the monitoring automata to transition from one state to another. All functions are conveniently exported by NSS3, although freebl3 has to be re-compiled with debug symbols to allow for locating EC_ValidatePublicKey.

Firefox case study
Comprehension of Firefox's usage of NSS revealed an aggressively optimized implementation, with two design strategies being of particular relevance to our experiments: (i) TLS sessions are interleaved on the same thread whenever a specific URL is accessed over HTTPS; and (ii) these sessions execute concurrently with certificate verification on a separate thread. The main implication here is the need to separate individual TLS sessions in order to execute the RV monitors on separate sessions. This task is left to an individual TLS session filtering procedure described by Algorithm 1. Its first step is to identify the beginning and end of each TLS session. This is made possible through NSPR's file descriptors (fd), by pairing calls to SSL_ImportFD and PR_Close for the same fd. This pair and all intervening entries are extracted into their own slice, non-destructively (line 2). Each slice is iterated multiple times (lines 6-20). During the first iteration (lines 8-9) all pending function calls, and all their sub-calls, involving the same fd are pulled into a newly created TLS session trace by Match_ArgsRetVal. Similarly, all entries, and sub-calls, with a corresponding NSS context (cx) argument (referred to as cx_fd) are also included, since NSS's cx is pinned to NSPR's fd. Subsequent iterations also pull in calls that are not fd-based, and which do not happen to be sub-calls of the already included functions. In order to do so, a heuristic is employed based on SSL_AuthCertificateComplete and PR_Close and their sub-calls. These sub-calls necessarily belong to the same thread of execution as their callers, and comprise various PKCS#11 key derivation/encryption functions.
Once these sub-calls are included within the current trace as established by GetKeyAddressesSubCalls (lines 13-14 followed by 18), what remains missing are all other PKCS#11 calls that do not happen to be in these sub-calls, along with all other required hooked functions. Multiple iterations have to be executed in order to do so, adding function calls for every matching key-related argument or return value as established by GetKeyAddressesSubCalls. This is the heuristic part of the algorithm, with the underlying assumption being that concurrent TLS sessions do not make use of the same memory locations to store keys, as otherwise interference between threads would ensue. A second underpinning assumption is that each individual session either starts a key derivation sub-call sequence inside SSL_AuthCertificateComplete, or calls PK11_Encrypt on session completion (triggered by PR_Close).
The former occurs whenever the certificate verification thread loses the race with the ECDHE protocol thread, while the latter happens whenever Firefox knows it is sending the final GET/POST HTTP request and closes its end of the TCP connection. This approximate solution trades off precision for efficiency, as compared to tracing all threads at the instruction level, or having to update Firefox's source code to accommodate individual TLS session tracing. The heuristic fails whenever Algorithm 1 exits after the second iteration; however, it may still be effective in case all required hooked function calls already happen to be sub-calls of the included function calls. Ultimately, the non-deterministic behavior resulting from the optimized multi-threaded implementation is a factor.
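The fd-based first step of Algorithm 1 can be sketched as follows. This is a simplification: the entry format is an assumed (function, fd) pair, and the subsequent heuristic iterations over cx and key addresses are omitted:

```python
# Simplified sketch of Algorithm 1's first step: slice a flat call trace
# into per-session slices by pairing SSL_ImportFD and PR_Close on the
# same file descriptor. For brevity, only same-fd entries are kept; the
# full algorithm also pulls in cx- and key-address-related calls.

def slice_sessions(trace):
    slices = {}       # fd -> list of trace entries for that session
    open_fds = set()  # fds with a pending SSL_ImportFD
    for entry in trace:
        func, fd = entry
        if func == "SSL_ImportFD":
            open_fds.add(fd)
            slices.setdefault(fd, [])
        if fd in open_fds:
            slices[fd].append(entry)
        if func == "PR_Close" and fd in open_fds:
            open_fds.remove(fd)   # session ended; stop collecting
    return slices

# Two interleaved sessions on fds 3 and 5 are separated cleanly:
trace = [
    ("SSL_ImportFD", 3), ("SSL_ImportFD", 5),
    ("PR_Connect", 3), ("PR_Connect", 5),
    ("PR_Close", 3), ("PR_Close", 5),
]
s = slice_sessions(trace)
assert [f for f, _ in s[3]] == ["SSL_ImportFD", "PR_Connect", "PR_Close"]
assert [f for f, _ in s[5]] == ["SSL_ImportFD", "PR_Connect", "PR_Close"]
```

The non-destructive extraction in the real algorithm means entries remain available to other slices; the dictionary-of-lists above preserves that behavior since each entry may be appended to any open session sharing its fd.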

Experiments setup
Two experiments were set up. The first experiment, Bad_SSL, is intended to demonstrate the first RV property concerning certificate verification errors. It makes use of 11 sites, sub-domains of badssl.com, with known certificate issues. The second experiment, Top_100, based on the Alexa top 100 sites (as of 05/06/2019), sets out to demonstrate the practicality of the binary-level instrumentation. It also sheds light on Firefox's runtime behavior, verifying its expected correct execution with respect to EC public key validation and private key scrubbing, through the remaining RV properties. Furthermore, sessions that do not match any of these properties can also provide insight into the full-session to resumption ratio, as well as Algorithm 1's heuristic accuracy. Each site has its root URL accessed 10 times in a row, with all sessions automated through Selenium v3.141.0/geckodriver v0.24.0 on an Intel i7 3.6GHz x4 CPU/16GB RAM machine. Function hooking uses Frida v12.4.8 and is performed solely on Firefox's parent process, which is the process that takes care of all networking functionality over TLS. Table 2 shows that in Bad_SSL all sessions are eventually aborted on certificate verification failure, as evidenced by the property 1a matches and the absence of matches for 2a&2b. Property 3a matches are a consequence of ECDHE steps being executed concurrently with certificate verification inside a separate thread. As for Top_100, the 10 access requests per URL generate a total of 3,366 sessions. This is due to the fact that each page may in turn initiate further TLS sessions through ancillary HTTP requests generated by the initial HTML. None of these sites generated a certificate error, resulting in not a single session matching 1a&1b (as expected for frequently accessed sites).
The non-matching of property 2b and the very low number of property 3b matches indicate the expected correct behaviour with respect to EC public key validation and private key scrubbing, respectively. The six matches for the latter were traced to odd instances of non-returning SECKEY_DestroyPrivateKey calls, indicating some implementation quirk occurring during automated browser sessions. In fact, this scenario could not be reproduced with manual browser sessions. The numbers of combined matches for properties 2a&2b and 3a&3b, each being less than 3,366, require some context. Firstly, recall that TLS sessions may make use of session resumption rather than go through the full handshake. From the acquired traces we found 1,951 such sessions, lowering the expected combined total for each property to 1,415. The remaining discrepancy for 3a&3b (totaling 1,411) is accounted for by 4 sessions that get aborted for some reason even before the ECDHE and certificate verification threads execute. The gap for 2a&2b is additionally accounted for by 69 sessions that generated no alerts on exiting after iteration 2 of Algorithm 1, without managing to include the required calls into the trace by that time. This accounts for an effective accuracy rate of 0.9795 (1 - 69/3,366) for the underpinning heuristic. This is quite high, especially when considering the attained instrumentation efficiency. As shown by the RV function tracing row in Table 3, when comparing the Top_100 sessions executed with/without RV, the mean overhead is just 5.26%, with the pair-wise differences not even reaching statistical significance. A Wilcoxon signed-rank test returns a p-value of 0.281, indicating that external factors, e.g., network latency, server load and browser CPU contention, may have a larger impact than instrumentation.
In fact, it was observed that variations between pages (e.g., YouTube takes longer to load than Google) and the effect of browser caching even caused some instrumented runs to be faster than non-instrumented ones.

RV for taint inference
The experiment presented in the previous section represents an application of RV at a high level since the properties checked are related to the protocol specification. In this section, we present an application of RV which addresses medium level threats (see Sect. 2): We monitor information taint. Inspired by the work of Sekar [67], we make a number of modifications and apply the algorithm in our context.

Algorithm overview
The algorithm presented in [67] is based on the intuition that taint flows between sources and sinks can be inferred through sub-string edit distance: an approximate occurrence of s may be found in t even though there is no exact occurrence of s in t. This is useful given that data might be modified in transit, in which case taint inference using exact sub-string matching is likely to produce false negatives. For efficiency, the approach initially adopts a coarse-grained sub-string matching algorithm based on multisets, i.e., it just compares the number of occurrences of each alphabet character under consideration. This has the advantage that, as the shorter input string is slid over the longer input string for comparison, each window shift costs only constant additional computation (decrementing the count of the character which falls outside the sliding window and incrementing the count of the character which becomes visible in the window).
The coarse-grained matching result (a conservative version of it) is normalized and compared to a threshold d. The bigger d is, the more likely it is that exfiltrated data is detected even if modified to some extent. On the other hand, a bigger d also gives rise to a higher probability of false positives, i.e., two sub-strings being coincidentally similar. Note that this probability depends, among other things, on the length of the input string and the size of the alphabet. Once the coarse-grained comparison falls below the threshold, the relevant sub-strings are compared using the fine-grained algorithm to reduce the probability of false positives.
Fig. 4 The taint inference process using RV
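The coarse-grained pass can be sketched as follows. The exact normalization and threshold convention used in [67] may differ; this sketch assumes the normalized multiset distance is compared against d:

```python
# Sketch of the coarse-grained pass: slide a window of len(s) over t,
# maintaining a byte histogram incrementally, and report window offsets
# whose normalized multiset distance falls within threshold d.

from collections import Counter

def coarse_matches(s, t, d):
    n = len(s)
    if n == 0 or n > len(t):
        return []
    target = Counter(s)
    window = Counter(t[:n])
    hits = []
    for i in range(len(t) - n + 1):
        if i > 0:
            # constant-time window update: one byte leaves, one enters
            window[t[i - 1]] -= 1
            window[t[i + n - 1]] += 1
        # multiset difference: bytes of s not covered by the window
        diff = sum((target - window).values())
        if diff / n <= d:
            hits.append(i)
    return hits

# 'secret' occurs exactly at offset 4; a one-byte change still matches
# once some tolerance (d = 0.2) is allowed.
assert 4 in coarse_matches(b"secret", b"xxxxsecretxxxx", 0.0)
assert coarse_matches(b"secret", b"xxxxsecr3txxxx", 0.2) != []
```

Each reported offset would then be handed to the fine-grained edit-distance comparison, keeping the expensive step confined to a few candidate windows.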

Adaptations
The context of Sekar's work is protection against injection vulnerabilities such as SQL injection and cross-site scripting. This is rather different from our application of detecting data exfiltration in the context of a web browser. Rather than dealing with website inputs, in our case we deal with decrypted HTTP headers and content; rather than checking against the web application's outgoing requests, we check against the buffer dump from the browser's memory heap (Fig. 4 depicts the idea). Of course, these differences have significant implications on the specifics of the algorithm. Alphabet: The alphabet size considered in [67] varies between 40 and 70 (depending on whether the application being considered is case sensitive or not). In our scenario, we are dealing with a more generic byte stream (decoded from base 64) which might represent text as much as an image. Therefore it makes sense to have an alphabet covering the whole range of a byte, i.e., 256 symbols. Input string length: The length of the input strings being considered in our case is significant: in the top five sites, decrypted data per page load averaged 1352 bytes, while the heap size is around 1 MB per page. This contrasts sharply with the length of typical web application input and request strings. Time window: Given the sequence of HTTP responses received by a browser, the question arises: for how long do we keep checking the heap for particular HTTP response content? The attacker might delay exfiltration to avoid detection, so a longer time window makes the approach more robust. On the other hand, the longer the time window, the more expensive the approximate sub-string matching. Matching threshold: The matching threshold strikes a balance between false positives and false negatives, i.e., a low threshold reduces the possibility of reporting a match when there actually isn't one, but might easily miss matches where the attacker made slight modifications to the information.
Since the probability of false positives is a function of several elements, including the alphabet size and string length, we repeated the experiment reported in Figure 8 of [67] to include a bigger alphabet. We repeated the 40/0.33 and 70/0.7 experiments (i.e., alphabet sizes 40 and 70 with thresholds of 0.33 and 0.7, respectively), and additionally experimented with three thresholds for alphabet size 256. We note that, as expected, the probability of two random strings matching coincidentally becomes smaller as the alphabet size and string length increase (in the case of exact string matching, given an alphabet of size a and string length n, the probability of a match at a given position is 1/a^n).
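A quick numerical illustration of that formula (the helper name is hypothetical, purely for exposition):

```python
# Probability that a uniformly random string of length n over an
# alphabet of size a exactly matches a fixed string at one position.
# Larger alphabets and longer strings shrink it rapidly, which is why
# the 256-symbol byte alphabet tolerates higher thresholds.

def exact_match_prob(a, n):
    return a ** -n

assert exact_match_prob(2, 3) == 0.125          # 1/2^3
assert exact_match_prob(256, 2) == 1 / 65536    # 1/256^2
assert exact_match_prob(256, 4) < exact_match_prob(40, 4)
```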

Algorithm and complexity analysis
Building on the overview and adaptations presented in the previous subsections, we now provide a more detailed explanation of the algorithm as well as analyze its complexity.
The algorithm starts with the initialization of a number of parameters: we set the window size to the length of the (shorter) sink string. If the sink string is long, this might require a bigger threshold and, consequently, more calls to the fine-grained string matching function. Next, we set the threshold parameter; this determines which coarse-grained matches will be further considered for fine-grained matching. We multiply the threshold by the window size since, the bigger the window, the more tolerance we need to allow.
Following Sekar's approach, coarse-grained matching simply compares the byte histograms of both strings, computable in O(n + m), where n and m are the lengths of the compared strings. Given that our alphabet size is 256, since we do not restrict the byte value range, our sensitivity to a mismatch at this stage is substantially higher than that of the previous work.
The fine-grained algorithm is a dynamic programming implementation of edit distance calculation. Since in our case we have substantially larger strings, this becomes more expensive both in terms of time and space complexity. The algorithmic complexity of calculating the edit distance using dynamic programming is known to be O(nm).
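A standard dynamic-programming sketch of the fine-grained pass follows, using the usual single-row space optimization; the deployed implementation may differ in such details:

```python
# Fine-grained pass: O(n*m) time edit distance between a candidate
# window and the source string, keeping only one DP row (O(m) space),
# which matters given the large strings involved here.

def edit_distance(s, t):
    n, m = len(s), len(t)
    prev = list(range(m + 1))          # distances from "" to t[:j]
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[m]

assert edit_distance(b"secret", b"secr3t") == 1   # one substitution
assert edit_distance(b"secret", b"secret") == 0
```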

Implementation
From an RV perspective, the architecture we adopt is interesting due to its combination of online and offline monitoring: the online monitoring component of our setup involves dynamic binary instrumentation to obtain the sources and sinks for analysis. While this adds a certain level of overhead, we perform the more expensive sub-string matching operations on separate resources, i.e., offline. This matching can be run in parallel to the system (the browser) without hindering it from operating smoothly. The downside is that the monitor might fail to keep up with the system and therefore detect issues late, well after the data has been exfiltrated, but hopefully in time to warn the user and stop further breaches.
The offline part mirrors the algorithm described earlier in Sect. 5.1. More details regarding the online part are given below. Hooked functions: Taint source monitoring requires hooking ssl3_UnprotectRecord and tls13_UnprotectRecord. Both are functions internal to NSS3, which therefore necessitates re-compilation with debug symbols. Together, these two hooks cover all HTTP payloads decrypted within TLS <= 1.2 and TLS 1.3 sessions, respectively. In both cases the sslBuffer output parameter is used to dump the corresponding decrypted buffers. The NSS3-exported functions PR_Write and PR_Writev provide taint sink monitoring. The corresponding buffers are dumped using the buf parameter. Experiment setup: The setup for this experiment is the same as per Sect. 4.2. Taint sources are hooked inside the Firefox parent process using Frida, while taint sinks are hooked inside child processes executing web browser tabs. Due to Firefox's sandbox [50], which considers all child processes as low-integrity and sets them up in a highly restricted execution context, the use of static instrumentation, e.g., using LIEF [38], is necessary in order to inject instrumentation code. Frida-gadget conveniently packages the entire Frida DBI in a stand-alone shared library for use within such a setup.
While the parent process is responsible for all networking and TLS activity, these child processes are tasked with parsing web content and rendering it to screen. Memory-scraping malware, therefore, is most likely to be injected inside the tab processes to increase its chances of stealing information. On the other hand, we note that plaintext in the parent process gets overwritten as soon as subsequent ciphertext buffers get decrypted.

Results and Limitations
Starting with the sub-string matching aspect of the experiment, we note that our aim was to show that the approach is plausible. Further experimentation is required, however, to answer several questions, including the right threshold and time window to use for the algorithm.
Finally, a limitation of our current prototype concerns obfuscation of leaked information (e.g., compression, encryption, steganography) by malware. This issue can be addressed through RV rules that define what constitutes process tracing or injected code, e.g., any code that is dynamically loaded, and by identifying all its heap-accessing instructions as taint sinks. Further still, targeted attacks could employ malware that is aware of the taint inference RV monitor and may attempt to tamper with it. Unfortunately, this is part of the arms race between attackers and information security, which is always bound to happen. Remote code attestation could be considered in cases where this is deemed cost-effective.
The same experiment as per Sect. 4 was conducted, only this time with the taint inference-based RV activated. As shown by the RV for taint inference row in Table 3, in this case the 0.7% overhead is even lower than that of the higher-level RV, and also not statistically significant. While we do not expect this to be the case in general, in this particular setting the number of hooked functions for taint inference is much smaller than that used in the previous experiment. In any case, the results returned for this second monitor continue to affirm the practicality of the function hooking approach adopted by RV-TEE.

Real-world environment considerations for RV-TEE deployment
The results presented so far validate the approach for the individual RV components, but do not really tackle the question of whether RV-TEE as a whole is practical in a real-world environment. The statistics presented in Table 3 are promising with respect to the overheads introduced by the RV layer. Apart from these components, the remaining points of concern are: the introduction of the HSM (Sect. 6.1), which can potentially pose a bottleneck to protocol execution; along with evaluating RV-TEE's effectiveness in the presence of information-stealing malware (Sect. 6.2) that leaks information subsequent to HSM decryption. In this section, we focus on these two aspects, carrying out experimentation on the performance of the HSM module.

The SECube HSM
As a first experiment, we chose Blu5 Lab's SECube [10]. The encryption for each website was performed 10 times in a row on a Dual Core Intel Core i5-3317U CPU/6GB RAM machine. Next, we repeated the same experiment but, instead of using NSS, called an AES-GCM implementation on SECube.
Results: Table 4 shows the overheads recorded for the encryption operations on SECube when compared to Firefox's NSS library executing fully on the end-user's machine. Results are shown separately for the top five websites, as well as combined over all hundred websites. In each case, the total page load time is shown along with the portion taken up by NSS encryption. These values provide the context within which to analyze the increase in processing times once encryption is offloaded to SECube. In all cases, the increase in processing times is confirmed to be statistically significant by a Wilcoxon signed-rank test. While inevitably posing a bottleneck, due to the USB I/O involved, SECube's hardware specifications manage to keep overheads within a practically acceptable range: an average of 1723 ms may disturb the overall web browsing experience, but only slightly. To keep the overhead as small as possible, the next step would be to use hardware acceleration through an FPGA implementation of AES. Overall, this experimental setup shows that RV-TEE can be deployed at acceptable cost, both in terms of processing overheads and HSM costs.

Plaintext exfiltration case study
We simulated a banking trojan infection of the Firefox browser using the Metasploit exploitation framework (MSF) [49]. The simulation was designed to mimic all stealth techniques typically employed by such malware, as discussed in Sect. 2.4. Specifically, this setup employs multi-staged loading of the malware, with the initial malware payload being heavily obfuscated, while subsequent code loading employs the Reflective DLL injection technique [23], which maintains stealth by never touching the disk or operating system data structures. The C2 channel over which the additional code is loaded, as well as over which the subsequent exfiltration happens, is itself encrypted. Furthermore, the actual information stealing is pulled off through a memory dump using a perfectly legitimate command, procdump [26], without the need to break the Firefox sandbox [50]. Overall, this setup mirrors a malware infection that is very difficult to detect both at the host and the network levels, and is, therefore, representative of those scenarios where protection responsibility would fall on RV-TEE. Experiment setup: RV-TEE's taint inference is the component responsible for detecting any plaintext leaks, as explained in Sect. 5. As per earlier experimentation, we hooked ssl3_UnprotectRecord and tls13_UnprotectRecord in NSS3.dll for the taint sources. Moreover, RV-TEE's implementation was extended to perform function call tracing over all external processes that obtain a handle to any of the Firefox processes. Through this extension it becomes possible to hook potential taint sinks for the stolen plaintext, irrespective of whether malware gets injected into Firefox or plaintext is stolen by abusing process tracing. The full set of traced taint sinks comprises: Toolhelp32ReadProcessMemory and ReadProcessMemory in Kernel32.dll, and ReadProcessMemory in Kernelbase.dll.
In this manner we aim for early taint sink hooks, thereby avoiding the limitation of taint inference whenever sink strings are obfuscated or encrypted.
The sub-string matching threshold d is set to 0.1. While the analysis presented in Fig. 4 indicates that even d = 0.85 could result in an acceptable FP rate stemming from coincidental matching, we lowered this threshold to compensate for runs of consecutive 0 values occurring more frequently than at random. These occurrences are a result of i) memory pages being zeroed out before page re-allocation by operating systems, ii) wide-character encoding used by web browsers to support Unicode character sets, and iii) data structure padding employed by compilers. Beyond lowering the sub-string matching threshold, we therefore also extended RV-TEE's implementation to convert all wide-character strings to single-byte ones whenever all individual characters in the string have a leading 0 byte. Furthermore, all-0 string matches, which in any case carry no information, are discarded. Finally, all excessively small source strings, specifically those shorter than 20 bytes, were not considered.
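The wide-character folding step might look as follows, assuming UTF-16LE (as used by Windows and by Firefox internally), so that a "leading 0 byte" per character corresponds to every odd-indexed byte being zero:

```python
# Sketch of the wide-character folding step: if every other byte of the
# sink string is 0 (UTF-16LE encoding of Latin-1 text, assumed here),
# drop the zero bytes so the string can be compared byte-for-byte
# against the single-byte source string.

def fold_wide(buf):
    if len(buf) >= 2 and len(buf) % 2 == 0 and all(b == 0 for b in buf[1::2]):
        return bytes(buf[0::2])   # keep only the low bytes
    return buf                    # not uniformly wide: leave untouched

assert fold_wide("login".encode("utf-16-le")) == b"login"
assert fold_wide(b"binary\x01data") == b"binary\x01data"
```

Folding before matching also reduces the all-zero runs noted above, complementing the lowered threshold rather than replacing it.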

Results
The exfiltration functionality of the simulated banking trojan was activated whenever the browser was directed to the live.com webmail site and initiated an authenticated session. While the connection was protected through authenticated encryption over a TLS 1.3 session, the simulated malware nonetheless exfiltrated the decrypted email content directly from the browser's memory to a 1.07 GB dump file, which was then transferred to the C2 server.
The sheer, but realistic, size of this exfiltration operation places significant strain on the taint inference RV, with the matching operation taking 110 minutes to detect the first exfiltrated string. Segments of the source and sink strings concerned are shown in Listings 4 and 5. From these segments, it can be observed how the source string occurs as a slightly modified sub-string within the sink string. Successful detection is attributed to the approximate sub-string matching involved, as well as to the choice of function hooking that provides sufficiently early access to the exfiltration process. Eventually, the dumped memory content is fully encrypted prior to its transfer over the C2 channel.

Conclusions and future work
An RV-centric TEE, RV-TEE, targeting various levels of security threats ranging from high-level to hardware-level, has been proposed and applied to a protocol implementation, promising to improve the robustness of the implementation with minimal additional hardware and/or runtime overheads. A feasibility study of the approach has been carried out on a real-world third-party code-base, which implements a state-of-the-practice key establishment protocol.
To complement protocol-level RV, a second layer of RV was proposed for taint inference, monitoring the trust boundary against data exfiltration. Given the smaller number of functions hooked, the overheads are lower than in the first experiment. On the other hand, the analysis is significantly more cumbersome, but it can be carried out offline, even if it would benefit from a more optimized implementation. Additionally, a realistic attack mimicking a banking trojan controlled by an encrypted command and control (C2) channel demonstrated its practicality. An HSM component completes RV-TEE. In this regard, the SECube HSM was experimented with. While runtime overhead results show that the web browsing experience is somewhat affected, the on-chip FPGA, which harbors the potential for implementing encryption in hardware, is yet to be leveraged.
While the overall study of employing RV in the context of TEEs shows promise, we note that:
- Program comprehension is required, both for setting up function hooks as well as to enable individual TLS session monitoring. Moreover, real-world code tends to be written in a manner that favors efficient execution rather than monitorability, hence the need for an algorithm to filter individual sessions in our case study. However, in case RV is used on one's own code-base, support for RV could be thought out from inception, with these issues being somewhat alleviated.
- Adding RV to a system naturally requires trusting the introduced code. There are, however, several ways in which concerns in this regard can be addressed: (i) the RV code is generated automatically from a finite state automaton, thus reducing the possibility of bugs; (ii) more importantly, only the hooking code interacts directly with the monitored code. This separation ensures that RV interferes as little as possible with the monitored system.
In terms of future work, firstly, further HSM options can be considered. Following up on initial experimentation with AES-GCM, we also plan to implement ChaCha20-Poly1305 to complete the authenticated encryption options for TLS 1.3. Next is to consider a full secure key exchange implementation inside SECube, thereby pushing ECDHE's implementation to the HSM. In fact, its implementation could be pushed even further away from malware's reach, onto SECube's on-chip security controller. Featuring an ISO7816 interface and Global Platform 2.2 compatibility, this deployment approach would trade speed for further security. This security controller could also be leveraged for authenticated code provisioning, even though this may somewhat weaken overall security guarantees. Despite being resource-restricted, the security controller offers hardware-accelerated ECDHE and RSA to make up for it. The on-chip Lattice MachXO2-7000 could provide further practicality and security still. A hardware implementation of the symmetric cipher would provide increased encryption/decryption throughput as well as protection from side-channels related to non-constant-time key-related operations in the software implementation, all at one go. At first glance, this HSM setup is deemed promising to take RV-TEE even closer to practical deployment. Additionally, we intend to future-proof the proposed HSM by implementing one or more of NIST's PQC round 3 key establishment algorithms. Finally, the RV taint inference component of RV-TEE also deserves further attention in terms of an optimized implementation. In this case runtime overheads are not the issue; rather, we seek the additional benefit of timely detection of data exfiltration.