A survey on optimizations towards best-effort hardware transactional memory

Abstract

Transactional memory has attracted increasing attention in recent years, as it provides optimistic concurrency control for shared-memory parallel programs. The rapid development and wide adoption of transactional memory make this programming paradigm promising for achieving breakthroughs in massively parallel computing. There has been a large body of work on transactional memory systems, which aims at providing relatively simple and intuitive synchronization constructs for shared-memory parallel programs without sacrificing performance. Hardware transactional memory (HTM) has become commercially available in mainstream processors. However, due to several inherent architectural limitations that abort hardware transactions, such as cache overflows, context switches, and hardware as well as software exceptions, today's HTM systems are best-effort, which necessitates the adoption of a software fallback path to ensure forward progress. In this paper, we survey state-of-the-art software-side optimizations for best-effort hardware transactional memory systems, as well as several novel performance tuning techniques. Research efforts on the joint usage of HTM and non-volatile memory (NVM) are also discussed.

Introduction

Multi-core architectures have gained dramatically increasing popularity in modern processors, and massively parallel processing has become the major trend in building high-performance computational systems. As a consequence, constructing efficient synchronization mechanisms that better unveil the potential performance of multi-core processors has become a major challenge in building parallel applications. The advancements in transactional memory provide new opportunities for parallel computing. Multiple academic and industrial fields, such as in-memory databases, cloud computing, and high performance computing (HPC), may potentially benefit from the rapid development and wide adoption of transactional memory.

Conventionally, parallel programmers employ lock-based schemes to cope with synchronization issues: the programmer explicitly specifies lock-protected critical sections to synchronize concurrent accesses to shared data. Judged by the sizes of the critical sections, lock-based schemes can be categorized into coarse-grained and fine-grained locking. Though coarse-grained locking is relatively straightforward, it can hardly achieve high scalability. Conversely, fine-grained locking provides improved performance; however, it is more error-prone and harder to reason about.

Motivated by the desire to integrate the accessible programmability of coarse-grained locking and the performance benefits of fine-grained locking into one parallel programming paradigm, Transactional Memory (TM) was proposed by Herlihy and Moss two decades ago (Herlihy and Moss 1993) to facilitate parallel programming. TM-based synchronization employs the transaction as the essential concurrency control unit, defined as a code block that executes in an all-or-nothing style: either the entire code block atomically takes effect or it discards all its changes to the shared data (Abadi et al. 2011). A provisional version of transactional constructs has been supported by C++ as a Technical Specification since 2015 (Zardoshti et al. 2019).

In recent years, mainstream commercial processors started to provide hardware support for transactional execution, making hardware transactional memory (HTM) widely available to the mass market. In particular, Intel extended its x86 instruction set architecture with the Transactional Synchronization Extensions (TSX) in June 2013 on selected multi-core processors based on the Haswell microarchitecture (Hammarlund et al. 2014), a primary representative of the first generation of commercially available HTM.

It is noteworthy that HTM typically has a best-effort nature: due to some intrinsic architectural constraints, HTM provides no progress guarantees to transactions (Yoo et al. 2013; Diegues et al. 2014). As a consequence, programmers are left with the responsibility of constructing a software fallback path for the unreliable hardware transactions. One intuitive and commonly-used solution is to let an aborted transaction keep retrying on the hardware path up to a predefined threshold, and to acquire a coarse-grained lock (usually a global lock) to serialize the concurrent transactions when the threshold is reached. Despite its simplicity, in the worst case the overhead of this global-lock-based fallback scheme approaches the cost of coarse-grained locking plus the expense of the aborted transactions, which exerts significant impact on the overall performance and efficiency of HTM.
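To make the retry-then-serialize policy concrete, the following C sketch models it without real TSX hardware: `try_hardware_txn` is a stub standing in for an `_xbegin()`/`_xend()` attempt (here it simply fails a configurable number of times), and the fallback is a spin lock. The function names and the retry threshold are illustrative, not taken from any specific system.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_RETRIES 8

/* Stub for a hardware transaction attempt; in real TSX code this would be
 * an _xbegin()/_xend() pair. It fails a fixed number of times so the
 * control flow is testable without TSX hardware. */
int failures_left;
int try_hardware_txn(void) { return failures_left-- <= 0; }

atomic_flag global_lock = ATOMIC_FLAG_INIT;
int took_fallback;

void run_critical_section(int simulated_failures)
{
    failures_left = simulated_failures;
    took_fallback = 0;
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
        if (try_hardware_txn())
            return;                    /* committed on the hardware path */
    /* threshold reached: serialize under the global fallback lock */
    while (atomic_flag_test_and_set(&global_lock))
        ;                              /* spin until the lock is free */
    took_fallback = 1;                 /* ...execute critical section... */
    atomic_flag_clear(&global_lock);
}
```

A transaction that aborts only a few times still commits in hardware; one that keeps aborting ends up serialized on the lock, which is exactly the worst case discussed above.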

To address the inherent limitations of hardware transactional memory, a substantial body of research has focused on designing efficient abort handling schemes. In this paper, we review the most recent cutting-edge research on constructing effective fallback mechanisms for best-effort HTM, as well as some innovative works aiming at reducing the transaction abort rate.

In summary, this survey makes the following contributions:

  1. This paper comprehensively presents general software-based approaches for complementing the weaknesses of hardware transactional memory.

  2. Several novel performance tuning tools and dynamic tuning proposals for HTM are surveyed.

  3. Practical joint usage of hardware transactional memory and non-volatile memory is discussed in this paper.

The rest of this paper is organized as follows. Section 2 presents a brief overview of transactional memory and the architectural characteristics of commodity HTM. Section 3 discusses state-of-the-art software fallback approaches for commodity best-effort hardware transactional memory. Section 4 presents research efforts towards performance tuning for hardware transactional memory. Section 5 surveys several HTM-based durable memory transaction schemes that require no hardware modifications to existing commodity HTM. Finally, Sect. 6 concludes this survey.

Background

This section presents the essential background on the architectural characteristics of HTM, to illustrate the software-assisted work required for building a practical HTM runtime.

Hardware transactional memory

To support synchronization management, processor hardware typically provides machine-level primitives for programmers, such as test-and-set and compare-and-swap; these instructions are guaranteed by the hardware to take effect atomically.

Higher level synchronization abstractions, such as lock, unlock, barrier, semaphore, etc., are constructed upon these machine-level primitives. With lock and unlock semantics, one could specify critical sections to control accesses to shared resource. With the lock-based coordination, only one thread can be in the critical section at a time. As a consequence, the lock-protected operations issued by concurrent threads are guaranteed to be mutually exclusive.

However, it is typically challenging to build highly efficient applications with locks. Regarding programming effort, it is easy and straightforward to implement correct mutual exclusion with a single lock. However, with such coarse-grained locking, concurrent operations are fully serialized regardless of whether they truly conflict. In contrast, fine-grained locking protects different pieces of data with different locks to avoid contention on a single lock and enable more parallelism. Of course, it is tricky to ensure correctness for parallel applications built with fine-grained locks: designing the algorithm requires careful thought, and improper use of locking can result in critical problems such as starvation, deadlock, and priority inversion, which are especially severe in highly concurrent situations.

To avoid the common problems faced by lock-based solutions, there has been a number of studies investigating lock-free programs, in which concurrent threads never prevent each other from making progress. However, constructing a lock-free algorithm also requires deliberate design and careful use of the machine-level synchronization primitives, which is error-prone and even harder than programming with fine-grained locks.

To make lock-free synchronization as efficient and easy to use as lock-based solutions, transactional memory was proposed. Transactional memory is a kind of declarative synchronization that only requires programmers to declare what should be synchronized, not how to implement it. In contrast, with imperative synchronization solutions such as locking and lock-free techniques, programmers have to explicitly synchronize concurrent operations with locks or machine-level synchronization primitives.

A transaction is a programmer-specified sequence of read-modify-write operations applied to multiple words of memory. The transactional memory system maintains the serializability and atomicity of transactions. Serializability means that operations inside one transaction never appear to be interleaved with operations belonging to any other transaction, and atomicity ensures that all the memory updates of a transaction either become globally visible instantaneously or are discarded together. More specifically, the transactional memory system maintains a read-set and a write-set for each transaction, with the read-set tracking the memory locations read by the transaction and the write-set keeping its tentative memory updates. A transaction performs the commit operation provided by the transactional memory system at the end of its execution to try to publish its memory updates to the global world. A commit attempt succeeds if and only if no data conflict is detected, which means that no memory location inside the transaction's read-set is kept in the write-set of any other transaction, and that it has no tentative update to memory locations residing in the write-set or read-set of other transactions. Upon a failed commit attempt, a transaction discards all its tentative updates. Additionally, a transaction may also explicitly give up execution via an abort operation.
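The read-set/write-set conflict rule above can be expressed compactly. The C sketch below is a toy illustration (real HTMs track these sets in the cache, not in arrays): a commit of transaction A conflicts with a concurrent transaction B if A read a location in B's write-set, or A wrote a location in B's read-set or write-set.

```c
#include <assert.h>
#include <stddef.h>

#define SET_CAP 16

/* Toy read-set/write-set representation, for illustration only. */
typedef struct { size_t addrs[SET_CAP]; int n; } addr_set;

int set_contains(const addr_set *s, size_t a)
{
    for (int i = 0; i < s->n; i++)
        if (s->addrs[i] == a) return 1;
    return 0;
}

/* Transaction A conflicts with concurrent transaction B if A read what B
 * wrote, or A wrote what B read or wrote. */
int conflicts(const addr_set *a_rd, const addr_set *a_wr,
              const addr_set *b_rd, const addr_set *b_wr)
{
    for (int i = 0; i < a_rd->n; i++)
        if (set_contains(b_wr, a_rd->addrs[i])) return 1;
    for (int i = 0; i < a_wr->n; i++)
        if (set_contains(b_rd, a_wr->addrs[i]) ||
            set_contains(b_wr, a_wr->addrs[i])) return 1;
    return 0;
}
```

Note that read-read sharing is deliberately not treated as a conflict: two transactions may read the same location and both commit.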

Data versioning and conflict resolution are two essential aspects of implementing a transactional memory system. Data versioning manages the committed and uncommitted versions of data for concurrent memory transactions; conflict resolution detects and handles data conflicts between transactions. Regarding architectural support for transactional memory, the processor cache is typically extended to enable data version management, and cache-coherence protocols can be extended to detect data conflicts between concurrent transactions. As a consequence, the transactional processing capability of such hardware transactional memory is mainly constrained by the volume of the CPU cache: transactions with memory footprints exceeding the cache capacity will be aborted. Besides, as the operating system does not preserve the transactional state of the CPU cache across thread scheduling, the duration of a hardware transaction is limited to the scheduler quantum of the operating system. Moreover, evicting a transactional entry from the cache also aborts a hardware transaction, because the cache coherence protocol cannot detect potential conflicts once the transactional entry is absent from the cache. In summary, hardware transactional memory is inherently a best-effort solution due to the limitations of hardware resources, especially the CPU cache.

Compared with conventional lock-based techniques or lock-free solutions, hardware transactional memory can be regarded as an optimistic concurrency control mechanism. When there is no data conflict, hardware transactional memory has nearly zero overhead, whereas locks and machine-level synchronization primitives incur a constant operating overhead even in conflict-free situations, because one has to pessimistically acquire the locks or perform expensive atomic operations to avoid data races regardless of whether there is a true synchronization conflict.

Intel\(^{\textregistered }\) transactional synchronization extensions

In June 2013, Intel released its Transactional Synchronization Extensions (TSX). TSX introduces a set of new instructions (TSX-NI) to the x86 instruction set architecture. Nowadays, commodity Intel processors are widely shipped with TSX.

TSX enables optimistic concurrency control in hardware: it allows the processor to speculatively (or transactionally) execute conventional lock-protected critical sections or user-specified memory transactions. Unlike the traditional mutual-exclusion programming model, TSX can avoid unnecessary synchronization overhead, for example, when concurrent threads access disjoint memory locations. TSX provides two software interfaces for programmers to specify transactional code regions: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM) (Intel Corporation 2016).

HLE is capable of converting legacy lock-protected critical sections into transactional code regions. The HLE extension involves the XACQUIRE and XRELEASE prefixes. To elide a lock via HLE, a programmer respectively deposits the XACQUIRE and XRELEASE prefixes in front of the conventional lock and unlock instructions as hints to invoke HLE. On processors without TSX, these hints are simply ignored and the lock-protected critical sections are serialized by normally acquiring and releasing the lock. On a TSX-capable processor, however, XACQUIRE and XRELEASE mark the beginning and end of a hardware transaction, in which the address of the elided lock is added to its read-set without performing the write operation associated with acquiring the lock. In other words, the lock is acquired speculatively: the logical processor does not issue any write request to the global state of the lock. Furthermore, with the lock address kept in the read-set, other logical processors can also speculatively acquire the lock and enter transactional execution mode simultaneously. Thus, critical sections that do not conflict can run in parallel with HLE. When a data conflict is detected, the transactional critical section falls back to acquiring the lock as the legacy code does. In summary, HLE provides a fast path for contention-free code regions, but it cannot avoid the drawbacks of locks, such as deadlock, since it falls back to acquiring the lock non-transactionally when a data conflict occurs.

RTM is transaction-centric: it provides xbegin and xend to explicitly delineate a transactional code region. Meanwhile, one can use xabort to explicitly instruct the processor to abort an RTM-based memory transaction, and there is an xtest instruction for querying whether the processor is executing a transactional code region. Compared with HLE, RTM is a relatively more flexible interface to transactional execution: programmers can define a fully customized fallback path for transactions that fail to commit. Note that a programmer-specified software fallback path is necessary for RTM, as it does not guarantee a successful commit.

While executing a transactional code region, the processor leverages the private CPU cache to conduct data versioning. More specifically, the CPU core executing a transaction buffers the uncommitted transactional updates in its private cache and takes exclusive ownership of the corresponding cache lines holding these transactional writes (transactional lines). Thus, if concurrent read or write requests are issued from other CPU cores to the transactional lines, the conflicting accesses can be captured by intercepting cache-coherence protocol messages. TSX ensures atomic commit for successful transaction executions; otherwise, all of the buffered transactional lines are invalidated.

The hardware features of TSX make it inherently a best-effort solution. Besides true conflicts, which prevent optimistic concurrency control from obtaining performance gains, there are several architectural restrictions that will abort the execution of a hardware transaction. Firstly, cache-based data versioning constrains the footprint of a transaction due to the limited cache volume: any transaction with a transactional footprint larger than the cache size will be permanently aborted by the hardware, even if it conflicts with no other transaction. Additionally, transaction-unfriendly instructions and events, such as I/O instructions, explicit cache line flushing operations, context switches, and page faults, may also abort a transaction. Lastly, as data conflict detection is performed at the granularity of a cache line in TSX, false sharing of cache lines may also lead to frequent aborts. As RTM provides no guarantee that a transactional code region will successfully commit, programmers have to provide an alternative non-transactional fallback path.

Research efforts taxonomy

To address the previously mentioned architectural restrictions, a large body of research has been conducted to refine best-effort HTM. In our survey, we categorize the selected works based on the hardware limitations they try to overcome and their main contributions to transactional programming with hardware-side transaction execution support. More specifically, we adopt the following taxonomy:

  • Research efforts towards the fallback path

  • Performance tuning techniques

  • Incorporation with non-volatile memory

Software fallback approaches

Due to the inherent best-effort property of HTM, a software fallback is necessary to provide progress guarantees for transactions that fail to commit successfully on the hardware path. One of the most commonly used approaches is to serialize the unsuccessful transactional code regions via a global lock; in this case, a hardware transaction rolls back to a non-transactional, lock-protected critical section.

Fig. 1: \(T_0\) commits inconsistently with {X=0, Y=1} observed

However, additional care must be taken to maintain consistency between the transactional executions and the non-transactional ones. Consider the example depicted in Fig. 1, which gives one possible interleaving of transactions \(T_0\) and \(T_1\), with \(T_0\) residing on the transactional hardware path and \(T_1\) falling back to the non-transactional, lock-protected path. Without loss of generality, we assume that X=0 and Y=0 at the beginning. First, \(T_1\) acquires the lock and stores 1 to Y. Then, \(T_0\) starts its transactional execution and observes an inconsistent view, X=0 and Y=1. Had both \(T_0\) and \(T_1\) run transactionally, Y would have been added to \(T_1\)'s write-set and \(T_0\)'s read-set, and thus a data conflict on Y would have been detected by the HTM. Unfortunately, the non-transactional update to Y bypasses the conflict check, which eventually results in an erroneous commit in this example.

This section presents a review of recent research efforts towards designing efficient software fallback for RTM.

Eager-subscription

Fig. 2: Eager subscription to the state of the global fallback lock (Calciu et al. 2014)

One simple way to preserve consistency and correctness between concurrent transactional and non-transactional executions is to prevent the transactional ones from running, which is the approach employed in Intel's HLE, as mentioned previously. As shown in Fig. 2, a hardware transaction checks the state of the global lock immediately after it starts, an action known as an eager check. If the lock is held, the transaction is simply forced to abort, as shown in cases (2) and (3) in Fig. 2. If the lock is observed to be free, the transaction proceeds on the hardware path. It is noteworthy that while checking the state of the lock, the lock is also added to the read-set of the transaction. Thus, the transaction subscribes to state updates of the lock, and any subsequent acquisition of the lock before its commit will abort it, as shown in cases (1) and (4) in Fig. 2.
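The eager policy reduces to a single check at transaction begin. The C sketch below is a simulation; in real RTM code the check would run right after `_xbegin()`, and the load of the lock word is precisely what places it in the transaction's read-set.

```c
#include <assert.h>

/* Returns 1 if the hardware transaction may proceed, 0 if it must abort
 * (in real code, via _xabort). Reading the lock word inside the transaction
 * subscribes it to the read-set, so any later lock acquisition aborts the
 * transaction before it can commit. */
int begin_txn_eager(int lock_held)
{
    if (lock_held)
        return 0;   /* cases (2) and (3) in Fig. 2: lock busy, abort */
    return 1;       /* lock free: continue speculative execution */
}
```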

Lazy-subscription

Fig. 3: Lazy subscription to the state of the global fallback lock, assuming no data conflict (Calciu et al. 2014)

The main drawback of the eager-check approach is that it completely disallows concurrent execution of speculative transactions and the ones on the lock-based fallback path, even if they are conflict-free. Motivated by this, the lazy-check approach was proposed (Calciu et al. 2014). Instead of reading the lock state immediately after entering the transactional region, a hardware transaction employing lazy checks postpones the subscription to the lock state until the end of its execution. More specifically, a hardware transaction reads the lock state right before committing; it commits if the lock is free, otherwise it explicitly aborts. As shown in Fig. 3, with the lazy check, case (4) can successfully commit.
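The only change relative to the eager scheme is where the check happens. A minimal C sketch of the commit-time decision (again a simulation; in real RTM code this check would sit immediately before `_xend()`):

```c
#include <assert.h>

/* Lazy subscription: the lock state is examined only at commit time, so a
 * hardware transaction may overlap a non-conflicting fallback execution.
 * A transaction that began while the lock was held can still commit if the
 * lock has been released by commit time (case (4) in Fig. 3). */
int commit_txn_lazy(int lock_held_at_commit)
{
    if (lock_held_at_commit)
        return 0;   /* _xabort(): a fallback execution is still in progress */
    return 1;       /* _xend(): no fallback active, safe to commit */
}
```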

As the lazy check introduces overlap between transactional and non-transactional executions, a hardware transaction may observe inconsistent states. In terms of correctness, the lazy-check mechanism relies on two guarantees. First, TSX provides hardware sandboxing, which aborts a transaction when it raises an exception; in other words, exceptions caused by inconsistent observations in one transaction will not affect other transactions. Second, TSX adopts strong atomicity, which means that a non-transactional execution will also abort a hardware transaction if their memory accesses contain conflicting locations. Thus, overlap between transactional and non-transactional executions becomes possible only if all the non-transactional accesses to the conflicting locations are ordered before the corresponding transactional accesses. If the lock is free at the end of a transactional execution, the hardware transaction can be sure that no non-transactional execution is in progress at that time, which is semantically equivalent to postponing the transactional execution until the non-transactional one completes.

Though the lazy check can increase the amount of concurrency between transactional and non-transactional executions, it may present a security hazard in some rare situations. As mentioned before, overlap between transactional and non-transactional executions may produce inconsistent observations within hardware transactions. If the inconsistent states raise exceptions, the corresponding transaction is aborted by the hardware. However, there are indeed situations where inconsistent states can lead to a jump to the commit instruction that entirely bypasses the lazy check, which eventually results in an incorrect commit of the hardware transaction.

Fine-grained

Both the eager check and the lazy check can be categorized as coarse-grained conflict detection approaches, which present a high false positive rate. An appealing idea is to build a fine-grained conflict detection mechanism to achieve further performance benefits (Dalessandro et al. 2011). However, typical solutions for accurate conflict detection, such as memory instrumentation, incur significant extra overhead. On the one hand, instrumenting memory accesses lengthens the time a transaction needs to complete its execution. On the other hand, the cache lines tracking the locations accessed by a transaction enlarge its memory footprint, which can result in capacity aborts in the worst case. Therefore, it is necessary to balance the trade-off between improving the accuracy of conflict detection and reducing the additional overhead imposed by annotating memory accesses.

Given that it is unnecessary to instrument the speculative threads when the lock is free, Dice et al. (2016) proposed an on-demand instrumentation scheme to reduce the time overhead of memory access annotation. At compile time, a slow path with instrumented code and a fast path with unmodified code are generated together for each transactional code region, which is supported by modern compilers such as gcc. The final decision on path selection is then made dynamically at run time according to the lock state.

Fig. 4: Global sequence number and local sequence number

With respect to improving the memory efficiency of recording the addresses accessed by a transaction, constructing per-transaction signatures with Bloom filters, which can approximately encapsulate the read-set and the write-set of a transaction in a single cache line, seems an ideal choice (Sanchez et al. 2007). However, the work by Calciu et al. (2014) observed that per-transaction signatures may produce a considerable number of spurious aborts on Haswell's HTM. When running concurrently with a non-transactional execution, a hardware transaction compares its signature with that of the software transaction; as a consequence, the signature of the software transaction is added to the read-set of the hardware transaction. Given the strong atomicity property of Haswell's HTM, any update to the signature of the software transaction raises a data conflict with the hardware transaction and eventually aborts it. Unfortunately, a transaction modifies its signature every time it accesses memory. Thus, even if the hardware transaction and the software one are entirely conflict-free, the hardware transaction may still be aborted for touching the signature of the software transaction while performing conflict detection before committing, which makes it prohibitively expensive to achieve fine-grained conflict detection via the signature-based approach on best-effort HTM. To avoid these spurious conflicting accesses during the validation phase, the work by Dice et al. (2016) adopts ownership records (orecs) instead of per-transaction signatures. Each memory address is associated with one of the orecs through hash mapping. Prior to every load or store instruction, the thread holding the lock acquires ownership of the associated orec, marking it as owned for read or write. The corresponding orecs are possessed by the lock holder until it completes the critical section.
Meanwhile, hardware transactions may detect potential conflicts by checking the associated orecs, and explicitly abort themselves when conflicting with the software transaction. More specifically, a global epoch counter is maintained to avoid unnecessary updates to orecs. As shown in Fig. 4, the lock-protected critical section increments the global epoch counter right after acquiring the fallback lock. Before accessing any shared location non-transactionally, the software path first marks the corresponding orec entry as owned by storing the current value of the global epoch counter into it. At the end of the critical section, the lock holder increments the global epoch counter again to start a new epoch, which implicitly releases its ownership of the corresponding orecs. Meanwhile, a hardware transaction reads a snapshot of the global epoch counter non-transactionally before starting transactional execution and regards orecs with values strictly less than the snapshot as unowned. Thus, a hardware transaction only needs to touch the orecs it is interested in for conflict detection, which means that updates to the other orecs performed by the software path will not abort it. Though there are still a few spurious aborts caused by sharing of orecs due to hash collisions, their number is dramatically reduced compared with the signature-based approach.

Fig. 5: Software fallback approaches comparison

Adaptive instrumentation

Given that hardware transactions cannot run concurrently with the non-transactional fallback path without relatively expensive instrumentation overhead, the work by Brown (2017) proposed a middle path between the fast path (uninstrumented hardware transactions) and the non-transactional software path. A hardware transaction initially starts along the fast path. If no conflicts occur, transactions commit more efficiently on the fast path without extra instrumentation overhead. Once a transaction has exhausted its retry opportunities and acquired the fallback lock, no transaction can run concurrently on the fast path anymore. Traditionally, all the transactions on the fast path must then be suspended and forced to wait for the transaction on the fallback path to complete, even if they are conflict-free. The middle path provides opportunities for transactions waiting on the fallback path to make progress. The middle path is similar to the fine-grained conflict detection mechanism: it instruments each individual memory access of a transaction to detect potential conflicts with the fallback path. However, the middle path is only used when there is a transaction on the fallback path, which means transactional execution pays the overhead of instrumentation adaptively, instead of incurring instrumentation all the time or waiting for the fallback path to complete.
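The resulting three-way dispatch might look like the following C sketch. The decision logic is our simplification of Brown's scheme; the actual algorithm involves additional transitions and retry accounting.

```c
#include <assert.h>

typedef enum { PATH_FAST, PATH_MIDDLE, PATH_FALLBACK } exec_path;

/* Adaptive path selection: pay the instrumentation cost of the middle path
 * only while the fallback lock is actually held by some transaction. */
exec_path choose_path(int fallback_lock_held, int retries_exhausted)
{
    if (retries_exhausted)
        return PATH_FALLBACK;   /* acquire the global lock */
    if (fallback_lock_held)
        return PATH_MIDDLE;     /* instrumented HTM, coexists with the lock */
    return PATH_FAST;           /* uninstrumented HTM, zero extra overhead */
}
```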

Auxiliary lock

The essential idea of HTM is to avoid unnecessary serialization of concurrent threads through optimistic speculation. Unfortunately, there are several anti-TSX patterns that let concurrent threads get into a situation where they continuously prevent each other from successfully completing the hardware speculation and eventually fall back to the lock-based software path. For example, consider two concurrent transactions with conflict-prone data access patterns that make them deterministically abort each other. Merely letting them retry will gradually exhaust their attempt counts and eventually force them to fall back to acquiring the exclusive lock, which seems the only way to terminate the continuous aborts. With such pathological access patterns and an inappropriate conflict resolution policy, HTM provides no benefits but adds extra mis-speculation overhead on top of the lock-based path. The phenomenon in which threads enter significantly extended periods of non-speculative execution and are prevented from benefiting from the underlying HTM is known as the lemming effect (Dice et al. 2008), which becomes especially severe under high contention.

The work by Dice et al. (2016) proposed the concept of an auxiliary lock: conflicting transactions are serialized and rejoin the hardware path instead of falling back to the software path. Thus, acquisitions of the global lock are dramatically reduced, which further alleviates the lemming effect.

Summary

As shown in Fig. 5, in comparison with the other approaches listed in this section, the main drawback of eager subscription is its poor parallelism: it conservatively forces transactional execution to abort whenever the fallback path is active. In contrast, lazy subscription obtains greater parallelism. However, with lazy subscription, intermediate updates of a user-specified transactional code region running on the software fallback path might become visible to concurrent threads. This security hazard makes lazy subscription less robust than the other solutions. The fine-grained methods require instrumenting each individual memory operation inside the transactional code region to detect potential data conflicts between the transactional execution and the non-transactional fallback.

Performance tuning techniques

Profiling tools for HTM

Hardware transactions that frequently abort due to inappropriate design patterns, such as inherent data conflicts, involving anti-TSX instructions, or having over-sized memory footprints, result in performance bugs. It is typically not intuitive to find performance bugs merely by code review. Though Intel's TSX extension provides a status code reporting the abort reason of a hardware transaction, it is too coarse to provide intuitive optimization guidelines for transactional programs. For example, the status code may explain a transaction abort as a data conflict, but more detailed information, such as at which memory location the data conflict occurs or which transactions participated in the conflict, cannot be obtained from the status code. Without fine-grained information, programmers cannot tune their code precisely. When HTM is applied to larger-scale programs, profiling tools are necessary to identify and explain the performance anomalies in hardware transactions. Several inherent hardware constraints of HTM impede directly profiling hardware transactions with traditional profilers. For example, the write-set of a hardware transaction cannot be tracked outside the transaction; otherwise the transaction will abort. Also due to the hardware limitations, interrupts abort hardware transactions, which makes single-step debugging impossible for speculative code regions. Additionally, all the memory access information of a hardware transaction is immediately discarded after an abort.
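As a concrete illustration of how coarse this information is, the sketch below decodes the RTM abort status word into a handful of categories. The bit positions mirror the `_XABORT_*` macros in Intel's `immintrin.h`; everything not covered by a status bit (unfriendly instructions, interrupts, page faults, etc.) is lumped into an indistinguishable "other".

```c
#include <assert.h>

/* Bit layout of the status returned by _xbegin() on an abort; the values
 * follow Intel's _XABORT_* definitions. */
#define XABORT_EXPLICIT (1u << 0)  /* aborted via the xabort instruction */
#define XABORT_RETRY    (1u << 1)  /* the transaction may succeed on retry */
#define XABORT_CONFLICT (1u << 2)  /* data conflict with another thread */
#define XABORT_CAPACITY (1u << 3)  /* footprint exceeded buffering capacity */

enum abort_class { AB_CAPACITY, AB_CONFLICT, AB_EXPLICIT, AB_OTHER };

enum abort_class classify_abort(unsigned status)
{
    if (status & XABORT_CAPACITY) return AB_CAPACITY;
    if (status & XABORT_CONFLICT) return AB_CONFLICT;
    if (status & XABORT_EXPLICIT) return AB_EXPLICIT;
    return AB_OTHER;  /* e.g. interrupt, page fault, unfriendly instruction */
}
```

Note what is missing: no conflicting address, no identity of the other thread, no indication of which instruction triggered the abort, which is exactly the gap that the profilers discussed below try to fill.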

Addressing the aforementioned challenges in debugging hardware transactions, Liu et al. (2015) proposed TSXProf. TSXProf adopts a two-phase record-and-replay scheme. In the first phase, the real-time ordering information is recorded. In the second phase, the program is replayed on a software transactional memory based runtime following the order tracked in the first phase. As software memory transactions are free from the hardware constraints of hardware ones, TSXProf can incorporate existing profiling tools targeting software transactional memory (Chakrabarti et al. 2011; Zyulkyarov et al. 2010; Ansari et al. 2009) to perform off-line diagnosis of performance bugs arising from hardware transactions.

TxSampler (Wang et al. 2019) needs only one execution phase and supports on-line profiling. TxSampler achieves lightweight hardware transaction profiling by utilizing the performance monitoring units (PMU) of Intel processors (Adhianto et al. 2010). More specifically, TxSampler collects relevant PMU samples, such as samples triggered inside a speculative path, inside a lock-protected fallback path, inside a lock-waiting code region, etc., and performs effective analysis to identify hot sections that consume more CPU cycles. The key insight of TxSampler is its decision-tree based optimization guidance for TSX programs. TxSampler first quantifies the execution time of critical sections and then analyzes the decomposed time consumption to determine whether a pinpointed bottleneck is caused by aborts of hardware transactions. Further, byte-level memory load and store events are sampled to locate the effective memory addresses that cause memory contention. Byte-level memory access tracking also enables false-sharing analysis. To reconstruct the call path for an abort point, TxSampler first uses call-stack unwinding to recover the call path from the main function to the initialization point of the aborted transaction. It then leverages the Last Branch Record (LBR) facility of Intel CPUs to recover the call path from the initialization point of the transaction to the instruction that triggers the abort. By combining the two call paths, TxSampler recovers the full call path from the entry point of the program to the abort point. Note that conventional call-stack unwinding cannot record the calling path inside a hardware transaction.

Adaptive tuning of the fallback strategies

As we mentioned in the previous section, a software-based non-speculative fallback path typically accompanies the speculative hardware path when building applications upon a best-effort HTM. Besides designing the fallback, programmers must decide when to activate it, which also exerts a significant impact on overall performance. The challenge in deciding the activation point of the fallback path arises from the high heterogeneity of TM workloads, which makes it impossible to find a static fallback policy that fits all possible circumstances. The suboptimal performance of static fallback configurations motivates researchers to dedicate efforts to adaptive tuning techniques.

One of the tuning knobs in transactional programs targeting best-effort HTM platforms is the retry threshold. Workloads with different contention levels may benefit from radically different threshold configurations. For low-contention workloads, a higher threshold gets more work done on the fast hardware path. For high-contention workloads, however, the overhead of transaction aborts caused by data conflicts frequently outweighs the performance gains acquired by raising the retry threshold.

Another parameter that should be dynamically configured is the policy for coping with capacity aborts. A capacity abort occurs when the underlying hardware lacks sufficient cache lines to maintain the read or write set of a hardware transaction. Unlike conflict aborts, retrying a big-footprint transaction that cannot fit within the cache restriction is fruitless. However, transactions with a moderate footprint may occasionally suffer capacity aborts as well; such cases are highly dependent on the state of the underlying shared memory and may benefit from retrying the failed runs.
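The two knobs above fit naturally into the standard retry skeleton around the hardware fast path. The sketch below is a minimal illustration, with `try_htm` modeling one hardware attempt that returns either a result or an abort reason; the names, the threshold value, and the "never retry on capacity" policy are illustrative simplifications of the adaptive policies discussed here.

```python
def run_with_fallback(try_htm, fallback, max_retries=5):
    """Generic retry skeleton for a best-effort HTM fast path.

    try_htm() returns ("ok", value) on commit or ("abort", reason)
    on abort; fallback() is the non-speculative slow path.
    """
    for _ in range(max_retries):
        status, payload = try_htm()
        if status == "ok":
            return payload
        if payload == "capacity":
            break  # footprint exceeds the cache: retrying is fruitless
        # on "conflict", fall through and retry up to max_retries times
    return fallback()  # non-speculative path guarantees forward progress

# A transaction that always overflows the cache goes straight to fallback:
attempts = []
def always_capacity():
    attempts.append(1)
    return ("abort", "capacity")

assert run_with_fallback(always_capacity, lambda: "fallback") == "fallback"
assert len(attempts) == 1  # the capacity abort short-circuits the retry loop
```

An adaptive tuner like the ones surveyed next would adjust `max_retries` (and the capacity policy) at run time rather than fixing them statically.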

Diegues and Romano (2014) proposed TUNER, which exploits reinforcement learning to adaptively configure transactional programs. A predominant feature of reinforcement learning is that it requires no off-line training phase, which allows TSX to be tuned dynamically in a workload-oblivious manner. Configurations are refined dynamically based on the performance feedback of the program. Moreover, to cope with the heterogeneity of transactions, each atomic block in the program is tuned individually instead of using a global configuration. TUNER reserves a copy of the parameters for each atomic block, identified via its program counter. Before entering an atomic block, a thread fetches the configuration parameters from TUNER and sets them up. After leaving the atomic block, TUNER collects performance statistics gathered during its execution, such as cycles consumed, abort rates, etc. Based on this feedback, TUNER decides either to keep the most recently used configuration or to explore an alternative one. This dynamic tuning procedure can be regarded as a classical reinforcement learning problem known as the multi-armed bandit (Sutton and Barto 1998), a solution to which is presented in Auer et al. (2002).
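The bandit formulation can be sketched with the UCB1 rule from Auer et al. (2002): each candidate configuration is an "arm", and the tuner balances exploiting the best-observed arm against exploring under-sampled ones. The arm set (retry thresholds) and the Bernoulli reward model below are illustrative, not TUNER's own.

```python
import math, random

class UCB1Tuner:
    """Per-atomic-block configuration chooser in the spirit of TUNER."""
    def __init__(self, arms):
        self.arms = arms
        self.counts = [0] * len(arms)     # pulls per arm
        self.totals = [0.0] * len(arms)   # accumulated reward per arm

    def select(self):
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                  # try every arm once first
        t = sum(self.counts)
        # UCB1 score: empirical mean + exploration bonus (Auer et al. 2002)
        scores = [self.totals[i] / self.counts[i]
                  + math.sqrt(2 * math.log(t) / self.counts[i])
                  for i in range(len(self.arms))]
        return scores.index(max(scores))

    def feedback(self, i, reward):
        self.counts[i] += 1
        self.totals[i] += reward

random.seed(0)
tuner = UCB1Tuner(arms=[1, 3, 5, 10])           # candidate retry thresholds
true_reward = {0: 0.2, 1: 0.5, 2: 0.9, 3: 0.4}  # hypothetical workload
for _ in range(2000):
    arm = tuner.select()
    tuner.feedback(arm, random.random() < true_reward[arm])
assert tuner.counts.index(max(tuner.counts)) == 2  # best threshold dominates
```

In TUNER's setting the "reward" would be derived from the collected statistics (e.g. commits per cycle) rather than a synthetic probability.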

Brown et al. (2016) investigate the performance of hardware transactions on NUMA machines. NUMA machines present significantly increased inter-socket communication latency, and this negative performance impact is amplified for hardware transactions: increased cross-socket communication latency enlarges the conflict window of hardware transactions, which eventually leads to performance drops due to more frequent data conflicts. To address this problem, Brown et al. (2016) adaptively tune transaction execution by throttling threads as necessary. In particular, the performance of hardware transaction execution is periodically profiled to determine whether constraining transactions within a single socket is more efficient. Additionally, a time-sharing mechanism provides progress guarantees for workloads that desire fairness among all the sockets.

Footprint reduction

As an optimistic scheme, HTM may suffer from high abort rates, especially when processing high-contention workloads. High abort rates further result in expensive retry overhead, which significantly constrains the power of HTM. In addition, the memory footprint of a hardware transaction is restricted by the size of the private cache (Armstrong et al. 2018). Huge transactions trigger capacity aborts and eventually fall back to the software path. Since huge transactions hold the global lock for relatively long durations, they prevent more speculative transactions from committing successfully, or slow them down. Even transactions with moderate footprints suffer high abort overheads under contention. A promising idea is to partition one big transaction into multiple small ones to reduce the transaction abort rate (Avni and Kuszmaul 2014; Xiang and Scott 2013, 2012).

Xiang and Scott (2015) proposed ParT to reduce cache-overflow aborts by partitioning transactional operations into a read-dominant planning phase and a write-dominant completion phase. An application-specific validator object maintains overall atomicity between the planning phase and the completion phase. The planning phase can be executed without transactional protection, with only the completion phase remaining a hardware transaction. The validator object carries information from the planning phase into the completion phase to enable validation of the planning operations. Thus, the completion phase executes with a reduced memory footprint if the planning work remains valid; otherwise, it may benefit from a faster retry that preserves parts of the planning work.
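A sorted-list insertion makes the split concrete: the search is the read-dominant plan, and the small completion step validates the planned position before writing. This sketch is only illustrative of the ParT pattern, with re-checking the planned neighbours standing in for an application-specific validator, and a Python list standing in for the shared structure; in ParT the completion phase would run as a short hardware transaction.

```python
import bisect

class PartitionedInsert:
    """ParT-style split execution for a sorted-list insert (sketch)."""
    def __init__(self, items):
        self.items = items   # sorted list standing in for the shared structure

    def plan(self, value):
        # Read-dominant planning: find the insertion index non-transactionally.
        return bisect.bisect_left(self.items, value)

    def complete(self, value, idx):
        # Write-dominant completion (would be the hardware transaction):
        # validate that the planned position is still correct, then write.
        ok_left = idx == 0 or self.items[idx - 1] <= value
        ok_right = idx == len(self.items) or self.items[idx] >= value
        if not (ok_left and ok_right):
            return False     # plan invalidated by a concurrent update
        self.items.insert(idx, value)
        return True

s = PartitionedInsert([1, 3, 7])
idx = s.plan(5)
s.items.insert(0, 0)                 # concurrent insert invalidates the plan
assert s.complete(5, idx) is False   # the validator rejects the stale plan
idx = s.plan(5)                      # replanning is just a cheap re-search
assert s.complete(5, idx) is True and s.items == [0, 1, 3, 5, 7]
```

The transactional footprint shrinks from the whole search path to the few cache lines touched by `complete`, which is exactly what avoids capacity aborts.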

Wang et al. (2017) proposed Eunomia, which tries to reduce re-execution overhead by partitioning huge transactions into multiple smaller ones. Given that most data conflicts are located in a few parts of a transaction, Eunomia resolves conflicts by re-executing at a finer granularity instead of retrying the entire transaction. Shortening the retry code path may provide considerable performance benefits. However, transaction decomposition fundamentally undermines the atomicity of the original transaction. To address this problem, Eunomia constructs a version-based consistency validation mechanism to provide atomicity guarantees for the partitioned transactions. A version number denotes the generation of the data touched by a transaction, and the decomposed transactions determine whether to retry partially or entirely by checking the version numbers. Thus, Eunomia performs transaction decomposition without incurring any inconsistency.
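The version-based validation can be sketched as follows: each segment of data carries a generation number, a decomposed transaction records the generations it observed, and validation reports which segments must be re-executed. The segment granularity and names here are illustrative, not Eunomia's actual design.

```python
class VersionedSegment:
    """A unit of data carrying a generation (version) number."""
    def __init__(self):
        self.version = 0

def run_decomposed(segments, work):
    """Run sub-transactions over `segments`, then validate by version.

    Returns the indices of segments whose generation advanced during
    execution, i.e. the only parts that need partial re-execution.
    """
    observed = {}
    for i, seg in enumerate(segments):
        observed[i] = seg.version       # record the generation at read time
        work(i)                         # the i-th sub-transaction's work
    return [i for i, seg in enumerate(segments)
            if seg.version != observed[i]]

segs = [VersionedSegment() for _ in range(3)]
def work(i):
    if i == 2:
        segs[1].version += 1            # a concurrent writer bumps segment 1

stale = run_decomposed(segs, work)
assert stale == [1]                     # only segment 1 needs re-execution
```

A monolithic transaction would have retried all three segments here; the version check confines the retry to the single conflicting one.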

Memory allocations impacts

HTM tracks the write set in the L1 cache and relies on the cache coherence protocol to detect data conflicts. As a consequence, even in the absence of data conflicts, cache-line false sharing among concurrent threads causes unnecessary transaction aborts (false aborts). In addition to the aborts caused by invalidation messages, cache-line evictions also force hardware transactions to abort. Li and Gulila (2019) address the false abort problem by redesigning the memory allocator to place objects that are likely to be accessed together by different threads in separate cache lines. Beyond false sharing among multiple threads, conflict misses (Hill and Smith 1989) can even cause repeatable aborts in single-threaded execution. Dice et al. (2015) study the impact of memory allocator placement policies on hardware transactions. Instead of modifying the hardware (Sanchez and Kozyrakis 2010) to alleviate the cache underutilization problem, the evaluation by Dice et al. (2015) shows that an index-aware memory allocator can reduce conflict misses and eventually improve HTM performance. Meanwhile, randomizing memory allocation sizes, which disrupts the regularity of cache-line placement, may also provide performance benefits.
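Two of the allocator-level tactics above reduce to simple size arithmetic: padding allocations to cache-line multiples so objects handed to different threads never share a line, and jittering size classes to break regular line placement. The 64-byte line size is typical for the x86 processors discussed here, and the 0-3 line jitter is an illustrative choice, not the policy of any cited allocator.

```python
import random

CACHE_LINE = 64  # bytes; typical for the x86 processors discussed here

def padded_size(obj_size, line=CACHE_LINE):
    """Round an allocation up to a cache-line multiple so two objects
    given to different threads never share a line (avoids false aborts)."""
    return ((obj_size + line - 1) // line) * line

def randomized_size(obj_size, line=CACHE_LINE, rng=random):
    """Add 0-3 lines of jitter to the size class, in the spirit of the
    size randomization evaluated by Dice et al. (2015), to disrupt the
    regular cache-line placement behind conflict misses."""
    return padded_size(obj_size, line) + line * rng.randrange(4)

assert padded_size(24) == 64            # a small object gets a full line
assert padded_size(70) == 128           # a line-crossing object gets two
assert randomized_size(24) % CACHE_LINE == 0
```

The trade-off is internal fragmentation: both tactics spend memory to buy fewer false aborts and conflict misses.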

General guidelines for transactional programming with HTM

Transactional memory is a promising programming paradigm for parallel shared-memory applications: it provides an intuitive high-level abstraction for programmers to specify atomic code regions as an alternative to conventional thread-level synchronization management. Application domains that can benefit from transactional programming have been comprehensively studied in the software transactional memory literature (Minh et al. 2008). With the release of commercial processors providing hardware support for transactional execution, practical programming guidelines for hardware transactional memory are attracting increasing attention.

Several constraints unique to hardware transactional memory stem from its inherent resource limitations, and they give rise to programming guidelines specific to it. In particular, Bonnichsen et al. (2015) suggest placing the first transactional modification to shared memory as close as possible to the end of a transaction; minimizing this distance reduces the time a transaction is exposed to data conflicts with others. Additionally, Stampede (Nguyen and Pingali 2017) studies the performance impact of transaction scheduling, which determines the relationship between low-level threads and memory transactions. On several transactional memory workloads, Stampede achieves significant performance improvements by decoupling transaction scheduling from transaction execution. With decoupled scheduling, a thread can temporarily suspend an aborted transaction and issue another transaction instead of insisting on the same one. These design considerations overlap with work-stealing (Blumofe et al. 1996) techniques for traditional parallel programs.

Exploiting benefits of aborted transactions

Due to its best-effort nature, HTM provides no progress guarantee. Despite making no changes to semantic state, a failed transaction may still yield significant performance benefits if it warms up the cache and branch predictor sufficiently before aborting (Izraelevitz et al. 2016). Motivated by this observation, Izraelevitz et al. (2017) proposed AAHTM (Always-Abort HTM). As its name implies, hardware transactions in AAHTM never commit; they serve as program-controlled prefetchers that warm up hardware structures such as the branch predictor and cache to accelerate subsequent execution over the fetched data. Combined with traditional locks, AAHTM can replace existing busy-waiting logic in lock acquisition to accelerate execution of the lock-protected critical section. To make better use of AAHTM, the number of hardware transactions serving as prefetchers is deliberately constrained to reduce transaction aborts caused by contention. When applied to lock types with a determined acquisition order, such as the ticket lock, AAHTM can be used more precisely to enable speculative prefetching only for threads close to acquiring the lock.

Incorporating with non-volatile memory

Emerging non-volatile memory (NVM) technologies offer promising features such as byte-addressability, large storage capacity, data persistence, etc. (Burr et al. 2010; Zhang et al. 2018, 2017a, b; Apalkov et al. 2013). In particular, Intel released its Optane DC Persistent Memory in mid-2018, which can be attached directly to the memory bus and enables software to manage durable data with conventional load/store instructions (Izraelevitz et al. 2019; Peng et al. 2019). Durable memory transactions are popular programming models for implementing effective byte-granularity persistent data management and for reasoning about the correctness of persistent memory programming (Volos et al. 2011). However, few durable memory transaction systems have managed to accelerate transactional execution with hardware transactional memory.

There are several challenges in combining HTM with NVM, in particular Intel’s TSX-based RTM and its DC Persistent Memory. TSX-based RTM records transactional memory accesses in the private CPU cache, which lies outside the persistence domain of a memory hierarchy where persistent memory acts as main memory. To guarantee a correct persistent write, software must explicitly write back the touched cache lines by issuing instructions such as clflush, clflushopt, and clwb. Both clflush and clflushopt invalidate the target cache line; as a consequence, issuing them inside a hardware transaction inevitably aborts the transaction. In addition, clwb is merely a performance optimization and provides no guarantee of retaining the cache line in the cache hierarchy. According to Intel’s official description, whether clwb triggers a transaction abort is implementation-specific; in other words, even this optimized instruction is not expected to be used in a transactional code region. Merely deferring the cache-line flush operations until after the transaction commits is still problematic. Cache-line evictions are subject to black-box hardware cache policies, outside the control of software. Once a hardware transaction has made semantic changes in the cache hierarchy, its updates are immediately visible to other transactions, which makes it impossible for software to reason about whether the value returned by a read has actually reached persistent memory. Moreover, when transactional updates become subject to hardware-controlled cache evictions, software cannot specify any ordering constraints, which also makes it impossible to build correct byte-level durable data management (Wu et al. 2019).

There have been several research efforts toward building durable memory transactions with hardware transactional memory: DudeTM (Liu et al. 2017), cc-HTM (Giles et al. 2017), and NV-HTM (Castro et al. 2018). Other efforts, such as PHTM (Avni et al. 2015) and DHTM (Joshi et al. 2018), require hardware features not available in current commodity processors and are not discussed in this survey. The former systems generally target hybrid DRAM/NVM main memory and rely on DRAM-based shadow memory as an intermediate layer between the volatile CPU cache and the non-volatile main memory. Note that the hybrid DRAM/NVM configuration mentioned here differs from the Memory Mode of Intel’s Optane DC Persistent Memory: Memory Mode merely utilizes NVM as a pool of volatile memory with larger capacity, and the software layer is unaware of persistent memory. In the hybrid DRAM/NVM configuration, by contrast, both DRAM and NVM are addressed in a single address space; the operating system perceives two kinds of main memory and can arrange byte-granularity durable data modification for the non-volatile memory. With this configuration, system software can operate DRAM and NVM within a single virtual address space.

In all the previously mentioned durable transaction systems, DRAM acts as an intermediate layer between the CPU cache and NVM. With this layout, hardware-controlled natural cache evictions are buffered by the DRAM, whereas data movement between DRAM and NVM is fully controlled by software. Hardware transactions can be supported as in legacy DRAM-only systems, and the durability of transactions can be enforced asynchronously between DRAM and NVM.

To construct the shadowing middle layer, cc-HTM adopts a software-based address translation table. Transactional memory accesses are instrumented to maintain this alias table. More specifically, a transactional read is redirected to query the alias table for the effective shadowing location holding the latest value; if the target location misses in the alias table, the read loads from NVM and fills the corresponding table entry. Meanwhile, a transactional write is performed at the shadowing location given by the alias table. Moreover, each entry in the alias table carries a timestamp to support entry eviction when the table is full.
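The redirection logic can be sketched with a toy alias table. This is a deliberate simplification of cc-HTM, with the capacity, the oldest-timestamp eviction rule, and the direct write-back of evicted entries to the NVM dictionary all illustrative assumptions rather than the paper's actual mechanism.

```python
import itertools

class AliasTable:
    """Toy cc-HTM-style alias table redirecting accesses to shadow DRAM.

    `nvm` stands in for home locations in persistent memory and
    `shadow` for the DRAM copies indexed by address.
    """
    def __init__(self, nvm, capacity=4):
        self.nvm = nvm
        self.shadow = {}              # addr -> (value, timestamp)
        self.capacity = capacity
        self.clock = itertools.count()

    def read(self, addr):
        if addr in self.shadow:       # hit: the latest value lives in DRAM
            return self.shadow[addr][0]
        value = self.nvm[addr]        # miss: load from NVM ...
        self._install(addr, value)    # ... and fill the table entry
        return value

    def write(self, addr, value):
        self._install(addr, value)    # writes land in the shadow location

    def _install(self, addr, value):
        if addr not in self.shadow and len(self.shadow) >= self.capacity:
            # Full table: evict the entry with the oldest timestamp,
            # draining its value back to the NVM home location.
            victim = min(self.shadow, key=lambda a: self.shadow[a][1])
            self.nvm[victim] = self.shadow.pop(victim)[0]
        self.shadow[addr] = (value, next(self.clock))

nvm = {0: "a", 1: "b"}
tbl = AliasTable(nvm)
tbl.write(0, "A")
assert tbl.read(0) == "A"   # the read is redirected to the shadow copy
assert nvm[0] == "a"        # the NVM home location stays clean meanwhile
```

The key property, visible in the final assertions, is that hardware transactions only ever touch the volatile shadow; persistence is applied later under software control.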

Beyond software-based address translation, DudeTM boosts address translation with hardware support (Belay et al. 2012). Modern Intel CPUs enable a second level of address translation through the extended page table (EPT). With the framework proposed by Dune (Belay et al. 2012), user-level programs can modify the user page table without triggering any privilege-violation exception. DudeTM decouples memory transaction execution and data persistence into separate phases, so it can install different versions of the user page table for the different running phases. Meanwhile, the EPT is managed by the operating system to map guest physical pages to DRAM and NVM. The main drawback of hardware paging is the overhead of modifying the user page table: each time the user page table is updated, DudeTM has to perform a TLB shootdown to flush the TLB entries of all processors. Additionally, occupying the EPT hardware prevents this mechanism from being applied in virtualized computing environments.

NV-HTM’s shadow DRAM, referred to as the working copy, is generated automatically by the operating system. NV-HTM assumes durable data reside in a persistent heap and exposes the heap to hardware transactions as a private mapping of the physical non-volatile memory. Under the semantics of private memory mapping, updates to such memory regions are not propagated to the physical page; instead, the operating system makes a volatile copy of the page, the working copy, when it is first touched. As a consequence, updates to the persistent heap issued by hardware transactions are buffered in the working copies while the corresponding pages in non-volatile memory remain clean. A separate checkpointing process synchronizes the updates to persistent memory. However, establishing a working copy must be non-transactional, as the first write to a private copy-on-write page triggers a page fault that aborts the transaction. Meanwhile, a page fault triggered inside a hardware transaction bypasses the exception-handling procedure of the operating system. Consequently, a hardware transaction will exhaust its retry opportunities and fall back to the slow software path each time it writes to a new persistent page. Therefore, only transactional programs with high spatial locality benefit from the working copies generated from private mappings.

All of the previously mentioned durable transaction systems leverage log-based approaches to ensure crash consistency, which makes it necessary to instrument each individual transactional write to record the write set. This inevitably enlarges the memory footprint and results in higher abort rates. Meanwhile, they adopt asynchronous checkpointing mechanisms to synchronize the shadow DRAM pages with their NVM-side home locations. The serialization order of the transactions must be carefully preserved during checkpointing; this order can be acquired from the timestamp counter of Intel CPUs via the RDTSC instruction.
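The order-preserving replay at the heart of such checkpointing can be sketched as a timestamp merge of per-thread write logs. Each log entry is `(tsc, addr, value)`, where `tsc` models the value read via RDTSC at commit time; the log format and function names are illustrative.

```python
def checkpoint(logs, nvm):
    """Apply per-thread write logs to NVM home locations in commit order.

    Merging all logs by timestamp reproduces the transactions'
    serialization order, so later commits overwrite earlier ones.
    """
    merged = sorted((e for log in logs for e in log), key=lambda e: e[0])
    for _, addr, value in merged:
        nvm[addr] = value        # later timestamps overwrite earlier writes
    return nvm

thread0 = [(10, "x", 1), (30, "x", 3)]   # two commits touching x
thread1 = [(20, "x", 2), (25, "y", 9)]
nvm = checkpoint([thread0, thread1], {})
assert nvm == {"x": 3, "y": 9}           # x reflects the latest commit (tsc 30)
```

Replaying the logs unordered could leave `x = 2` in NVM after a crash, which is exactly the inconsistency the timestamp order rules out.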

In NVM-only configurations where the DRAM device is fully replaced, these durable transaction solutions still work. However, it is more practical to adopt DRAM as the intermediate shadowing layer due to its relatively faster writes and significantly higher write endurance.

Conclusion

Transactional memory greatly simplifies parallel programming. Its rapid development, especially the advancement of hardware support for transactional execution, makes transactional memory promising for achieving breakthroughs in massively parallel computing. In this paper, we comprehensively discussed the software-side efforts that enforce progress guarantees for commodity best-effort hardware transactional memory. In addition, we reviewed several contributions to TSX performance optimization, including both profiling tools and adaptive performance tuning techniques. Finally, we analyzed the inherent hardware limitations of hardware transactional memory that challenge the construction of durable memory transactions for non-volatile memory. We expect the combination of HTM and NVM to make a great difference in future computing systems.

References

  1. Abadi, M., Birrell, A., Harris, T., Isard, M.: Semantics of transactional memory and automatic mutual exclusion. ACM Trans. Program. Lang. Syst. 33(1). https://doi.org/10.1145/1889997.1889999

  2. Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: Hpctoolkit: Tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 22(6), 685–701 (2010)


  3. Ansari, M., Jarvis, K., Kotselidis, C., Lujan, M., Kirkham, C., Watson, I.: Profiling transactional memory applications. In: 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 11–20. (2009)

  4. Apalkov, D., Khvalkovskiy, A., Watts, S., Nikitin, V., Tang, X., Lottis, D., Moon, K., Luo, X., Chen, E., Ong, A., et al.: Spin-transfer torque magnetic random access memory (stt-mram). J. Emerg. Technol. Comput. Syst. 9(2) (2013). https://doi.org/10.1145/2463585.2463589

  5. Armstrong, N., Felber, P., Gramoli, V.: Space-constrained data structures for htm (2018)

  6. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2), 235–256 (2002)


  7. Avni, H., Kuszmaul, B.C.: Improving htm scaling with consistency-oblivious programming. In: 9th Workshop on Transactional Computing, TRANSACT, vol. 14 (2014)

  8. Avni, H., Levy, E., Mendelson, A.: Hardware transactions in nonvolatile memory. In: Proceedings of the 29th International Symposium on Distributed Computing - Volume 9363, ser. DISC 2015, pp. 617–630. Springer, Berlin (2015). https://doi.org/10.1007/978-3-662-48653-541

  9. Belay, A., Bittau, A., Mashtizadeh, A., Terei, D., Mazières, D., Kozyrakis, C.: Dune: Safe user-level access to privileged CPU features. In: Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pp. 335–348. USENIX, Hollywood, CA (2012). https://www.usenix.org/conference/osdi12/technical-sessions/presentation/belay

  10. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996)


  11. Bonnichsen, L.F., Probst, C.W., Karlsson, S.: Hardware transactional memory optimization guidelines, applied to ordered maps. In: 2015 IEEE Trustcom/BigDataSE/ISPA, vol. 3, pp. 124–131. (2015)

  12. Brown, T.: A template for implementing fast lock-free trees using htm. In: Proceedings of the ACM Symposium on Principles of Distributed Computing, ser. PODC ’17, pp. 293–302. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3087801.3087834

  13. Brown, T., Kogan, A., Lev, Y., Luchangco, V.: Investigating the performance of hardware transactions on a multi-socket machine. In: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA ’16, pp. 121–132. Association for Computing Machinery, New York, NY (2016). https://doi.org/10.1145/2935764.2935796

  14. Burr, G.W., Breitwisch, M.J., Franceschini, M., Garetto, D., Gopalakrishnan, K., Jackson, B., Kurdi, B., Lam, C., Lastras, L.A., Padilla, A., et al.: Phase change memory technology. J. Vac. Sci. Technol. B Nanotechnol. Microelectron. Mater. Process. Measure. Phenomena 28(2), 223–262 (2010)


  15. Calciu, I., Shpeisman, T., Pokam, G., Herlihy, M.: Improved single global lock fallback for best-effort hardware transactional memory. In: Transaction on 2014 Workshop. ACM (2014)

  16. Castro, D., Romano, P., Barreto, J.: Hardware transactional memory meets memory persistency. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 368–377 (2018)

  17. Chakrabarti, D.R., Banerjee, P., Boehm, H., Joisha, P.G., Schreiber, R.S.: The runtime abort graph and its application to software transactional memory optimization. In: International Symposium on Code Generation and Optimization (CGO 2011), pp. 42–53. (2011)

  18. Dalessandro, L., Carouge, F., White, S., Lev, Y., Moir, M., Scott, M.L., Spear, M.F.: Hybrid norec: a case study in the effectiveness of best effort hardware transactional memory. SIGPLAN Not. 46(3), 39–52 (2011). https://doi.org/10.1145/1961296.1950373

  19. Dice, D., Kogan, A., Lev, Y.: Refined transactional lock elision. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’16, pp. 19:1–19:12. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2851141.2851162

  20. Dice, D., Herlihy, M., Lea, D., Lev, Y., Luchangco, V., Mesard, W., Moir, M., Moore, K., Dan, N., Sun, M.: Applications of the adaptive transactional memory test platform. Applications of the Adaptive Transactional Memory Test Platform Researchgate (2008)

  21. Dice, D., Harris, T., Kogan, A., Lev, Y.: The influence of malloc placement on tsx hardware transactional memory. arXiv:1504.04640 (2015)

  22. Diegues, N., Romano, P.: Self-tuning intel transactional synchronization extensions. In: 11th International Conference on Autonomic Computing (ICAC 14), pp. 209–219. USENIX Association, Philadelphia, PA (2014). https://www.usenix.org/conference/icac14/technical-sessions/presentation/diegues

  23. Diegues, N., Romano, P., Rodrigues, L.: Virtues and limitations of commodity hardware transactional memory. In: 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 3–14. (2014)

  24. Giles, E., Doshi, K., Varman, P.: Continuous checkpointing of htm transactions in nvm. SIGPLAN Not. 52(9), 70–81. (2017). https://doi.org/10.1145/3156685.3092270

  25. Hammarlund, P., Martinez, A.J., Bajwa, A.A., Hill, D.L., Hallnor, E., Jiang, H., Dixon, M., Derr, M., Hunsaker, M., Kumar, R., Osborne, R.B., Rajwar, R., Singhal, R., D’Sa, R., Chappell, R., Kaushik, S., Chennupaty, S., Jourdan, S., Gunther, S., Piazza, T., Burton, T.: Haswell: The fourth-generation intel core processor. IEEE Micro 34(2), 6–20 (2014)


  26. Herlihy, M., Moss, J.E.B.: Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th annual international symposium on computer architecture, ser. ISCA ’93, pp. 289–300. ACM, New York, NY, USA (1993). https://doi.org/10.1145/165123.165164

  27. Hill, M.D., Smith, A.J.: Evaluating associativity in cpu caches. IEEE Trans. Comput. 38(12), 1612–1630 (1989)


  28. Intel Corporation: Intel 64 and IA-32 Architectures Software Developer’s Manual (2016)

  29. Izraelevitz, J., Kogan, A., Lev, Y.: Implicit acceleration of critical sections via unsuccessful speculation. In: 11th ACM SIGPLAN Workshop on Transactional Computing, TRANSACT, vol. 16 (2016)

  30. Izraelevitz, J., Xiang, L., Scott, M.L.: Performance improvement via always-abort htm. In: 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 79–90 (2017)

  31. Izraelevitz, J., Yang, J., Zhang, L., Kim, J., Liu, X., Memaripour, A., Soh, Y.J., Wang, Z., Xu, Y., Dulloor, S.R., et al.: Basic performance measurements of the intel optane dc persistent memory module. arXiv:1903.05714 (2019)

  32. Joshi, A., Nagarajan, V., Cintra, M., Viglas, S.: Dhtm: Durable hardware transactional memory. In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 452–465 (2018)

  33. Li, X., Gulila, A.: Optimised memory allocation for less false abortion and better performance in hardware transactional memory. Int. J. Parallel Emerg Distrib. Syst. (2019). https://doi.org/10.1080/17445760.2019.1605605

  34. Liu, Y., Gottschlich, J., Pokam, G., Spear, M.: Tsxprof: Profiling hardware transactions. In: 2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 75–86. (2015)

  35. Liu, M., Zhang, M., Chen, K., Qian, X., Wu, Y., Zheng, W., Ren, J.: Dudetm: building durable transactions with decoupling for persistent memory. In: Proceedings of the twenty-second international conference on architectural support for programming languages and operating systems, ser. ASPLOS ’17. New York, NY, USA: Association for Computing Machinery, pp. 329–343 (2017). https://doi.org/10.1145/3037697.3037714

  36. Minh, C.C., Chung, J., Kozyrakis, C., Olukotun, K.: Stamp: Stanford transactional applications for multi-processing. In: 2008 IEEE International Symposium on Workload Characterization, pp. 35–46. (2008)

  37. Nguyen, D., Pingali, K.: What scalable programs need from transactional memory. In: Proceedings of the twenty-second international conference on architectural support for programming languages and operating systems, ser. ASPLOS ’17, pp. 105–118. Association for Computing Machinery, New York, NY (2017). https://doi.org/10.1145/3037697.3037750

  38. Peng, I.B., Gokhale, M.B., Green, E.W.: System evaluation of the Intel Optane byte-addressable NVM. In: Proceedings of the International Symposium on Memory Systems, ser. MEMSYS ’19, pp. 304–315. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3357526.3357568

  39. Sanchez, D., Yen, L., Hill, M.D., Sankaralingam, K.: Implementing signatures for transactional memory. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 40, pp. 123–133. IEEE Computer Society, Washington, DC, USA (2007). https://doi.org/10.1109/MICRO.2007.24

  40. Sanchez, D., Kozyrakis, C.: The ZCache: decoupling ways and associativity. In: 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 187–198 (2010)

  41. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)

  42. Volos, H., Tack, A.J., Swift, M.M.: Mnemosyne: lightweight persistent memory. SIGPLAN Not. 46(3), 91–104 (2011). https://doi.org/10.1145/1961296.1950379

  43. Wang, Q., Su, P., Chabbi, M., Liu, X.: Lightweight hardware transactional memory profiling. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19, pp. 186–200. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3293883.3295728

  44. Wang, X., Zhang, W., Wang, Z., Wei, Z., Chen, H., Zhao, W.: Eunomia: scaling concurrent search trees under contention using HTM. In: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

  45. Wu, Z., Lu, K., Zhang, W., Nisbet, A., Luján, M.: POSTER: quiescent and versioned shadow copies for NVM. In: 28th International Conference on Parallel Architectures and Compilation Techniques, PACT 2019, Seattle, WA, USA, September 23-26, 2019. IEEE, pp. 491–492 (2019). https://doi.org/10.1109/PACT.2019.00060

  46. Xiang, L., Scott, M.L.: Software partitioning of hardware transactions. ACM SIGPLAN Notices 50(8), 76–86 (2015)

  47. Xiang, L., Scott, M.L.: Compiler aided manual speculation for high performance concurrent data structures. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 47–56 (2013)

  48. Xiang, L., Scott, M.L.: MSpec: a design pattern for concurrent data structures. In: 7th ACM SIGPLAN Workshop on Transactional Computing (TRANSACT), New Orleans, LA (2012)

  49. Yoo, R.M., Hughes, C.J., Lai, K., Rajwar, R.: Performance evaluation of Intel® transactional synchronization extensions for high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’13. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2503210.2503232

  50. Zardoshti, P., Zhou, T., Balaji, P., Scott, M.L., Spear, M.: Simplifying transactional memory support in C++. ACM Trans. Archit. Code Optim. 16(3) (2019). https://doi.org/10.1145/3328796

  51. Zhang, W., Lu, K., Wang, X.: Versionized process based on non-volatile random-access memory for fine-grained fault tolerance. Front. IT & EE 19(2), 192–205 (2018). https://doi.org/10.1631/FITEE.1601477

  52. Zhang, W., Lu, K., Wang, X., Jian, J.: Fast persistent heap based on non-volatile memory. IEICE Trans. 100-D(5), 1035–1045 (2017). https://doi.org/10.1587/transinf.2016EDP7429

  53. Zhang, W., Lu, K., Luján, M., Wang, X., Zhou, X.: Fine-grained checkpoint based on non-volatile memory. Front. IT & EE, 18(2), 220–234 (2017). https://doi.org/10.1631/FITEE.1500352

  54. Zyulkyarov, F., Stipic, S., Harris, T., Unsal, O.S., Cristal, A., Hur, I., Valero, M.: Discovering and understanding performance bottlenecks in transactional applications. In: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 285–294 (2010)

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which helped improve the quality of the paper. This work is supported by the National High-level Personnel for Defense Technology Program (2017-JCJQ-ZQ-013) and NSF Grant 61902405.

Author information

Corresponding author

Correspondence to Kai Lu.

About this article

Cite this article

Wu, Z., Lu, K., Wang, R. et al. A survey on optimizations towards best-effort hardware transactional memory. CCF Trans. HPC 2, 401–414 (2020). https://doi.org/10.1007/s42514-020-00049-2

Keywords

  • Transactional memory
  • Parallel programming
  • Concurrency control