Investigating Transactional Memory for High Performance Embedded Systems
We present a Transaction Management Unit (TMU) for Hardware Transactional Memories (HTMs). Our TMU enables three different contention management strategies, which can be applied according to the workload. Additionally, the TMU enables unbounded transactions in terms of size. Our approach tackles two challenges of traditional HTMs: (1) potentially high abort rates, (2) missing support for unbounded transactions. By enhancing a simulator with a transactional memory and our TMU, we demonstrate that our TMU achieves speedups of up to 4.2 and reduces abort rates by a factor of up to 11.6 for some of the STAMP benchmarks.
Keywords: Transactional memory · Contention management · Unbounded transactions · Embedded systems
1 Introduction

To fully utilize multicores, the ability to generate efficient parallel code is essential. Because fine-grained manual synchronization has proven to be very error prone, alternative synchronization mechanisms, such as transactional memory (TM), have evolved into a subject of research.
Implementations of hardware transactional memory have been integrated into commercial high-performance chips from Intel and IBM. Despite their benefits, currently available commercial HTMs (e.g. Intel's TSX) do not meet the high requirements of embedded systems. Commercial HTMs statically implement contention management strategies, which can lead to high abort rates; to meet their strict power budgets, embedded systems depend on low abort rates. Another disadvantage of COTS HTMs is that they bound transactions in several ways: the size of a transaction is limited by the capacity and associativity of the L1 cache, and transactions are aborted by events like interrupts, which limits their duration. This can degrade performance and complicates usability, leading to more programming errors, which is unacceptable for embedded systems given their increasing demands for computational power and their scarce resources.
For our work, we set two goals: (1) lowering abort rates by providing effective contention management, and (2) enabling transactions that are unbounded in size. We achieve these goals by implementing a Transaction Management Unit (TMU). The main contributions of this paper are: (1) the design of a flexible TMU, which enables the user to apply different contention management strategies, and (2) a solution that enables transactions of unbounded size.
The rest of this paper is structured as follows: after giving an overview of the state of the art in transactional memory, we describe our TMU. In the following section, our proposal is evaluated. At the end of the paper, we discuss related work and conclude by summarizing our results and describing future work.
2 State of the Art
A transaction is a sequence of instructions that is monitored by the transactional memory system. The beginning and the end of a transaction are usually marked by special instructions. To ensure a correct execution of the program, the transactional memory system has to ensure that every transaction fulfills the following three criteria: (1) Transactions have to be executed atomically, which means that they commit or abort as a whole. (2) Transactions have to run isolated, which means that they do not impact each other. (3) The transactional executions have to be serializable, which means that a sequential execution with a matching output exists.
To fulfill these criteria, the transactional memory system has to ensure that values consumed in a transaction are not modified by another transaction running in parallel. For this purpose, read and write accesses in a transaction are logged at cache-line granularity in a read and a write set. A conflict occurs whenever a write access of a transaction targets a cache line that has already been added to the read or write set of a competing transaction. A conflict also occurs when a read access touches a cache line already contained in the write set of another transaction. To keep track of read and write accesses inside transactions, HTMs usually utilize the cache coherence protocol.
If a conflict is detected, the HTM has to resolve it by aborting all but one of the conflicting transactions. This involves rolling back all memory modifications performed during the aborted transactions. Additionally, their read and write sets have to be cleared and their register files restored. Afterwards, the aborted transactions are restarted.
Another source of transactional aborts are the physical limits of the hardware and interrupts. Physical restrictions usually stem from the size or associativity of the L1 caches, which store the transactional read and write sets. Most HTM systems do not implement any mechanism that allows the read or write set to overflow the cache, which limits transactions in the number of consumed and modified cache lines. Interrupts limit a transaction's duration. Frequent aborts due to physical limits or interrupts can be critical for performance, since a significant amount of work has to be discarded.
Due to these physical restrictions, a programmer usually has to provide an alternative path of execution using a different synchronization mechanism that is not affected by physical hardware boundaries. This fallback path is a weak spot of transactional memory usage: its alternative synchronization can be error prone and, depending on the depth of parallelization, reduce performance. It also takes away one of the main advantages of transactional memory, its easy usability, because the fallback path increases the complexity of the parallel code. Providing an efficient alternative would render transactional memory superfluous.
3 Transaction Management Unit
3.1 Hardware Integration
3.2 Contention Management Strategies
priority: The transaction with the higher priority is allowed to continue. This strategy makes it possible to enforce an ordered commit of transactions and to prioritize one transaction over others. We would like to utilize this strategy in the future to enable various real-time strategies (e.g. mixed criticality).
timestamp: The transaction that started first can carry on. Taking the start timestamp of a transaction into account reduces the indeterminism of aborts and guarantees progress.
commit: The transaction on the core that has committed fewer transactions is allowed to continue. This strategy leads to a more balanced execution, because cores that were not able to commit transactions are favored when conflicts occur.
Unbounded transactions always overrule the contention management strategy.
3.3 Unbounded Transactions
Transactions have to abort whenever a transaction's read or write set exceeds the size or associativity of the L1 cache. Therefore, most HTMs have to provide a fallback mechanism consisting of an alternative execution path. This not only makes it harder for the programmer to write efficient and correct code, it can also be critical for performance because already computed work is lost. The TMU monitors the transactional execution and sets a bit whenever a transaction is forced to run in unbounded mode. While this bit is set, conflicts are resolved in favor of the unbounded transaction. Therefore, the transaction will never be aborted, which means it never has to be rolled back. Since the transaction is guaranteed to succeed, no backup version of a cache line is needed, which allows the transaction to use the entire cache hierarchy. The TMU can support only one unbounded transaction at a time. If another transaction or thread tries to perform a conflicting access, the TMU takes action to suppress it, e.g. by stalling the core.
4 Evaluation

For the implementation of our approach, we utilized the gem5 simulator. We selected the STAMP benchmark suite to evaluate our approach. In this section, we describe our evaluation methodology in detail, followed by the presentation of our results.
4.1 Simulation Methodology
[Table: simulator configuration (L1 data cache size, L1 data cache associativity) and STAMP input parameters]

bayes: -v32 -r1024 -n2 -p20 -s0 -i2 -e2
genome: -g256 -s16 -n16384
intruder: -a10 -l4 -n2038 -s1
kmeans: -m40 -n40 -t0.05 -i inputs/random2048-d16-c16.txt
ssca2: -s13 -i1.0 -u1.0 -l3 -p3
vacation: -n2 -q90 -u98 -r16384 -t4096
yada: -a20 -i inputs/633.2
4.2 Baseline Transactional Memory System
For our baseline, we implemented a transactional memory system in the ARM-based gem5 simulator. The interface is similar to those offered by Intel TSX and the newly announced ARM TME: it allows the programmer to explicitly start and end transactions using the corresponding commands provided by our transactional memory system.
Our baseline HTM detects and resolves conflicts eagerly: conflicts are detected instantly when the conflicting memory access occurs (in contrast to detecting them at commit time). When a conflict occurs, the transaction that detects the conflict aborts to resolve it.
We provide a fallback path with regular POSIX thread synchronization in our baseline implementation. In our baseline, as well as in the runs supported by our TMU, a transaction is executed in the fallback path if the attempt to execute it transactionally has failed 100 times. Retrying up to 100 times makes sense, because executing a transaction in the fallback path prevents the other cores from performing work. Whenever the read or write set of a transaction in our baseline exceeds the L1 cache, the transaction is directly executed in the fallback path.
bayes: For the benchmark bayes, we were able to beat the baseline for most executions. Up to 62 unbounded transactions are executed, which shows that implementing size-unbounded transactions is beneficial. We achieve the best speedup for the execution with four cores and the timestamp strategy. Considering the entire run time of the benchmark, the part in which transactions are executed is extremely short.
genome: The benchmark genome scales quite well. Our system produces the same results as the baseline. The features of the TMU become relevant for the executions with eight and sixteen cores, where we are able to significantly lower the number of aborted transactions. In these executions, transactions are started that do not fit in the L1 cache and would otherwise have to be executed in the fallback path. With our TMU, we are prepared for these cases and can continue without having to abort them.
intruder: For the benchmark intruder, we made observations similar to those for genome. The main difference is that the positive effects of our extensions only show for the execution with sixteen cores. For this execution, the baseline contention management strategy performs so poorly that we are able to lower the abort rate by a factor of 11.6. No transaction in this benchmark faces a capacity problem, which means the speedup stems solely from better contention management.
kmeans: For the benchmark kmeans, we were not able to outperform the baseline. The reason is that no transaction, in any execution, faces a capacity problem, so no unbounded transaction has to be executed. Additionally, the benchmark exhibits hardly any conflicts, which leaves little room for improvement.
labyrinth: The baseline speedup for the benchmark labyrinth is below one, because the benchmark launches fairly big transactions that do not fit in the L1 cache and are therefore aborted and executed in the fallback path. The already achieved computational progress is discarded, which is why the baseline falls below one. For this benchmark, our work is beneficial: we manage to raise the speedup above both the baseline and one.
ssca2: The benchmark ssca2 behaves similarly to kmeans. Hardly any of the almost 50000 committed transactions abort, and none of the transactions faces an issue with the capacity of the L1 cache.
vacation: The observations for the benchmark vacation are similar to those for kmeans and ssca2. Of the 4097 committed transactions, at most about 490 abort. Additionally, all transactions fit into the L1 cache, which is why no unbounded transactions are needed.
yada: The baseline execution of the benchmark yada suffers from many conflicts due to poor contention management: to commit around 4900 transactions, up to 33590 aborts occur. Additionally, some transactions face a problem with the capacity of the L1 cache. The TMU therefore handles up to 170 unbounded transactions, which benefits performance and allows us to improve on the baseline with both strategies and to achieve speedups greater than one.
In Fig. 3, we depict the number of aborts for every benchmark of the STAMP benchmark suite. The line labeled baseline depicts the baseline execution; the other lines each represent an execution with one of our contention management strategies. Because we want to show the benefits of the implemented contention management strategies, we disabled unbounded transactions for this evaluation.
For most of the benchmarks, we are able to lower the number of aborts, especially for executions with 8 and 16 cores. For these executions, the contention management strategy of the baseline performs poorly, and we reduce the number of aborts significantly for some benchmarks.
Our evaluation showed that our work is beneficial to a transactional memory system in terms of performance and abort rates.
5 Related Work
The authors of  focused on unbounded transactions. They proposed a permissions-only cache, which enables large transactions by tracking only read and write bits without the corresponding data. Once the permissions-only cache overflows, one of two proposed implementations for handling unbounded transactions can be utilized: ONE-TM-Serialized allows only one overflowed transaction at a time and stalls the other cores, while ONE-TM-Concurrent allows several transactions to run in parallel, although only one of them can run in overflowed mode. Our approach possesses the same runtime characteristics as ONE-TM-Concurrent; the difference lies in the management of the unbounded transaction. In contrast to , where the authors propose per-block metadata to ensure the correct execution of the unbounded transaction, we use our TMU to manage and protect the unbounded transaction. Due to a LogTM-style  baseline transactional memory system, the authors' proposal is also able to survive interrupting actions performed by the operating system. Contention management strategies were not a subject of their work.
The authors of  adapted an HTM for embedded systems, focusing on energy consumption and complexity. They evaluate different cache structures and three contention management schemes (eager, lazy, forced-serial) with respect to complexity and energy consumption. The authors also provide a mechanism to support overflowing transactions (those exceeding cache limits) by running them in serial mode. Because the authors are very sensitive to complexity, they only allow a simple execution mode for unbounded transactions: running a transaction in serial mode suspends all other CPUs, so the overflowed transaction runs in isolation and can utilize the complete memory hierarchy. Our approach differs from , because we focus on performance and abort rates. Therefore, we provide a more capable execution mode for unbounded transactions. Furthermore, we offer several different and more sophisticated contention management strategies.
The works describing the most relevant contention management policies focus on software transactional memories (STM) [10, 11, 16, 17]. Later work has applied some of these contention management strategies to HTMs, e.g. . In contrast to our work, the authors focused on using HTM as a synchronization primitive for an operating system and on managing it in the scheduler, which made it necessary to implement a new, more complex HTM. Concerning contention management strategies, the authors focused on finding a policy that performs well in most cases. This makes sense, since the best-performing policy is workload dependent, as the authors also note.
6 Conclusion and Future Work
In our work, we present a TMU, which is located at the shared L2 cache and occupies less than 0.05% of the area of a 2 MB L2 cache. We provide three different contention resolution policies and enable unbounded transactions. In our evaluation, we did not consider the priority-based contention management strategy, because from our perspective it would not produce any interesting results concerning execution time; the priority strategy will be a focus of our future work. Through our evaluation with the gem5 simulator and the STAMP benchmark suite, we show that the TMU is beneficial for performance and reduces the number of aborted transactions.
The work we present in this paper is the foundation for several further features. To further increase performance, we would like to enable thread-level speculation, which will utilize the TMU to ensure correct execution. Our research also concerns fault tolerance utilizing transactional memory , where we will consider employing the TMU as well. Because safety is a major issue for embedded systems, we also want to utilize our TMU to enable mixed criticality and real-time guarantees for hardware transactional memories.
- 1.Amslinger, R., Weis, S., Piatka, C., Haas, F., Ungerer, T.: Redundant execution on heterogeneous multi-cores utilizing transactional memory. In: Berekovic, M., Buchty, R., Hamann, H., Koch, D., Pionteck, T. (eds.) ARCS 2018. LNCS, vol. 10793, pp. 155–167. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77610-1_12
- 2.Ananian, C.S., Asanovic, K., Kuszmaul, B.C., Leiserson, C.E., Lie, S.: Unbounded transactional memory. In: 11th International Symposium on High-Performance Computer Architecture, pp. 316–327, February 2005. https://doi.org/10.1109/HPCA.2005.41
- 3.ARM Ltd.: Transactional memory extension (TME) intrinsics. https://developer.arm.com/docs/101028/0009/transactional-memory-extension-tme-intrinsics. Accessed 13 Jan 2020
- 6.Minh, C.C., Chung, J.W., Kozyrakis, C., Olukotun, K.: STAMP: stanford transactional applications for multi-processing. In: 2008 IEEE International Symposium on Workload Characterization, pp. 35–46, September 2008. https://doi.org/10.1109/IISWC.2008.4636089
- 8.Damron, P., Fedorova, A., Lev, Y., Luchangco, V., Moir, M., Nussbaum, D.: Hybrid transactional memory. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 336–346 (2006). https://doi.org/10.1145/1168857.1168900
- 11.Guerraoui, R., Herlihy, M., Pochon, B.: Toward a theory of transactional contention managers. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of Distributed Computing, pp. 258–264 (2005). https://doi.org/10.1145/1073814.1073863
- 12.Herlihy, M., Moss, J.E.B.: Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 289–300 (1993). https://doi.org/10.1145/165123.165164
- 13.Intel Corporation: Intel Transactional Synchronization Extensions (Intel TSX) Overview. https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intel-transactional-synchronization-extensions-intel-tsx-overview. Accessed 23 Jan 2020
- 14.Moore, K.E., Bobba, J., Moravan, M.J., Hill, M.D., Wood, D.A.: LogTM: log-based transactional memory. In: 2006 The Twelfth International Symposium on High-Performance Computer Architecture, pp. 254–265 (2006). https://doi.org/10.1109/HPCA.2006.1598134
- 15.Rossbach, C.J., Hofmann, O.S., Porter, D.E., Ramadan, H.E., Aditya, B., Witchel, E.: TxLinux: using and managing hardware transactional memory in an operating system. In: Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, pp. 87–102 (2007). https://doi.org/10.1145/1294261.1294271
- 16.Scherer, W.N., Scott, M.L.: Contention management in dynamic software transactional memory. In: PODC Workshop on Concurrency and Synchronization in Java Programs, pp. 70–79 (2004)
- 17.Scherer, W.N., Scott, M.L.: Advanced contention management for dynamic software transactional memory. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of Distributed Computing, pp. 240–248 (2005). https://doi.org/10.1145/1073814.1073861