TOSTING: Investigating Total Store Ordering on ARM

Wrenger, Lars; Töllner, Dominik; Lohmann, Daniel

doi:10.1007/978-3-031-42785-5_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13949)

Included in the following conference series:

International Conference on Architecture of Computing Systems

Abstract

The Apple M1 ARM processors incorporate two memory consistency models: the conventional ARM weak memory ordering and the total store ordering (TSO) model from the x86 architecture employed by Apple’s x86 emulator, Rosetta 2. The presence of both memory ordering models on the same hardware enables us to thoroughly benchmark and compare their performance characteristics and worst-case workloads.

In this paper, we assess the performance implications of TSO on the Apple M1 processor architecture. Based on various workloads, our findings indicate that TSO is, on average, 8.94% slower than ARM’s weaker memory ordering. Through synthetic benchmarks, we further explore the workloads that experience the most significant performance degradation due to TSO.
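To make the difference between the two models concrete, the following is a minimal sketch of the classic message-passing pattern, written in Rust. It is not taken from the paper; the function name `message_passing_once`, the variables `data` and `ready`, and the payload value 42 are assumptions chosen for illustration. On TSO hardware, stores become visible in program order and loads are not reordered with other loads, so this pattern needs no additional ordering instructions. Under ARM's weaker model, the ordering has to be requested explicitly, which the `Release` and `Acquire` orderings below do in a portable way.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one message-passing round and returns the payload the consumer observed.
/// (Hypothetical helper for illustration; not part of the paper's benchmarks.)
fn message_passing_once() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Write the payload first.
            data.store(42, Ordering::Relaxed);
            // Release: the payload store above must not become visible
            // after this flag store.
            ready.store(true, Ordering::Release);
        })
    };

    let consumer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Acquire: once the flag is observed as true, the payload
            // written before the release store is visible as well.
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        })
    };

    producer.join().unwrap();
    consumer.join().unwrap()
}

fn main() {
    println!("consumer observed {}", message_passing_once());
}
```

With the release/acquire pair in place, the consumer always observes 42 on either memory model. If both flag accesses were `Relaxed` instead, the language would no longer guarantee this, and a weakly ordered ARM core would be allowed to expose the flag before the payload. Guarantees of this kind are what TSO provides implicitly in hardware for every ordinary load and store.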

Author information

Authors and Affiliations

Systems Research and Architecture Group, Leibniz Universität Hannover, Hannover, Germany
Lars Wrenger, Dominik Töllner & Daniel Lohmann

Corresponding author

Correspondence to Lars Wrenger.

Editor information

Editors and Affiliations

National Technical University of Athens, Athens, Greece
Georgios Goumas
Kiel University, Kiel, Germany
Sven Tomforde
Gottfried Wilhelm Leibniz Universität Hannover, Hannover, Germany
Jürgen Brehm
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Stefan Wildermann
Otto-von-Guericke University Magdeburg, Magdeburg, Germany
Thilo Pionteck

About this paper

Cite this paper

Wrenger, L., Töllner, D., Lohmann, D. (2023). TOSTING: Investigating Total Store Ordering on ARM. In: Goumas, G., Tomforde, S., Brehm, J., Wildermann, S., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2023. Lecture Notes in Computer Science, vol 13949. Springer, Cham. https://doi.org/10.1007/978-3-031-42785-5_10

DOI: https://doi.org/10.1007/978-3-031-42785-5_10
Published: 26 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42784-8
Online ISBN: 978-3-031-42785-5
eBook Packages: Computer Science (R0)
