TOSTING: Investigating Total Store Ordering on ARM

  • Conference paper
Architecture of Computing Systems (ARCS 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13949)

Abstract

The Apple M1 ARM processors incorporate two memory consistency models: the conventional ARM weak memory ordering and the total store ordering (TSO) model from the x86 architecture employed by Apple’s x86 emulator, Rosetta 2. The presence of both memory ordering models on the same hardware enables us to thoroughly benchmark and compare their performance characteristics and worst-case workloads.

In this paper, we assess the performance implications of TSO on the Apple M1 processor architecture. Based on various workloads, our findings indicate that TSO is, on average, 8.94% slower than ARM’s weaker memory ordering. Through synthetic benchmarks, we further explore the workloads that experience the most significant performance degradation due to TSO.
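
To make the difference between the two memory models concrete, the following is a minimal sketch of the classic message-passing pattern, not code from the paper. The function name `message_passing_round` and its `payload` parameter are hypothetical, chosen only for this illustration. On TSO hardware, stores become visible in program order and loads are not reordered with other loads, so release stores and acquire loads are typically compiled to plain memory instructions there. On ARM's weaker model the same orderings typically require dedicated instructions such as `stlr` and `ldar`.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one message-passing round and returns the value the consumer saw.
/// Hypothetical helper for illustration; it is not taken from the paper.
pub fn message_passing_round(payload: u64) -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Write the payload first ...
            data.store(payload, Ordering::Relaxed);
            // ... then publish it. Release keeps the payload store from
            // being reordered after the flag store.
            ready.store(true, Ordering::Release);
        })
    };

    let consumer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Acquire keeps the payload load below from being reordered
            // before the flag load that observes `true`.
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        })
    };

    producer.join().unwrap();
    consumer.join().unwrap()
}
```

Calling `message_passing_round(42)` from a `main` function returns the published payload. The explicit orderings remain necessary at the language level even on TSO hardware, because the compiler may otherwise reorder the two accesses.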

Author information

Correspondence to Lars Wrenger.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wrenger, L., Töllner, D., Lohmann, D. (2023). TOSTING: Investigating Total Store Ordering on ARM. In: Goumas, G., Tomforde, S., Brehm, J., Wildermann, S., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2023. Lecture Notes in Computer Science, vol 13949. Springer, Cham. https://doi.org/10.1007/978-3-031-42785-5_10

  • DOI: https://doi.org/10.1007/978-3-031-42785-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42784-8

  • Online ISBN: 978-3-031-42785-5

  • eBook Packages: Computer Science, Computer Science (R0)
