Dealing with Layers of Obfuscation in Pseudo-Uniform Memory Architectures

  • Randolf Rotta
  • Robert Kuban
  • Mark Simon Schöps
  • Jörg Nolte
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10104)


Pseudo-Uniform Memory Architectures hide the memory's throughput bottlenecks and the network's latency differences in order to provide near-peak average throughput for computations on large datasets. This obviates the need for application-level partitioning and load balancing between NUMA domains, but the performance of cross-core communication still depends on the actual placement of the involved variables and cores, which can cause significant variation within applications and between application runs.

This paper analyses the pseudo-uniform memory latency on the Intel Xeon Phi Knights Corner processor, derives strategies for the optimised placement of important variables, and discusses the role of localised coordination in pUMA systems. For example, a basic cache-line ping-pong benchmark showed a 3x speedup between adjacent cores. Therefore, pUMA systems combined with support for controlled placement of small datasets are an interesting option when processor-wide load balancing is difficult but localised coordination is feasible.


Keywords (machine-generated, not provided by the authors): Cache Line · Memory Bank · Memory Architecture · Cache Coherence · Physical Address



This work was financed by the German Federal Ministry of Education and Research (BMBF) in the MyThOS project, grant no. 01IH13003C.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Brandenburg University of Technology Cottbus-Senftenberg, Cottbus, Germany
