Optimization of Uncore Data Flow on NUMA Platform

  • Qiuming Luo
  • Yuanyuan Zhou
  • Chang Kong
  • Mei Wang
  • Ye Cai
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8707)


The Uncore part of the processor has a profound effect on performance, especially in NUMA systems, since it connects the cores, last-level caches (LLC), multiple on-chip memory controllers (MCs), and high-speed interconnects. In our previous study, we investigated the data flow of several benchmarks in the Uncore of the Intel Westmere microarchitecture and found that the data flow of the Global Queue (GQ) and QuickPath Home Logical (QHL) suffers from serious imbalance and congestion. In this paper, we address the low efficiency of entries in the GQ and QHL. We set up an M/M/3 queue model for the data flow of the three trackers of the GQ and QHL, and then design a Dynamic Entries Management (DEM) mechanism that improves entry efficiency dramatically. The model is implemented in Matlab to simulate two different data flow patterns. Experimental results show that the DEM mechanism significantly reduces the stall cycles of the trackers: by almost 60% under smooth request sequences and by about 20% to 30% under burst request sequences.
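For readers unfamiliar with the notation, an M/M/3 queue has Poisson arrivals, exponential service times, and three servers. The sketch below computes the textbook Erlang C waiting probability and mean waiting time for such a queue. It is only an illustration of the standard M/M/c formulas, written in Python rather than the Matlab used by the authors. The function names and the example rates are assumptions, and the mapping of trackers to servers is not taken from the paper.

```python
from math import factorial


def erlang_c(arrival_rate, service_rate, servers=3):
    """Probability that an arriving request must wait in an M/M/c queue."""
    offered_load = arrival_rate / service_rate   # a = lambda / mu
    utilization = offered_load / servers         # rho = a / c
    if utilization >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    # Sum of a^k / k! for k = 0 .. c-1 (states with a free server).
    partial = sum(offered_load ** k / factorial(k) for k in range(servers))
    # Weight of the states in which all servers are busy.
    tail = offered_load ** servers / (factorial(servers) * (1.0 - utilization))
    return tail / (partial + tail)


def mean_wait(arrival_rate, service_rate, servers=3):
    """Mean time a request spends queued before service starts."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


if __name__ == "__main__":
    # Hypothetical rates: 2 requests per cycle offered to 3 servers
    # that each complete 1 request per cycle.
    print(erlang_c(2.0, 1.0), mean_wait(2.0, 1.0))
```

With these hypothetical rates the utilization is 2/3, so a sizeable share of requests already has to queue. In a model of this kind, time spent queued is the natural counterpart of the stall cycles that the abstract reports.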


Keywords: NUMA, Uncore, Data flow, DEM



Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Qiuming Luo (1, 2)
  • Yuanyuan Zhou (1)
  • Chang Kong (1)
  • Mei Wang (3)
  • Ye Cai (1, 2)
  1. Guangdong Province Key Laboratory of Popular High Performance Computers, SZU, China
  2. College of Computer Science and Software Engineering, SZU, China
  3. School of Computer Engineering, Shenzhen Polytechnic, China
