Topology-Aware Parallelism for NUMA Copying Collectors

  • Khaled Alnowaiser
  • Jeremy Singer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9519)


NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocating memory from local NUMA nodes. Researchers have suggested that the garbage collector should profile memory access patterns or use object locality heuristics to determine the target NUMA node before moving an object. However, these solutions are costly when applied to every live object in the reference graph. Our earlier research suggests that connected objects, represented by rooted sub-graphs, exhibit abundant locality and are well suited to NUMA architectures.

In this paper, we exploit the intrinsic locality of rooted sub-graphs to improve the performance of parallel copying collectors. Our new topology-aware parallel copying collector preserves the integrity of each rooted sub-graph by moving its connected objects as a unit to the target NUMA node. In addition, it assigns copying tasks to appropriate (i.e. NUMA-node-local) GC threads. For load balancing, our solution enforces locality in the work-stealing mechanism by stealing from local NUMA nodes only. We evaluated our approach on SPECjbb2013, DaCapo 9.12 and Neo4j. Results show a GC speedup of up to 2.5x and 37% better application performance.
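The task assignment and restricted stealing described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the collector's implementation: the class `NodeLocalScheduler`, its methods, the least-loaded placement rule, and the representation of a rooted sub-graph as a plain list are all hypothetical names and choices introduced for illustration.

```python
from collections import deque


class NodeLocalScheduler:
    """Toy model (hypothetical): one task deque per GC thread, grouped by NUMA node."""

    def __init__(self, threads_per_node):
        # threads_per_node maps a node id to its GC threads,
        # e.g. {0: ["t0", "t1"], 1: ["t2"]}
        self.peers = threads_per_node
        self.node_of = {t: n for n, ts in threads_per_node.items() for t in ts}
        self.queues = {t: deque() for t in self.node_of}

    def push_root(self, target_node, subgraph):
        """Hand a whole rooted sub-graph to one thread on the target node.

        Assumption: the least-loaded thread on that node is chosen.
        """
        thread = min(self.peers[target_node], key=lambda t: len(self.queues[t]))
        self.queues[thread].append(subgraph)
        return thread

    def next_task(self, thread):
        """Take own work first; otherwise steal only from same-node threads."""
        own = self.queues[thread]
        if own:
            return own.pop()  # owner works at the tail of its deque
        for victim in self.peers[self.node_of[thread]]:
            if victim != thread and self.queues[victim]:
                return self.queues[victim].popleft()  # thief takes from the head
        return None  # no local work left: never steal across nodes


sched = NodeLocalScheduler({0: ["t0", "t1"], 1: ["t2"]})
sched.push_root(0, ["a", "b"])  # sub-graph rooted at "a", kept together
sched.push_root(0, ["c"])
sched.push_root(0, ["d"])
```

In this model a thread on node 1 stays idle even while node 0 has queued work, which is the locality-over-balance trade-off that a local-only stealing policy accepts.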


Keywords: NUMA, Multi-core, Work-stealing, Runtime support, Garbage collection



We would like to thank the University of Prince Sattam bin Abdulaziz for funding this research. We also thank the UK EPSRC (under grant EP/L000725/1) for its partial support.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. University of Glasgow, Glasgow, UK
