Abstract
Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. Software transactional memory (STM) applications, in particular, exhibit irregular memory access behavior that makes it harder to determine which data each core will need, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we therefore propose a skeleton-driven mechanism that improves memory affinity for STM applications that fit the worklist pattern, using a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Second, it employs data prefetching helper threads to improve affinity at the cache level. The mechanism relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism achieves performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
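To make the DRAM-level idea concrete, the following is a minimal toy model, not the mechanism evaluated in the paper. The abstract does not name the page allocation policies it selects among, so the two policies here ("bind" and "interleave", named after the common Linux NUMA policies) are assumptions, as are all function names and the cost ordering used by `choose_policy`. The sketch places pages on NUMA nodes under each policy, measures the fraction of remote accesses and the load on the busiest node, and picks the cheaper policy for a given access pattern.

```python
from typing import List, Tuple

# (NUMA node of the accessing thread, index of the page it touches)
Access = Tuple[int, int]


def place_pages(num_pages: int, num_nodes: int, policy: str) -> List[int]:
    """Return the node holding each page under a toy placement policy."""
    if policy == "bind":
        # Assumed policy: every page lives on node 0.
        return [0] * num_pages
    if policy == "interleave":
        # Assumed policy: pages are spread round-robin over all nodes.
        return [page % num_nodes for page in range(num_pages)]
    raise ValueError(f"unknown policy: {policy}")


def remote_fraction(placement: List[int], accesses: List[Access]) -> float:
    """Fraction of accesses whose page sits on a different node."""
    remote = sum(1 for node, page in accesses if placement[page] != node)
    return remote / len(accesses)


def max_node_load(placement: List[int], accesses: List[Access],
                  num_nodes: int) -> float:
    """Share of all accesses served by the busiest node's memory."""
    served = [0] * num_nodes
    for _, page in accesses:
        served[placement[page]] += 1
    return max(served) / len(accesses)


def choose_policy(num_pages: int, num_nodes: int,
                  accesses: List[Access]) -> str:
    """Hypothetical selector: fewest remote accesses, then least contention."""
    def cost(policy: str) -> Tuple[float, float]:
        placement = place_pages(num_pages, num_nodes, policy)
        return (remote_fraction(placement, accesses),
                max_node_load(placement, accesses, num_nodes))
    return min(("bind", "interleave"), key=cost)


# Irregular, shared pattern: threads on both nodes touch every page.
SHARED = [(node, page) for node in range(2) for page in range(4)]
# Private pattern: only threads on node 0 touch the data.
PRIVATE = [(0, page) for page in range(4)]
```

In this model, `choose_policy(4, 2, SHARED)` selects interleaving, because both policies incur the same share of remote accesses but binding sends every access to a single node's memory. For `PRIVATE`, binding wins because all accesses stay local. A real implementation would derive the access pattern from the skeleton rather than from an explicit list.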
Notes
Remote read latency divided by local read latency (obtained from BenchIT).
Cite this article
Góes, L.F.W., Ribeiro, C.P., Castro, M. et al. Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications. Int J Parallel Prog 42, 365–382 (2014). https://doi.org/10.1007/s10766-013-0253-x
DOI: https://doi.org/10.1007/s10766-013-0253-x