The Journal of Supercomputing, Volume 72, Issue 11, pp 4129–4159

Mitigation of NUMA and synchronization effects in high-speed network storage over raw Ethernet

  • Pilar González-Férez
  • Angelos Bilas


Current storage trends dictate placing fast storage devices in all servers and using them as a single distributed storage system. In this converged model, where storage and compute resources co-exist in the same server, the role of the network becomes more important: network overhead is a main limitation to improving storage performance. At the same time, server consolidation dictates building servers that employ non-uniform memory architectures (NUMA) to scale memory performance and bundling multiple network links to increase network throughput. In this work, we use Tyche, an in-house protocol for network storage based on raw Ethernet, to examine and address (a) the performance implications of NUMA servers on the end-to-end path and (b) synchronization issues with multiple network interfaces (NICs) and multicore servers. We evaluate NUMA and synchronization issues on a real setup with multicore servers and six 10 GBit/s NICs on each server, and we find that: (a) NUMA effects have a significant negative impact and can reduce throughput by almost 2× on our servers with as few as eight cores (16 hyper-threads). We design protocol extensions that almost entirely eliminate NUMA effects by encapsulating all protocol structures in a "channel" concept and then carefully mapping channels and their resources to NICs and NUMA nodes. (b) The traditional inline approach, where each thread accesses the NIC to post storage requests, is preferable to a queuing approach that trades locks for context switches, especially when the protocol is NUMA-aware. Overall, our results show that dealing with NUMA affinity and synchronization issues in network storage protocols allows network throughput between the target and initiator to scale by a factor of 2×, beyond 60 GBit/s.
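As a minimal illustration of the channel-mapping idea in the abstract (all names and structures here are hypothetical, not Tyche's actual code), one could assign each channel to the NUMA node its NIC is attached to, so that the channel's buffers and threads stay local to that node:

```python
# Illustrative sketch of NUMA-aware channel placement: each NIC is attached
# to one NUMA node; every channel bound to that NIC gets its resources
# (buffers, notification structures, worker threads) placed on the same node,
# so the end-to-end I/O path avoids remote memory accesses.

def assign_channels(nics, numa_node_of_nic, channels_per_nic=1):
    """Return a mapping channel_id -> (nic, numa_node).

    nics: list of NIC names.
    numa_node_of_nic: dict mapping NIC name -> NUMA node id (on Linux this
        could be read from /sys/class/net/<nic>/device/numa_node).
    """
    placement = {}
    cid = 0
    for nic in nics:
        node = numa_node_of_nic[nic]
        for _ in range(channels_per_nic):
            # Channel cid's memory would be allocated on `node`, and its
            # threads pinned to cores of that node.
            placement[cid] = (nic, node)
            cid += 1
    return placement

# Example: six NICs split across two NUMA nodes, as in the paper's testbed.
nics = [f"eth{i}" for i in range(6)]
node_of = {f"eth{i}": (0 if i < 3 else 1) for i in range(6)}
placement = assign_channels(nics, node_of)
```

In a kernel-level protocol the allocation itself would use node-local allocators (e.g. `numa_alloc_onnode` in user space or `kmalloc_node` in the Linux kernel) and CPU affinity for the channel's threads; the sketch only shows the mapping policy.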


Keywords: NUMA · Memory affinity · Synchronization · Network storage · Tyche · I/O throughput



We gratefully acknowledge the support of the European Commission under the 7th Framework Programme through the NanoStreams (FP7-ICT-610509) project, the HiPEAC3 (FP7-ICT-287759) Network of Excellence, and the COST programme Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)'.


Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. University of Murcia, Murcia, Spain
  2. FORTH-ICS, Heraklion, Greece
  3. University of Crete, Heraklion, Greece
