The Journal of Supercomputing

Volume 63, Issue 3, pp 691–709

Designing energy efficient communication runtime systems: a view from PGAS models

  • Abhinav Vishnu
  • Shuaiwen Song
  • Andres Marquez
  • Kevin Barker
  • Darren Kerbyson
  • Kirk Cameron
  • Pavan Balaji

Abstract

As the march toward exascale computing gains momentum, the energy consumption of supercomputers has emerged as a critical roadblock. While architectural innovations are imperative for achieving computing at this scale, it is largely up to the systems software to leverage those innovations. Parallel applications in many computationally intensive domains have been designed to exploit these supercomputers, using legacy two-sided communication semantics from the Message Passing Interface (MPI). At the same time, Partitioned Global Address Space (PGAS) models are being designed to provide global address space abstractions and one-sided communication for exploiting data locality and communication optimizations. PGAS models rely on one-sided communication runtime systems to leverage high-speed networks and achieve the best possible performance.

In this paper, we present a design for a Power Aware One-Sided Communication Library (PASCoL). The proposed design detects communication slack and leverages Dynamic Voltage and Frequency Scaling (DVFS) and interrupt-driven execution to exploit the detected slack for energy efficiency. We implement our design and evaluate it using synthetic benchmarks for the one-sided communication primitives Put, Get, and Accumulate, as well as uniformly noncontiguous data transfers. Our performance evaluation indicates that we can achieve a significant reduction in energy consumption without performance loss across multiple one-sided communication primitives. The achieved results are close to the theoretical peak available with the experimental test bed.
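The mechanism summarized above can be pictured as follows: when the runtime detects that a core is idle while a one-sided transfer is in flight (communication slack), it lowers the core's frequency through DVFS and replaces busy-polling with an interrupt-driven wait, restoring the frequency once the transfer completes. The minimal C sketch below illustrates this idea; it is not the authors' PASCoL implementation. The names energy_aware_wait and wait_for_remote_completion are hypothetical, the sysfs path assumes a Linux cpufreq "userspace" governor with write permission, and the sleep merely stands in for blocking on a network completion event (for example, an InfiniBand completion channel).

```c
#include <stdio.h>
#include <unistd.h>

/* Stand-in for an interrupt-driven completion wait (e.g., blocking on an
 * InfiniBand completion channel rather than busy-polling the completion
 * queue). Here it simply sleeps to emulate a core idle during the transfer. */
static void wait_for_remote_completion(void)
{
    usleep(1000);   /* pretend the remote transfer takes ~1 ms */
}

/* Request a new core frequency through the cpufreq sysfs interface.
 * Assumes the "userspace" governor is active and the caller may write. */
static void set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (f != NULL) {
        fprintf(f, "%ld\n", khz);
        fclose(f);
    }
}

/* Lower the frequency while waiting on a one-sided transfer, then restore it. */
static void energy_aware_wait(int cpu, long low_khz, long high_khz)
{
    set_cpu_khz(cpu, low_khz);     /* exploit the detected communication slack */
    wait_for_remote_completion();  /* sleep instead of polling at full speed */
    set_cpu_khz(cpu, high_khz);    /* scale back up before resuming computation */
}

int main(void)
{
    /* Example: drop core 0 to 1.2 GHz during the wait, restore 2.4 GHz after. */
    energy_aware_wait(0, 1200000, 2400000);
    return 0;
}
```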

Keywords

Communication runtime system · DVFS · Energy efficiency · InfiniBand



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Abhinav Vishnu (1)
  • Shuaiwen Song (2)
  • Andres Marquez (1)
  • Kevin Barker (1)
  • Darren Kerbyson (1)
  • Kirk Cameron (2)
  • Pavan Balaji (3)

  1. High Performance Computing Group, Pacific Northwest National Lab, Richland, USA
  2. Scalable Computing Lab, Virginia Polytechnic Institute, Blacksburg, USA
  3. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA
