Path-Synchronous Performance Monitoring in HPC Interconnection Networks with Source-Code Attribution

  • Conference paper
  • First Online:
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation (PMBS 2017)

Abstract

Performance anomalies involving interconnection networks have largely remained a “black box” for developers relying on traditional CPU profilers, and network-side profilers collect only aggregate statistics without source-code attribution. We have incorporated a protocol extension into the Gen-Z communication protocol for tagging network packets in an interconnection network; we back this extension with hardware and software enhancements that track the flow of a network transaction through every hop in the interconnection network and associate it back to the application source code. The result is a first-of-its-kind hardware-assisted telemetry across disparate, autonomous interconnection-network components with application source-code association, offering developers deeper insight. Our scheme works on a sampling basis to keep runtime overhead low, and it generates modest volumes of data. Simulation of our methods in the open-source Structural Simulation Toolkit (SST/Macro) shows their effectiveness: deep insight into the underlying network behavior at minimal overhead.

M. Chabbi—Work done while at Hewlett Packard Labs.
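
The abstract describes the mechanism only at a high level. As a rough illustration (not taken from the paper), the C++ sketch below shows one way sampling-based packet tagging with source-code attribution and per-hop recording could look; the type and function names (TxTag, HopRecord, maybe_tag_transaction, record_hop), the sampling policy, and the fields chosen are assumptions made for this sketch and are not part of the Gen-Z extension defined by the authors.

```cpp
// Hypothetical sketch of sampling-based packet tagging with
// source-code attribution and per-hop telemetry. Names and fields
// are illustrative assumptions, not the paper's actual design.
#include <cstdint>
#include <random>
#include <vector>

struct TxTag {
    std::uint64_t tx_id;        // unique ID for a sampled transaction
    std::uint64_t callsite_ip;  // instruction pointer of the issuing call site
};

struct HopRecord {
    std::uint64_t tx_id;        // links the hop back to the sampled transaction
    std::uint32_t component_id; // NIC/switch that observed the tagged packet
    std::uint64_t timestamp_ns; // local time at this hop
};

// Sender side: sample roughly one in `period` transactions to bound overhead.
// Returns true and fills `tag` only for sampled transactions.
bool maybe_tag_transaction(std::uint64_t period, TxTag& tag,
                           std::uint64_t callsite_ip) {
    static thread_local std::mt19937_64 rng{std::random_device{}()};
    static thread_local std::uint64_t next_id = 0;
    if (period == 0 || rng() % period != 0) return false; // not sampled: no tag
    tag = TxTag{next_id++, callsite_ip};                  // tag travels with packet
    return true;
}

// Each hop (NIC, switch) appends a record for tagged packets only.
void record_hop(std::vector<HopRecord>& log, const TxTag& tag,
                std::uint32_t component_id, std::uint64_t now_ns) {
    log.push_back({tag.tx_id, component_id, now_ns});
}
```

Tagging only one in every `period` transactions keeps the per-packet cost negligible while still yielding enough tagged flows to reconstruct end-to-end paths, which mirrors the low-overhead, modest-data-volume goal stated in the abstract.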

References

  1. Oliker, L., Canning, A., Carter, J., Shalf, J., Ethier, S.: Scientific application performance on leading scalar and vector supercomputing platforms. Int. J. High Perform. Comput. Appl. 22(1), 5–20 (2006)

  2. Dongarra, J., Heroux, M.A.: Toward a new metric for ranking high performance computing systems. Sandia report, SAND2013-4744 312, p. 150 (2013)

  3. Egawa, R., Komatsu, K., Momose, S., Isobe, Y., Musa, A., Takizawa, H., Kobayashi, H.: Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE. J. Supercomput., March 2017

  4. Intel Inc.: Intel VTune. https://software.intel.com/en-us/intel-vtune-amplifier-xe

  5. Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: HPCToolkit: tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 22(6), 685–701 (2010)

  6. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurr. Comput. Pract. Exp. 22(6), 702–719 (2010)

  7. Shende, S.S., Malony, A.D.: The Tau parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–311 (2006)

  8. Oracle Inc.: Oracle Solaris Studio. http://www.oracle.com/technetwork/server-storage/solarisstudio/overview/index.html

  9. Intel Inc.: Intel Trace Analyzer and Collector, October 2017. https://software.intel.com/en-us/intel-trace-analyzer

  10. Allinea Inc.: Allinea MAP - C/C++ profiler and Fortran profiler for high performance Linux code, October 2017. https://www.allinea.com/products/map

  11. Liu, X., Mellor-Crummey, J.: A data-centric profiler for parallel programs. In: Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, vol. 28 (2013)

  12. Rane, A., Browne, J.: Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics. In: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, MN, USA. IEEE Computer Society (2012)

  13. Böhme, D., Geimer, M., Arnold, L., Voigtlaender, F., Wolf, F.: Identifying the root causes of wait states in large-scale parallel applications. ACM Trans. Parallel Comput. 3(2), 11:1–11:24 (2016)

  14. Isaacs, K.E., Gamblin, T., Bhatele, A., Schulz, M., Hamann, B., Bremer, P.T.: Ordering traces logically to identify lateness in message passing programs. IEEE Trans. Parallel Distrib. Syst. 27(3), 829–840 (2016)

  15. Weber, M., Brendel, R., Hilbrich, T., Mohror, K., Schulz, M., Brunst, H.: Structural clustering: a new approach to support performance analysis at scale. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 484–493, May 2016

  16. Isaacs, K.E., Giménez, A., Jusufi, I., Gamblin, T., Bhatele, A., Schulz, M., Hamann, B., Bremer, P.T.: State of the art of performance visualization. In: Borgo, R., Maciejewski, R., Viola, I. (eds.) EuroVis - STARs. The Eurographics Association (2014)

  17. Valiev, M., Bylaska, E., Govind, N., Kowalski, K., Straatsma, T., Dam, H.V., Wang, D., Nieplocha, J., Apra, E., Windus, T., de Jong, W.: NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Comput. Phys. Commun. 181(9), 1477–1489 (2010)

  18. Kim, J., Dally, W.J., Scott, S., Abts, D.: Technology-driven, highly-scalable dragonfly topology. In: Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA 2008, Washington, DC, USA, pp. 77–88. IEEE Computer Society (2008)

  19. Alverson, B., Kaplan, L., Roweth, D.: Cray XC Series Network. http://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf

  20. National Energy Research Scientific Computing Center: Edison. http://www.nersc.gov/users/computational-systems/edison/

  21. Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9(1), 21–65 (1991)

  22. Gen-Z Consortium: Gen-Z: Draft Core Specification, July 2017. http://genzconsortium.org/specifications/draft-core-specification-july-2017/

  23. Linux wiki: Linux perf tool. https://perf.wiki.kernel.org/index.php/Main_Page

  24. Zaki, O., Lusk, E., Gropp, W., Swider, D.: Toward scalable performance visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13(2), 277–288 (1999)

  25. Karrels, E., Lusk, E.: Performance analysis of MPI programs. In: Dongarra, J., Tourancheau, B. (eds.) Proceedings of the Workshop on Environments and Tools For Parallel Scientific Computing, pp. 195–200. SIAM Publications (1994)

  26. Knüpfer, A., Brunst, H., Doleschal, J., Jurenz, M., Lieber, M., Mickler, H., Müller, M.S., Nagel, W.E.: The Vampir performance analysis tool-set. In: Tools for High Performance Computing, pp. 139–155 (2008)

  27. McDonald, N.: SuperSim: a flexible event-driven cycle-accurate network simulator. https://github.com/HewlettPackard/supersim

  28. Carothers, C.: ROSS: Rensselaer’s Optimistic Simulation System. https://github.com/carothersc/ROSS/wiki

  29. Carothers, C.D., Bauer, D., Pearce, S.: ROSS: a high-performance, low memory, modular time warp system. In: Proceedings of the Fourteenth Workshop on Parallel and Distributed Simulation, PADS 2000, Washington, DC, USA, pp. 53–60. IEEE Computer Society (2000)

  30. Liu, N., Carothers, C., Cope, J., Carns, P., Ross, R.: Model and simulation of exascale communication networks. J. Simul. 6(4), 227–236 (2012)

  31. Jain, N., Bhatele, A., White, S., Gamblin, T., Kale, L.V.: Evaluating HPC networks via simulation of parallel workloads. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 154–165, November 2016

  32. Rodrigues, A.F., Hemmert, K.S., Barrett, B.W., Kersey, C., Oldfield, R., Weston, M., Risen, R., Cook, J., Rosenfeld, P., CooperBalls, E., Jacob, B.: The structural simulation toolkit. SIGMETRICS Perform. Eval. Rev. 38(4), 37–42 (2011)

  33. So-In, C.: A survey of network traffic monitoring and analysis tools. https://www.cse.wustl.edu/~jain/cse567-06/ftp/net_traffic_monitors3.pdf

  34. Cisco Inc.: Cisco IOS NetFlow. https://www.cisco.com/c/en/us/products/ios-nx-os-software/ios-netflow/index.html

  35. sFlow organization: sFlow. http://www.sflow.org/

  36. Hewlett Packard Labs: Network Performance Monitoring (NWPM) Tool. https://github.com/HewlettPackard/genz_tools_network_monitoring

  37. Plotly Technologies Inc.: Collaborative data science (2015). https://plot.ly

Acknowledgments

This work was supported (in part) by the US Department of Energy (DOE) under Cooperative Agreement DE-SC0012199, the Blackcomb 2 Project.

Author information

Corresponding author

Correspondence to Adarsh Yoga.

Appendices

A NWChem Profiles from HPCToolkit

Fig. 2.

CPU execution hotspot in NWChem running on NERSC Edison with 1024 MPI ranks, captured with the HPCToolkit [5] profiler. All MPI processes waste roughly 25% of their execution time waiting to acquire remote locks buried deep inside many layers of host code. The cause of this lock waiting, despite good load balance, remains unknown because CPU profiles do not capture the internals of networking hardware components.

B Profiles of NCAST Program

Fig. 3.

Visualization graphs generated for the NCAST program running with 4096 MPI ranks.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Yoga, A., Chabbi, M. (2018). Path-Synchronous Performance Monitoring in HPC Interconnection Networks with Source-Code Attribution. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2017. Lecture Notes in Computer Science, vol 10724. Springer, Cham. https://doi.org/10.1007/978-3-319-72971-8_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-72971-8_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72970-1

  • Online ISBN: 978-3-319-72971-8

  • eBook Packages: Computer Science (R0)
