Automatic Co-scheduling Based on Main Memory Bandwidth Usage

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10353)

Abstract

Most applications running on supercomputers achieve only a fraction of a system’s peak performance. It has been demonstrated that co-scheduling applications can improve overall system utilization. However, applications being co-scheduled need to fulfill certain criteria so that mutual slowdown is kept to a minimum. In this paper, we present a set of libraries and a first HPC scheduler prototype that automatically detects an application’s main memory bandwidth utilization and prevents the co-scheduling of multiple main-memory-bandwidth-limited applications. We demonstrate that our prototype achieves almost the same performance as the manually tuned co-schedules from our previous work.
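
The scheduling decision described in the abstract can be pictured with a short sketch. Everything below (names, the threshold value, and the measurement source) is an illustrative assumption, not the interface of the libraries or the scheduler prototype described in the paper: each application is profiled for its share of the node's main memory bandwidth, and two applications are co-scheduled only if at most one of them is bandwidth-limited.

```cpp
// Minimal sketch (not the authors' implementation): a co-scheduling
// admission check based on measured main memory bandwidth utilization.
#include <iostream>
#include <string>

struct AppProfile {
    std::string name;
    double mem_bw_share;  // fraction of the node's peak memory bandwidth used, 0.0..1.0
};

// An application is considered memory-bandwidth-limited if it uses more than
// a (hypothetical) threshold of the node's sustainable bandwidth.
constexpr double kBandwidthLimitedThreshold = 0.5;

bool is_bandwidth_limited(const AppProfile& app) {
    return app.mem_bw_share > kBandwidthLimitedThreshold;
}

// Allow co-scheduling only if at most one of the two applications is bandwidth-limited.
bool may_coschedule(const AppProfile& running, const AppProfile& candidate) {
    return !(is_bandwidth_limited(running) && is_bandwidth_limited(candidate));
}

int main() {
    AppProfile solver{"stencil-solver", 0.72};  // illustrative numbers only
    AppProfile blast{"mpiBLAST", 0.15};

    std::cout << std::boolalpha
              << may_coschedule(solver, blast) << '\n'    // true: only one app is bandwidth-bound
              << may_coschedule(solver, solver) << '\n';  // false: two bandwidth-bound apps
}
```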


Notes

  1. A node is one endpoint in the network topology of an HPC system. It consists of general-purpose processors with access to shared memory. Optionally, a node may be equipped with accelerators such as GPUs.

  2. https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt

  3. http://www.megware.com/

  4. http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI

  5. http://mpiblast.org/

  6. http://www.prace-ri.eu/

  7. https://github.com/jbreitbart/mpifast

  8. ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz

  9. http://sourceforge.net/p/libama/git/ci/43a7ed

  10. http://www.itp.uzh.ch/~teyssier/ramses/RAMSES.html

  11. https://www-ssl.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-technologies.html

  12. The Stack Reuse Distance, introduced in [8], is the distance to the previous access to the same memory cell, measured in the number of distinct memory cells accessed in between. For the first access to an address, the distance is infinity. A minimal code sketch of this computation follows these notes.

  13. https://github.com/lrr-tum/libdistgen

  14. https://github.com/lrr-tum/ponci

  15. https://www.docker.com/

  16. The theoretical minimum for distgen is about 33%, as distgen only reads from main memory, whereas the other half can issue both reads and writes.

  17. https://github.com/lrr-tum/poncos/tree/one-node-only

  18. http://www.fast-project.de/

  19. http://slurm.schedmd.com/
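
To make note 12 concrete, the following is a minimal, illustrative computation of stack reuse distances over a trace of memory addresses. The function name and the example trace are assumptions for illustration; this does not reproduce the measurement tooling used in the paper.

```cpp
#include <cstdint>
#include <iostream>
#include <iterator>
#include <limits>
#include <list>
#include <unordered_map>
#include <vector>

// Stack reuse distance of an access: the number of *distinct* addresses
// touched since the previous access to the same address; infinity for the
// first access to an address (see note 12 and reference [8]).
std::vector<std::size_t> reuse_distances(const std::vector<std::uint64_t>& trace) {
    constexpr std::size_t kInfinity = std::numeric_limits<std::size_t>::max();

    std::list<std::uint64_t> lru_stack;  // most recently used address at the front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> position;
    std::vector<std::size_t> distances;

    for (std::uint64_t addr : trace) {
        auto it = position.find(addr);
        if (it == position.end()) {
            distances.push_back(kInfinity);  // first access to this address
        } else {
            // Depth in the LRU stack = number of distinct addresses accessed in between.
            std::size_t depth = std::distance(lru_stack.begin(), it->second);
            distances.push_back(depth);
            lru_stack.erase(it->second);
        }
        lru_stack.push_front(addr);
        position[addr] = lru_stack.begin();
    }
    return distances;
}

int main() {
    // Example trace A B C A: the second access to A has distance 2,
    // because two distinct addresses (B and C) were touched in between.
    std::vector<std::uint64_t> trace = {0xA, 0xB, 0xC, 0xA};
    for (std::size_t d : reuse_distances(trace)) {
        if (d == std::numeric_limits<std::size_t>::max())
            std::cout << "inf ";
        else
            std::cout << d << ' ';
    }
    std::cout << '\n';  // prints: inf inf inf 2
}
```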

References

  1. Breitbart, J., Weidendorfer, J., Trinitis, C.: Case study on co-scheduling for HPC applications. In: 44th International Conference on Parallel Processing Workshops (ICPPW), pp. 277–285 (2015)

  2. Kraus, J., Förster, M., Brandes, T., Soddemann, T.: Using LAMA for efficient AMG on hybrid clusters. Comput. Sci. Res. Dev. 28(2–3), 211–220 (2013)

  3. Lin, H., Balaji, P., Poole, R., Sosa, C., Ma, X., Feng, W.-C.: Massively parallel genomic sequence search on the Blue Gene/P architecture. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, pp. 1–11. IEEE (2008)

  4. Teyssier, R.: Cosmological hydrodynamics with adaptive mesh refinement - a new high resolution code called RAMSES. Astron. Astrophys. 385(1), 337–364 (2002)

  5. Lavallée, P.-F., de Verdière, G.C., Wautelet, P., Lecas, D., Dupays, J.-M.: Porting and optimizing HYDRO to new platforms and programming paradigms - lessons learnt (2012). http://www.prace-project.eu/IMG/pdf/porting_and_optimizing_hydro_to_new_platforms.pdf

  6. Bertolacci, I.J., Olschanowsky, C., Harshbarger, B., Chamberlain, B.L., Wonnacott, D.G., Strout, M.M.: Parameterized diamond tiling for stencil computations with Chapel parallel iterators. In: Proceedings of the 29th ACM International Conference on Supercomputing, pp. 197–206. ACM (2015)

  7. Weidendorfer, J., Breitbart, J.: Detailed characterization of HPC applications for co-scheduling. In: Proceedings of the 1st COSH Workshop on Co-Scheduling of HPC Applications, p. 19, January 2016

  8. Bennett, B.T., Kruskal, V.J.: LRU stack processing. IBM J. Res. Dev. 19, 353–357 (1975)

  9. Klug, T., Ott, M., Weidendorfer, J., Trinitis, C.: Automated optimization of thread-to-core pinning on multicore systems. In: Stenström, P. (ed.) Transactions on High-Performance Embedded Architectures and Compilers III. LNCS, vol. 6590, pp. 219–235. Springer, Heidelberg (2011). doi:10.1007/978-3-642-19448-1_12

  10. Tsafack Chetsa, G.L., Lefèvre, L., Pierson, J.-M., Stolf, P., Da Costa, G.: Exploiting performance counters to predict and improve energy performance of HPC systems. Future Gener. Comput. Syst. 36, 287–298 (2014). https://hal.archives-ouvertes.fr/hal-01123831

  11. Wang, L., von Laszewski, G., Dayal, J., Wang, F.: Towards energy aware scheduling for precedence constrained parallel tasks in a cluster with DVFS. In: 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), pp. 368–377. IEEE (2010)

  12. Rountree, B., Lowenthal, D.K., de Supinski, B.R., Schulz, M., Freeh, V.W., Bletsch, T.: Adagio: making DVS practical for complex HPC applications. In: Proceedings of the 23rd International Conference on Supercomputing, ICS 2009, pp. 460–469. ACM, New York (2009). http://doi.acm.org/10.1145/1542275.1542340

  13. Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., Snelting, G.: Invasive computing: an overview. In: Hübner, M., Becker, J. (eds.) Multiprocessor System-on-Chip, pp. 241–268. Springer, New York (2011)

  14. Schreiber, M., Riesinger, C., Neckel, T., Bungartz, H.-J., Breuer, A.: Invasive compute balancing for applications with shared and hybrid parallelization. Int. J. Parallel Program. 1–24 (2014)

  15. Auweter, A., Bode, A., Brehm, M., Huber, H., Kranzlmüller, D.: Principles of energy efficiency in high performance computing. In: Kranzlmüller, D., Toja, A.M. (eds.) ICT-GLOW 2011. LNCS, vol. 6868, pp. 18–25. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23447-7_3

  16. de Blanche, A., Lundqvist, T.: Addressing characterization methods for memory contention aware co-scheduling. J. Supercomput. 71(4), 1451–1483 (2015)

  17. Eklov, D., Nikoleris, N., Black-Schaffer, D., Hagersten, E.: Bandwidth bandit: quantitative characterization of memory contention. In: 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–10 (2013)

  18. Mars, J., Vachharajani, N., Hundt, R., Soffa, M.L.: Contention aware execution: online contention detection and response. In: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2010, pp. 257–265. ACM, New York (2010)


Acknowledgments

We want to thank MEGWARE, who provided us with a Clustsafe to measure energy consumption. The work presented in this paper was funded by the German Ministry of Education and Science as part of the FAST project (funding code 01IH11007A).

Author information

Corresponding author

Correspondence to Jens Breitbart.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Breitbart, J., Weidendorfer, J., Trinitis, C. (2017). Automatic Co-scheduling Based on Main Memory Bandwidth Usage. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_8


  • DOI: https://doi.org/10.1007/978-3-319-61756-5_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61755-8

  • Online ISBN: 978-3-319-61756-5

  • eBook Packages: Computer Science, Computer Science (R0)
