TCU: A Multi-Objective Hardware Thread Mapping Unit for HPC Clusters

Pujari, Ravi Kumar; Wild, Thomas; Herkersdorf, Andreas

doi:10.1007/978-3-319-41321-1_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9697)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Thread scheduling on HPC and manycore platforms must often meet several partially orthogonal optimization targets at once, such as maximizing CPU performance, meeting the deadlines of time-critical tasks, minimizing power, and securing thermal resilience. Doing so is a major challenge because of the associated scalability and thread management overhead. We tackle this challenge by introducing the Thread Control Unit (TCU), a configurable, low-latency, low-overhead hardware thread mapper for the compute nodes of an HPC cluster. The TCU takes various sensor information into account and can map threads to the 4–16 CPUs of a compute node within a small, bounded number of clock cycles, using a round-robin, single-objective, or multi-objective policy. Beyond load balancing and performance criteria, the TCU design can also consider physical constraints such as temperature limits, power budgets, and reliability aspects. Evaluations of different mapping policies show that multi-objective thread mapping reduces mapping latency by about 10 to 40 % for periodic workloads compared to single-objective or round-robin policies. For bursty workloads under high load conditions, a 20 % reduction is achieved.
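The three policy classes named in the abstract can be illustrated in software. The following is a minimal sketch, not the TCU's hardware algorithm: the `CoreState` fields, the weighted-sum cost, the weights, and the temperature and power constants are all assumptions chosen for illustration, since the abstract does not specify how the objectives are combined.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class CoreState:
    """Hypothetical per-core sensor snapshot (field names are assumptions)."""
    load: float         # normalized utilization, 0.0 to 1.0
    temperature: float  # degrees Celsius
    power: float        # watts


def round_robin_mapper(num_cores):
    """Return a picker that cycles through core indices and ignores sensors."""
    counter = count()

    def pick(cores):
        return next(counter) % num_cores

    return pick


def single_objective(cores):
    """Pick the least-loaded core (one criterion only)."""
    return min(range(len(cores)), key=lambda i: cores[i].load)


def multi_objective(cores, weights=(0.5, 0.3, 0.2),
                    temp_limit=85.0, max_power=10.0):
    """Pick the core with the lowest weighted cost over load, temperature
    and power. Cores at or above temp_limit are excluded when possible.
    The weighted sum and every constant here are illustrative assumptions."""
    w_load, w_temp, w_power = weights
    candidates = [i for i, c in enumerate(cores) if c.temperature < temp_limit]
    if not candidates:
        # Every core is over the limit: fall back to considering all of them.
        candidates = list(range(len(cores)))

    def cost(i):
        c = cores[i]
        return (w_load * c.load
                + w_temp * c.temperature / temp_limit
                + w_power * c.power / max_power)

    return min(candidates, key=cost)


# Made-up sensor readings for a quad-core node.
cores = [
    CoreState(load=0.20, temperature=90.0, power=8.0),
    CoreState(load=0.30, temperature=60.0, power=4.0),
    CoreState(load=0.25, temperature=80.0, power=9.0),
    CoreState(load=0.90, temperature=50.0, power=3.0),
]

rr = round_robin_mapper(len(cores))
print([rr(cores) for _ in range(5)])  # cycles through the core indices
print(single_objective(cores))        # least-loaded core, even if it is hot
print(multi_objective(cores))         # avoids the core above the limit
```

With these made-up readings, the single-objective policy selects the least-loaded core even though it exceeds the assumed temperature limit, whereas the multi-objective policy skips it. This is the kind of trade-off that a mapper considering physical constraints is meant to handle.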

The TCU macro incurs a hardware area overhead of only 9 % and achieves more than 150 k thread mappings per second on an FPGA prototype of a RISC quad-core compute node operating at a moderate 50 MHz. A 45 nm ASIC realization of the TCU can operate well above 1 GHz and support up to 3.15 million thread mappings per second.
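The throughput figures above can be turned into a rough per-mapping cycle budget. The sketch below is a back-of-the-envelope calculation only. The function names are hypothetical, and the assumption that the ASIC needs the same number of cycles per mapping as the FPGA prototype is ours, not a statement from the paper.

```python
def cycles_per_mapping(clock_hz, mappings_per_second):
    """Average clock cycles available per thread mapping."""
    return clock_hz / mappings_per_second


def implied_clock_hz(mappings_per_second, cycles):
    """Clock frequency needed to sustain a mapping rate at a given
    cycle cost per mapping."""
    return mappings_per_second * cycles


# FPGA prototype: 50 MHz and "more than 150 k" mappings per second,
# so this is an upper bound on the average cycle budget per mapping.
fpga_cycles = cycles_per_mapping(50e6, 150e3)
print(f"FPGA: at most about {fpga_cycles:.0f} cycles per mapping")

# ASIC: 3.15 million mappings per second. ASSUMPTION: same cycle cost
# per mapping as on the FPGA prototype.
asic_clock = implied_clock_hz(3.15e6, fpga_cycles)
print(f"ASIC: implied clock of about {asic_clock / 1e9:.2f} GHz")
```

Under that assumption, the two throughput figures are mutually consistent with an ASIC clock slightly above 1 GHz.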

Acknowledgement

This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89).

Author information

Authors and Affiliations

Institute for Integrated Systems, Technische Universität München, Munich, Germany
Ravi Kumar Pujari, Thomas Wild & Andreas Herkersdorf

Corresponding author

Correspondence to Ravi Kumar Pujari.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum, Hamburg, Germany
Julian M. Kunkel
Argonne National Laboratory, Lemont, Illinois, USA
Pavan Balaji
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra

About this paper

Cite this paper

Pujari, R.K., Wild, T., Herkersdorf, A. (2016). TCU: A Multi-Objective Hardware Thread Mapping Unit for HPC Clusters. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_3

DOI: https://doi.org/10.1007/978-3-319-41321-1_3
Published: 15 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer Science, Computer Science (R0)
