Design space exploration of hardware task superscalar architecture

Yazdanpanah, Fahimeh; Alaei, Mohammad

doi:10.1007/s11227-015-1449-1

Published: 10 June 2015

The Journal of Supercomputing, Volume 71, pages 3567–3592 (2015)

Abstract

For current high-performance computing systems, exploiting concurrency is a serious and important challenge. Recently, several dynamic software task management mechanisms have been proposed, in particular task-based dataflow programming models, which apply dataflow principles to improve task-level parallelism and overcome the limitations of static task management systems. However, these programming models rely on software-based dependency analysis, which is inherently slow; this limits their scalability, especially when task granularity is fine and the number of tasks is large. Moreover, task scheduling in software introduces overheads and so becomes increasingly inefficient as the number of cores grows. In contrast, a hardware scheduling solution such as Task SuperScalar (TSS) can achieve greater speed-up because a hardware task scheduler requires fewer cycles than its software counterpart to dispatch a task. TSS combines the effectiveness of out-of-order processors with the task abstraction. It has previously been implemented in software, with limited parallelism and high memory consumption due to the nature of the software implementation. Hardware Task Superscalar (HTSS) is proposed to overcome these drawbacks. HTSS is designed to be integrated into a future high-performance computer with the ability to exploit fine-grained task parallelism. In this article, an in-depth latency and design space exploration of HTSS is described. For the design space exploration, we have designed a full cycle-accurate simulator of HTSS, called SimTSS. The simulator has been tuned using the component latencies obtained from a VHDL description of each HTSS component. As a result of this exploration, we have determined the number of components and the memory capacity of HTSS appropriate for HPC systems.
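To make the cost of software-based dependency analysis concrete, the following is a minimal illustrative sketch of how a task-based dataflow runtime might derive task dependencies from declared inputs and outputs. It is not taken from the article and does not reproduce the TSS or HTSS design: the function name `build_task_graph`, the tuple layout of a task, and the use of plain strings as data addresses are all assumptions made for illustration. The sketch tracks read-after-write, write-after-write, and write-after-read dependencies and does not model any renaming of outputs.

```python
from collections import defaultdict


def build_task_graph(tasks):
    """Derive predecessor sets from declared task inputs and outputs.

    tasks: list of (name, inputs, outputs) in program order (hypothetical
    layout). Returns a dict mapping each task name to the set of task names
    it must wait for.
    """
    last_writer = {}                          # address -> last task that wrote it
    readers_since_write = defaultdict(list)   # address -> readers since that write
    deps = {}

    for name, inputs, outputs in tasks:
        preds = set()
        for addr in inputs:
            if addr in last_writer:
                preds.add(last_writer[addr])          # read-after-write
        for addr in outputs:
            if addr in last_writer:
                preds.add(last_writer[addr])          # write-after-write
            preds.update(readers_since_write[addr])   # write-after-read
        deps[name] = preds

        # Record this task's accesses for the tasks that follow it.
        for addr in inputs:
            readers_since_write[addr].append(name)
        for addr in outputs:
            last_writer[addr] = name
            readers_since_write[addr] = []

    return deps


if __name__ == "__main__":
    example = [
        ("t1", [], ["a"]),
        ("t2", ["a"], ["b"]),
        ("t3", ["a"], ["c"]),
        ("t4", ["b", "c"], ["a"]),
    ]
    for task, preds in build_task_graph(example).items():
        print(task, sorted(preds))
```

In this example `t2` and `t3` depend only on `t1` and can run in parallel, while `t4` must wait for all three earlier tasks. Every task creation pays for these table lookups and updates, which is the kind of per-task bookkeeping the abstract describes as limiting scalability when tasks are fine-grained and numerous.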

Author information

Authors and Affiliations

Computer Engineering Department, Faculty of Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
Fahimeh Yazdanpanah & Mohammad Alaei

Corresponding author

Correspondence to Fahimeh Yazdanpanah.

About this article

Cite this article

Yazdanpanah, F., Alaei, M. Design space exploration of hardware task superscalar architecture. J Supercomput 71, 3567–3592 (2015). https://doi.org/10.1007/s11227-015-1449-1

Published: 10 June 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s11227-015-1449-1
