Toward a Core Design to Distribute an Execution on a Manycore Processor

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9251)

Abstract

This paper presents a parallel execution model and a core design to run C programs in parallel. The model automatically builds parallel flows of machine instructions from the run trace. It parallelizes instruction fetch, renaming, execution and retirement. Predictor based fetch is replaced by a fetch-decode-and-partly-execute stage able to compute in-order most of the control instructions. Tomasulo’s register renaming is extended to memory with a technique to match consumer/producer pairs. The Reorder Buffer is adapted to parallel retirement. A sum reduction code is used to illustrate the model and to give a short analytical evaluation of its performance potential.
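
The paper's running example is a sum reduction; its exact code (Fig. 5) is not reproduced on this page. As a rough point of reference only, the following is a minimal C sketch of such a reduction: the array name t, its five elements, and the local variable temp are borrowed from the notes below, while the divide-and-conquer shape is an assumption about how the model's forked sections might split the work.

    #include <stdio.h>

    /* Hypothetical sum reduction sketch. Each recursive call gets its
       own stack frame and its own temp, which is what the memory
       renaming of note 5 exploits to let sections update their local
       variables in parallel. */
    static long t[5] = {1, 2, 3, 4, 5};    /* assumed values */

    static long sum(int lo, int hi) {
        long temp;                         /* local to each frame */
        if (lo == hi)
            return t[lo];
        int mid = (lo + hi) / 2;
        temp = sum(lo, mid);               /* a plausible fork point */
        return temp + sum(mid + 1, hi);
    }

    int main(void) {
        printf("%ld\n", sum(0, 4));        /* prints 15 */
        return 0;
    }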


Notes

  1. The stack in each section keeps its local variables, e.g., the variable temp in Fig. 5.

  2. The "good" model has a 2K-instruction window, issues 64 instructions per cycle, and uses 256 renaming registers, a branch predictor based on an infinite number of 2-bit counters, and perfect memory aliasing disambiguation.

  3. The "perfect" model extends the "good" model with infinite renaming registers and a perfect branch predictor.

  4. The choice of hosting core to optimize load balancing is outside the scope of this paper.

  5. Memory renaming duplicates stack frames that map to the same addresses. This lets multiple sections update the local variables in their own frames in parallel.

  6. In the sum example, all the conditional branches are computed in the fetch stage, which parallelizes fetch by reaching the fork instructions quickly.

  7. Stores update full lines. The loader sets a cleared line and loops to update it successively with t[0] up to t[4]; a C sketch of this loop is given after these notes. The full line, right-padded with zeros, is exported to its first consumer, i.e., section 1. Sections 2 and 3 get section 1's cached copy.

  8. The oldest section, i.e., the only one with no predecessor, dumps its renamings to the data memory hierarchy (DMH). When it receives a renaming request that misses, it loads the line from the DMH and exports it.

  9. The 15 cycles decompose as the fetch time of instructions 2, 3, and 8-10 of Fig. 5 (5 cycles), plus the creation time of the forked section (2 cycles), plus the fetch time of instructions 11-16 (5 cycles), plus the retirement time of instructions 17-19 (3 cycles): 5 + 2 + 5 + 3 = 15.
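
The loader of note 7 can be pictured as a short C loop. This is a sketch under assumptions: the array name t and the stored values are carried over from the sketch after the abstract, and the cleared-line and line-export behavior belongs to the hardware model, not to the C code itself.

    /* Hypothetical loader sketch for note 7: five successive stores
       that fall in one cache line. In the model the line starts
       cleared, each store updates it in place, and once t[4] is
       written the full line (right-padded with zeros) is exported to
       its first consumer, section 1. */
    long t[5];

    void load_t(void) {
        for (int i = 0; i < 5; i++)
            t[i] = i + 1;                  /* assumed values */
    }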

References

  1. Shun, J., Blelloch, G.E., Fineman, J.T., Gibbons, P.B., Kyrola, A., Simhadri, H.V., Tangwongsan, K.: Brief announcement: the problem based benchmark suite. In: Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2012, pp. 68–70 (2012)

  2. Wall, D.W.: Limits of instruction-level parallelism. In: WRL Technical Note TN-15 (1990)

  3. Tomasulo, R.M.: An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. Dev. 11, 25–33 (1967)

  4. Tjaden, G.S., Flynn, M.J.: Detection and parallel execution of independent instructions. IEEE Trans. Comput. 19, 889–895 (1970)

  5. Nicolau, A., Fisher, J.: Measuring the parallelism available for very long instruction word architectures. IEEE Trans. Comput. C–33, 968–976 (1984)

  6. Austin, T.M., Sohi, G.S.: Dynamic dependency analysis of ordinary programs. In: Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA 1992, pp. 342–351 (1992)

  7. Lam, M.S., Wilson, R.P.: Limits of control flow on parallelism. In: Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA 1992, pp. 46–57 (1992)

  8. Moshovos, A., Breach, S.E., Vijaykumar, T.N., Sohi, G.S.: Dynamic speculation and synchronization of data dependences. In: Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA 1997, pp. 181–193 (1997)

  9. Postiff, M.A., Greene, D.A., Tyson, G.S., Mudge, T.N.: The limits of instruction level parallelism in SPEC95 applications. ACM SIGARCH Comput. Archit. News 27, 31–34 (1999)

  10. Cristal, A., Santana, O.J., Valero, M., Martínez, J.F.: Toward kilo-instruction processors. ACM Trans. Archit. Code Optim. 1, 389–417 (2004)

  11. Sharafeddine, M., Jothi, K., Akkary, H.: Disjoint out-of-order execution processor. ACM Trans. Archit. Code Optim. (TACO) 9, 19:1–19:32 (2012)

  12. Goossens, B., Parello, D.: Limits of instruction-level parallelism capture. Procedia Comput. Sci. 18, 1664–1673 (2013). 2013 International Conference on Computational Science


Author information

Correspondence to Bernard Goossens.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Goossens, B., Parello, D., Porada, K., Rahmoune, D. (2015). Toward a Core Design to Distribute an Execution on a Manycore Processor. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2015. Lecture Notes in Computer Science (LNTCS), vol. 9251. Springer, Cham. https://doi.org/10.1007/978-3-319-21909-7_38

  • DOI: https://doi.org/10.1007/978-3-319-21909-7_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21908-0

  • Online ISBN: 978-3-319-21909-7

  • eBook Packages: Computer Science; Computer Science (R0)
