Buffer Sizing for Self-timed Stream Programs on Heterogeneous Distributed Memory Multiprocessors
Stream programming is a promising way to expose concurrency to the compiler. A stream program is built from kernels that communicate only via point-to-point streams. The stream compiler statically allocates these kernels to processors, applying blocking, fission and fusion transformations. The compiler determines the sizes of the communication buffers, which affects performance since local memories can be small.
In this paper, we propose a feedback-directed algorithm that determines the size of each communication buffer, based on i) the stream program that has been mapped onto processors, ii) feedback from an earlier execution, and iii) the memory constraints. The algorithm exposes a trade-off between throughput and latency. It is general, in that it applies to stream programs with unstructured stream graphs, and it supports variable execution times and communication rates.
We show results for the StreamIt benchmarks and random graphs. For the StreamIt benchmarks, throughput is optimal after the first iteration. For random graphs with stochastic computation times, throughput is within 3% of optimal after four iterations. Compared with the previous general algorithm, by Basten and Hoogerbrugge, our algorithm has significantly better performance and latency.
KeywordsRandom Graph Communication Rate Pipeline Stage Level Algorithm Baseline Algorithm
Unable to display preview. Download preview PDF.
- 2.Pham, D., Behnen, E., Bolliger, M., Hofstee, H.: et al.: The design methodology and implementation of a first-generation Cell processor: a multi-core SoC. In: Custom Integrated Circuits Conference 2005, pp. 45–49 (2005)Google Scholar
- 3.Kudlur, M., Mahlke, S.: Orchestrating the execution of stream programs on multicore platforms. In: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, pp. 114–124 (2008)Google Scholar
- 4.Choi, Y., Lin, Y., Chong, N., Mahlke, S., Mudge, T.: Stream Compilation for Real-Time Embedded Multicore Systems. In: Proceedings of the 2009 International Symposium on Code Generation and Optimization, vol. 00, pp. 210–220 (2009)Google Scholar
- 5.IST-034869, A.: Advanced Compiler Technologies for Embedded Streaming, http://www.hitech-projects.com/euprojects/ACOTES/
- 6.ACOTES: IST ACOTES Project Deliverable D2.2 Report on Streaming Programming Model and Abstract Streaming Machine Description Final Version (2008)Google Scholar
- 8.Hofstee, H.P.: Power efficient processor architecture and the cell processor, pp. 258–262. IEEE Computer Society, Los Alamitos (2005)Google Scholar
- 9.Parks, T.: Bounded scheduling of process networks. PhD thesis, University of California (1995)Google Scholar
- 10.Buck, J.: Scheduling dynamic dataflow graphs with bounded memory using the token flow model. PhD thesis, University of California (1993)Google Scholar
- 11.Geilen, M., Basten, T.: Requirements on the execution of Kahn process networks. LNCS, pp. 319–334. Springer, Heidelberg (2003)Google Scholar
- 12.van der Wolf, P., de Kock, E., Henriksson, T., Kruijtzer, W., Essink, G.: Design and programming of embedded multiprocessors: an interface-centric approach. In: Proceedings of the 2nd international conference on Hardware/software codesign and system synthesis, pp. 206–217 (2004)Google Scholar
- 13.Carpenter, P.M., Ramirez, A., Ayguade, E.: The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors. In: SAMOS Workshop 2009, pp. 12–23. Springer, Heidelberg (2009)Google Scholar
- 16.Govindarajan, R., Gao, G.: A novel framework for multi-rate scheduling in DSP applications. In: International Conference on Application-Specific Array Processors, pp. 77–88 (1993)Google Scholar
- 18.Lee, E.A.: A coupled hardware and software architecture for programmable digital signal processors (synchronous data flow). PhD thesis (1986)Google Scholar
- 20.Pollack, M.: The maximum capacity through a network. Operations Research, 733–736 (1960)Google Scholar
- 23.Basten, T., Hoogerbrugge, J.: Efficient execution of process networks. Communicating Process Architectures (2001)Google Scholar
- 24.Gordon, M., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. ASPLOS, 151–162 (2006)Google Scholar
- 25.Carpenter, P.M., Ramirez, A., Ayguade, E.: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems. In: CASES 2009, October 11-16 (2009)Google Scholar
- 26.Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 4th edn. Morgan Kaufmann, San Francisco (2007)Google Scholar
- 27.Stuijk, S., Geilen, M., Basten, T.: Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs. In: Proceedings of the 43rd annual conference on Design automation, pp. 899–904 (2006)Google Scholar