Stream processors, like other multicore architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. They are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather–compute–scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control logic, which fetches medium- and coarse-grained instructions and executes them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
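The gather–compute–scatter form described above can be illustrated with a minimal software sketch. It is not taken from the chapter: the function names (`gather`, `scatter`, `run_stream`), the list standing in for off-chip memory, and the `block` parameter standing in for on-chip buffer capacity are all hypothetical, chosen only to show how software, rather than a cache, decides what data is moved and when.

```python
from typing import Callable, List, Sequence


def gather(memory: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Bulk-copy the addressed elements from 'off-chip' memory into a local buffer."""
    return [memory[i] for i in indices]


def scatter(memory: List[float], indices: Sequence[int], values: Sequence[float]) -> None:
    """Bulk-write a local buffer back to the addressed 'off-chip' locations."""
    for i, v in zip(indices, values):
        memory[i] = v


def run_stream(
    memory: List[float],
    in_idx: Sequence[int],
    out_idx: Sequence[int],
    kernel: Callable[[float], float],
    block: int = 4,
) -> None:
    """Process a stream in blocks: gather, compute on local data only, scatter.

    `block` is an assumed stand-in for the capacity of on-chip storage.
    """
    for start in range(0, len(in_idx), block):
        # Gather: one coarse-grained transfer into the local buffer.
        local_in = gather(memory, in_idx[start:start + block])
        # Compute: the kernel touches only local data; each element is independent.
        local_out = [kernel(x) for x in local_in]
        # Scatter: one coarse-grained transfer back to memory.
        scatter(memory, out_idx[start:start + block], local_out)


if __name__ == "__main__":
    # Six inputs followed by six output slots in the same flat memory.
    mem = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] + [0.0] * 6
    run_stream(mem, in_idx=range(0, 6), out_idx=range(6, 12), kernel=lambda x: x * x)
    print(mem[6:])
```

In this sketch every transfer is requested explicitly by the program, and the kernel never reads or writes `memory` directly, which mirrors the division of responsibility the abstract attributes to the stream execution model.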
Keywords: Memory System, Sequencer Group, Stream Processor, Stream Unit, Stream Load
Acknowledgments
We would like to thank Steve Keckler for his insightful comments, and Jung Ho Ahn, Nuwan Jayasena, and Brucek Khailany for their contributions. In addition, we are grateful to the entire Imagine and Merrimac teams and the projects’ sponsors.
Imagine was supported by a Sony Stanford Graduate Fellowship, an Intel Foundation Fellowship, the Defense Advanced Research Projects Agency under ARPA order E254 and monitored by the Army Intelligence Center under contract DABT63-96-C0037 and by ARPA order L172 monitored by the Department of the Air Force under contract F29601-00-2-0085.
The Merrimac Project was supported by the Department of Energy ASCI Alliances Program, Contract LLNL-B523583, with Stanford University as well as the NVIDIA Graduate Fellowship program.
U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, “Programmable Stream Processors,” IEEE Computer, August 2003 (©2003 IEEE).
J. H. Ahn, W. J. Dally, B. K. Khailany, U. J. Kapasi, and A. Das, “Evaluating the Imagine stream architecture,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (© 2004 IEEE).
Stream Processors Inc., “Stream Processing: Enabling the New Generation of Easy to Use, High-Performance DSPs,” White Paper (© 2007 Stream Processors Inc.).
B. K. Khailany, T. Williams, J. Lin, E. P. Long, M. Rygh, D. W. Tovey, and W. J. Dally, “A Programmable 512 GOPS Stream Processor for Signal, Image, and Video Processing,” Solid-State Circuits, IEEE Journal, 43(1):202–213, 2008 (© 2008 IEEE).
J. H. Ahn, M. Erez, and W. J. Dally, “Tradeoff between Data-, Instruction-, and Thread-level Parallelism in Stream Processors,” in Proceedings of the 21st ACM International Conference on Supercomputing (ICS’07), June 2007 (DOI 10.1145/1274971.1274991). © 2007 ACM, Inc. Included here by permission.