Stream Processors

  • Mattan Erez
  • William J. Dally
Chapter
Part of the Integrated Circuits and Systems book series (ICIR)

Abstract

Stream processors, like other multicore architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. They are designed specifically for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather–compute–scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control logic, which only fetches medium- and coarse-grained instructions and executes them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
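
The gather–compute–scatter form described in the abstract can be illustrated with a minimal sketch. This is not code from the chapter: the function names, the block size, and the use of a Python list to stand in for on-chip memory are all assumptions made for illustration. The point it tries to show is that the program, not a hardware cache, decides when data moves between the large memory and the local buffer, and that the kernel touches only local data.

```python
# Illustrative sketch only: names and sizes are hypothetical, and a Python
# list stands in for software-managed on-chip storage.

def gather(memory, indices):
    """Bulk-load the addressed elements into a local buffer."""
    return [memory[i] for i in indices]


def compute(block):
    """Kernel: reads and writes only the local buffer, one independent
    operation per element (so the elements could run in parallel)."""
    return [2.0 * x + 1.0 for x in block]


def scatter(memory, indices, block):
    """Bulk-store the results from the local buffer back to memory."""
    for i, value in zip(indices, block):
        memory[i] = value


def run_stream(memory, indices, block_size):
    """Process the indexed elements one coarse-grained block at a time."""
    for start in range(0, len(indices), block_size):
        chunk = indices[start:start + block_size]
        local = gather(memory, chunk)   # explicit transfer in
        local = compute(local)          # compute on local data only
        scatter(memory, chunk, local)   # explicit transfer out


memory = [float(i) for i in range(8)]
run_stream(memory, indices=[0, 2, 4, 6], block_size=2)
print(memory)  # [1.0, 1.0, 5.0, 3.0, 9.0, 5.0, 13.0, 7.0]
```

In this sketch each block is gathered, processed, and scattered in sequence. On a real stream processor the transfers for one block would typically overlap with the computation on another, which is one way the latency tolerance mentioned above is handled in software.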

Keywords

Memory System; Sequencer Group; Stream Processor; Stream Unit; Stream Load

Notes

Acknowledgments

We would like to thank Steve Keckler for his insightful comments as well as the contributions of Jung Ho Ahn, Nuwan Jayasena, and Brucek Khailany. In addition, we are grateful to the entire Imagine and Merrimac teams and the projects’ sponsors.

Imagine was supported by a Sony Stanford Graduate Fellowship, an Intel Foundation Fellowship, the Defense Advanced Research Projects Agency under ARPA order E254 and monitored by the Army Intelligence Center under contract DABT63-96-C0037 and by ARPA order L172 monitored by the Department of the Air Force under contract F29601-00-2-0085.

The Merrimac Project was supported by the Department of Energy ASCI Alliances Program, Contract LLNL-B523583, with Stanford University as well as the NVIDIA Graduate Fellowship program.

Portions of this chapter are reprinted with permission from the following sources:
  • U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, “Programmable Stream Processors,” IEEE Computer, August 2003 (© 2003 IEEE).

  • J. H. Ahn, W. J. Dally, B. K. Khailany, U. J. Kapasi, and A. Das, “Evaluating the Imagine Stream Architecture,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (© 2004 IEEE).

  • Stream Processors Inc., “Stream Processing: Enabling the New Generation of Easy to Use, High-Performance DSPs,” White Paper (© 2007 Stream Processors Inc.).

  • B. K. Khailany, T. Williams, J. Lin, E. P. Long, M. Rygh, D. W. Tovey, and W. J. Dally, “A Programmable 512 GOPS Stream Processor for Signal, Image, and Video Processing,” Solid-State Circuits, IEEE Journal, 43(1):202–213, 2008 (© 2008 IEEE).

  • J. H. Ahn, M. Erez, and W. J. Dally, “Tradeoff between Data-, Instruction-, and Thread-level Parallelism in Stream Processors,” in Proceedings of the 21st ACM International Conference on Supercomputing (ICS’07), June 2007 (DOI 10.1145/1274971.1274991). © 2007 ACM, Inc. Included here by permission.

Copyright information

© Springer-Verlag US 2009

Authors and Affiliations

  1. The University of Texas at Austin, Austin, USA
  2. Stanford University, Stanford, USA
