Mapping Streaming Languages to General Purpose Processors through Vectorization

  • Raymond Manley
  • David Gregg
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5898)


Streaming languages were originally aimed at streaming architectures, but recent work has shown that the stream programming model is also useful for exploiting parallelism on general purpose processors (GPPs). Current research on mapping stream code onto GPPs deals with load balancing and generating threads based on hardware features. We address problems associated with stream data locality and stream data parallelism on GPPs, and suggest that automatically generating vectorized code for streaming operations is a potential solution. We use the Brook stream language as our syntax base and augment it to generate vector intrinsics targeting the x86 architecture. Our compiler uses both existing and new strategies to transform high-level streaming kernel code into vector instructions without requiring additional annotations. We compare our system’s results to existing mapping strategies aimed at using stream code on GPPs. In our performance evaluation, we observe speedups ranging from a few percent to over 2x, and we discuss how effective vector code is relative to scalar equivalents in specific application domains.


Keywords: vectorization, streaming languages, optimization




Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Raymond Manley (1)
  • David Gregg (1)
  1. Trinity College Dublin, Dublin, Ireland
