Data Pipeline Optimization for Shared Memory Multiple-SIMD Architecture

  • Weihua Zhang
  • Tao Bao
  • Binyu Zang
  • Chuanqi Zhu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4382)


The rapid growth of multimedia applications puts heavy pressure on the processing capability of modern processors, and a growing number of multimedia processors therefore employ parallel single instruction multiple data (SIMD) units to achieve high performance. In embedded systems-on-chip (SoCs), the shared-memory multiple-SIMD architecture has become popular because of its lower power consumption and smaller chip size. To match the properties of some multimedia applications, the SIMD units are also connected to one another by interconnections. In this paper, we present a novel program transformation technique that exploits the parallel and pipelined computing power of modern shared-memory multiple-SIMD architectures. This optimization greatly reduces contention on the shared data bus and improves the performance of applications with inherent data pipeline characteristics. Experimental results show that, on a shared-memory multiple-SIMD architecture with 8 SIMD units, the method achieves more than 3.6X speedup for the multimedia programs.
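The bus-contention argument above can be illustrated with a rough counting model. The sketch below is not the transformation from the paper; it is a toy model under stated assumptions: a kernel made of several stages, a stage-by-stage mapping in which every stage loads its input block from shared memory and stores its result back, and a pipelined mapping in which intermediate blocks are forwarded between SIMD units over the interconnections and are assumed not to touch the shared bus. All function and parameter names are hypothetical. The last helper only restates the reported figure (more than 3.6X on 8 units) as a lower bound on parallel efficiency.

```python
def bus_transfers_stagewise(num_blocks: int, num_stages: int) -> int:
    """Shared-bus block transfers when every stage loads from and stores to
    shared memory (assumed baseline mapping, one load + one store per stage)."""
    if num_blocks < 0 or num_stages < 1:
        raise ValueError("need num_blocks >= 0 and num_stages >= 1")
    return 2 * num_blocks * num_stages


def bus_transfers_pipelined(num_blocks: int, num_stages: int) -> int:
    """Shared-bus block transfers when only the first stage loads and only the
    last stage stores; intermediate blocks are assumed to travel over the
    unit-to-unit interconnections instead of the shared bus."""
    if num_blocks < 0 or num_stages < 1:
        raise ValueError("need num_blocks >= 0 and num_stages >= 1")
    return 2 * num_blocks


def parallel_efficiency(speedup: float, num_units: int) -> float:
    """Speedup divided by the number of SIMD units."""
    if num_units < 1:
        raise ValueError("need num_units >= 1")
    return speedup / num_units


if __name__ == "__main__":
    blocks, stages = 100, 8  # illustrative sizes, not taken from the paper
    print("stage-by-stage bus transfers:", bus_transfers_stagewise(blocks, stages))
    print("pipelined bus transfers:     ", bus_transfers_pipelined(blocks, stages))
    print("efficiency lower bound:      ", parallel_efficiency(3.6, 8))
```

In this model the pipelined mapping cuts shared-bus traffic by a factor equal to the number of stages. Real gains depend on stage balance, interconnect latency, and how the bus arbitrates requests, none of which the model captures.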







Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Weihua Zhang (1, 2)
  • Tao Bao (1)
  • Binyu Zang (1)
  • Chuanqi Zhu (1)
  1. Parallel Processing Institute, Fudan University, Shanghai, China
  2. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
