Custom parallel caching schemes for hardware-accelerated image compression

  • Su-Shin Ang
  • George A. Constantinides
  • Wayne Luk
  • Peter Y. K. Cheung
Original Research Paper

Abstract

In an effort to achieve lower bandwidth requirements, video compression algorithms have become increasingly complex. Consequently, deploying these algorithms on field-programmable gate arrays (FPGAs) is becoming increasingly desirable, because of the computational parallelism available on these platforms and the flexibility they afford designers. Typically, video data are stored in large, slow external memory arrays, but the impact of the memory access bottleneck may be reduced by buffering frequently used data in fast on-chip memories. In many compression algorithms, the order of the memory accesses depends on the input data (Jain in Proceedings of the IEEE, pp. 349–389, 1981). These data-dependent memory accesses complicate the exploitation of data re-use and consequently reduce the extent to which an application may be accelerated. In this paper, we present a hybrid memory sub-system that captures data re-use effectively in spite of data-dependent memory accesses. This memory sub-system is made up of a custom parallel cache and a scratchpad memory (SPM). Further, the framework is capable of exploiting 2D spatial locality, which is frequently exhibited in the access patterns of image processing applications. In a case study involving the quad-tree structured difference pulse code modulation (QSDPCM) application, the impact of data dependence on memory accesses is shown to be significant. In comparison with an implementation that employs only an SPM, performance improvements of up to 1.7× and 1.4× are observed through actual implementation on two modern FPGA platforms. These performance improvements are more pronounced for image sequences exhibiting greater inter-frame movement. In addition, reductions in on-chip memory resources of up to 3.2× are achievable using this framework. These results indicate that, on custom hardware platforms, there is substantial scope for improvement in the capture of data re-use when memory accesses are data dependent.
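
As a rough illustration of why a cache organised around 2D tiles can capture re-use under data-dependent accesses, the following minimal Python sketch models a direct-mapped cache whose lines hold square pixel tiles. It is not the architecture presented in the paper: the class `TiledCache`, the helper `read_block`, the tile and set sizes, and the block origins (standing in for motion vectors known only at run time) are all hypothetical choices made for illustration.

```python
class TiledCache:
    """Toy direct-mapped cache whose lines hold 2D tiles of pixels.

    Hypothetical model for illustration only; sizes are arbitrary.
    """

    def __init__(self, tile_w=4, tile_h=4, sets_x=4, sets_y=4):
        self.tile_w, self.tile_h = tile_w, tile_h
        self.sets_x, self.sets_y = sets_x, sets_y
        self.tags = {}  # set index -> coordinates of the tile currently held
        self.hits = 0
        self.misses = 0

    def access(self, x, y):
        """Look up pixel (x, y); on a miss, load the tile that contains it."""
        tile = (x // self.tile_w, y // self.tile_h)
        index = (tile[0] % self.sets_x, tile[1] % self.sets_y)
        if self.tags.get(index) == tile:
            self.hits += 1
            return True
        # Miss: model fetching the whole tile from external memory.
        self.tags[index] = tile
        self.misses += 1
        return False


def read_block(cache, x0, y0, size=4):
    """Read a size-by-size pixel block whose origin is chosen at run time."""
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            cache.access(x, y)


cache = TiledCache()
# The origins below play the role of data-dependent offsets: they overlap,
# but their order and position are not known when the design is compiled.
for origin in [(0, 0), (2, 1), (3, 3)]:
    read_block(cache, *origin)

print(cache.hits, cache.misses)
```

In this toy run, the three overlapping blocks issue 48 pixel reads but trigger only 4 tile fetches, because neighbouring pixels in both dimensions share a cache line. A scratchpad filled by a fixed, compile-time schedule would need to hold every tile the offsets might touch, which is the kind of cost the abstract attributes to data dependence.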

Keywords

Cache, Scratchpad, Data re-use, Arbitration, Hardware

References

  1. Altera: Stratix II datasheet. http://www.altera.com/literature/hb/stx2/stx2_sii51002.pdf. Accessed 10 July 2007
  2. Absar, M., Catthoor, F.: Compiler-based approach for exploiting scratch-pad in presence of irregular array access. In: Proceedings of the Design, Automation and Test in Europe, pp. 1162–1167 (2005)
  3. Aho, J.: A quick guide to digital video resolution. http://lipas.uwasa.fi/~f76998/video/conversion/. Accessed 10 July 2007
  4. Balasubramonian, R., Albonesi, D., Buyuktosunoglu, A., Dwarkadas, S.: A dynamically tunable memory hierarchy. IEEE Trans. Comput. 52, 1243–1257 (2003)
  5. Celoxica: RC250 board specifications. http://www.celoxica.com/products/rc250/default.asp. Accessed 10 July 2007
  6. Celoxica: RC300 board specifications. http://www.celoxica.com/techlib/files/CEL-W040216143F-257.pdf. Accessed 10 July 2007
  7. Celoxica website. http://www.celoxica.com. Accessed 11 Jan 2007
  8. Cohen, E., Lewis, D.: Approximating matrix multiplication for pattern recognition tasks. J. Algorithms 30, 211–252 (1999)
  9. Danckaert, K., Catthoor, F., Man, H.D.: System level memory optimization for hardware-software co-design. In: Proceedings of the Fifth International Workshop on Hardware/Software Codesign, pp. 55–59 (1997)
  10. Danckaert, K., Masselos, K., Catthoor, F., Man, H.D.: Strategy for power efficient combined task and data parallelism exploration illustrated on a QSDPCM video codec. EUROMICRO J. Syst. Arch. 45(10), 791–808 (1999)
  11. Dhodapkar, A., Smith, J.: Managing multi-configuration hardware via dynamic working set analysis. In: Proceedings of the International Symposium on Computer Architecture, pp. 233–244 (2002)
  12. Edmondson, J., Rubinfeld, P., Bannon, P., Benschneider, B., Bernstein, D., Castelino, R., Cooper, E., Dever, D., Donchin, D., Fischer, T., Jain, A., Mehta, S., Meyer, J., Preston, R., Rajagopalan, V., Somanathan, C., Taylor, S., Wolrich, G.: Internal organization of the Alpha 21164, a 300 MHz 64-bit quad-issue CMOS RISC microprocessor. Digit. Tech. J. 7(1), 119–135 (1995)
  13. Galanis, M., Dimitroulakos, G., Kakarountas, A., Goutis, C.: Speedups from partitioning software kernels to FPGA hardware in embedded SoCs. In: Proceedings of the IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 485–490 (2005)
  14. Guo, Z., Buyukkurt, B., Najjar, W., Vissers, K.: Optimized generation of data-paths from C codes for FPGAs. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 112–118 (2005)
  15. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Menlo Park, CA (1996)
  16. Jackson, D.J., Ren, H., Wu, X., Ricks, K.G.: A hardware architecture for real-time image compression using a searchless fractal image coding method. J. Real Time Image Process. 1(3), 225–237 (2007)
  17. Jain, A.: Image data compression: a review. In: Proceedings of the IEEE, pp. 349–389 (1981)
  18. Kim, C., Burger, D., Keckler, S.: An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 211–222 (2002)
  19. Kulkarni, C., Catthoor, F., Man, H.D.: Hardware cache optimization for parallel multimedia applications. In: Proceedings of the Euro-Par’98 Parallel Processing, pp. 923–932 (1998)
  20. Kulkarni, C., Catthoor, F., Man, H.: Data and memory optimization techniques for embedded systems. In: Proceedings of the IPDPS Workshops on Parallel and Distributed Processing, pp. 186–193 (2000)
  21. Liu, Q., Constantinides, G.A., Masselos, K., Cheung, P.Y.K.: Automatic on-chip memory minimization for data reuse. In: Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (2007)
  22. Masselos, K., Catthoor, F., Goutis, C., DeMan, H.: Low power mapping of video processing applications on VLIW multimedia processors. In: Proceedings of the IEEE Alessandro Volta Memorial International Workshop on Low Power Design, pp. 52–60 (1999)
  23. Page, I., Luk, W.: Compiling Occam into FPGAs. In: Proceedings of the Field-Programmable Logic and Applications, pp. 271–283 (1991)
  24. Ranganathan, P., Adve, S., Jouppi, N.: Reconfigurable caches and their application to media processing. In: Proceedings of the International Symposium on Computer Architecture, pp. 214–224 (2000)
  25. Sohi, G., Franklin, M.: High-bandwidth data memory systems for superscalar processors. In: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 53–62 (1991)
  26. Strobach, P.: Tree-structured scene adaptive coder. IEEE Trans. Commun. 38(4), 477–486 (1990)
  27. Venkataramani, G., Chelcea, T., Goldstein, S.C., Bjerregaard, T.: SOMA: a tool for synthesizing and optimizing memory accesses in ASICs. In: Proceedings of the 3rd IEEE International Conference on Hardware/Software Codesign and System Synthesis, pp. 231–236 (2005)
  28. Xilinx: Virtex-II datasheet. http://www.xilinx.com/partinfo/ds031.pdf. Accessed 10 July 2007
  29. Zhang, C., Vahid, F., Najjar, W.: A highly configurable cache architecture for embedded systems. In: Proceedings of the International Symposium on Computer Architecture, pp. 136–146 (2003)

Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  • Su-Shin Ang (1)
  • George A. Constantinides (1)
  • Wayne Luk (2)
  • Peter Y. K. Cheung (1)
  1. Department of Electrical and Electronic Engineering, Imperial College London, London, UK
  2. Department of Computing, Imperial College, London, UK
