Journal of Signal Processing Systems, Volume 53, Issue 3, pp 301–321

Guidance of Loop Ordering for Reduced Memory Usage in Signal Processing Applications

  • Per Gunnar Kjeldsberg
  • Francky Catthoor
  • Sven Verdoolaege
  • Martin Palkovic
  • Arnout Vandecappelle
  • Qubo Hu
  • Einar J. Aas

Abstract

Data-dominated signal processing applications are typically described using large, multi-dimensional arrays and loop nests. The order of production and consumption of array elements in these loop nests has a huge impact on the amount of memory required during execution. This matters because the size and complexity of the memory hierarchy are the dominating factors for power, performance and chip size in these applications. This paper presents a number of guiding principles for the ordering of the dimensions in the loop nests. They enable the designer, or design tools, to find the optimal ordering of loop nest dimensions for individual data dependencies in the code. We prove the validity of the guiding principles when no prior restrictions are given regarding fixation of dimensions. If some dimensions are already fixed at given nest levels, this is taken into account when fixing the remaining dimensions. In most cases an optimal ordering is found for this situation as well. The guiding principles can be used in the early design phases to enable minimization of the memory requirement through in-place mapping. We use real-life examples to show how they can be applied to reach a cost-optimized end product. The results show orders of magnitude improvement in memory requirement compared to using the declared array sizes, and penalties of similar magnitude for choosing a suboptimal ordering of loops when in-place mapping is exploited.
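
The effect of loop ordering on memory use can be illustrated with a small brute-force simulation. This is only a toy sketch, not the analytical guiding principles the paper presents. It assumes a rectangular iteration domain, a single uniform dependency vector, one read per produced element, and that a consumed element is freed before the new one is stored (in-place mapping). The function name `max_live_elements` and its parameters are hypothetical.

```python
from itertools import product


def max_live_elements(extents, dep, order):
    """Peak number of simultaneously alive array elements (toy model).

    extents: loop bounds per dimension, e.g. (4, 100)
    dep:     dependency vector; the value produced at iteration p
             is consumed at iteration p + dep
    order:   dimension indices from outermost to innermost loop
             (assumed legal, i.e. dep stays lexicographically positive)
    """
    live = set()
    peak = 0
    # Walk the iterations in execution order for the chosen nesting.
    for point in product(*(range(extents[d]) for d in order)):
        it = [0] * len(extents)
        for pos, d in enumerate(order):
            it[d] = point[pos]
        it = tuple(it)

        # Consume: the producer's value is dead after its single read.
        producer = tuple(x - s for x, s in zip(it, dep))
        live.discard(producer)

        # Produce: the new value is alive only if its consumer
        # lies inside the iteration domain.
        consumer = tuple(x + s for x, s in zip(it, dep))
        if all(0 <= c < e for c, e in zip(consumer, extents)):
            live.add(it)

        peak = max(peak, len(live))
    return peak


# Declared array size would be 4 * 100 = 400 elements.
outer_i = max_live_elements((4, 100), (1, 0), order=(0, 1))
outer_j = max_live_elements((4, 100), (1, 0), order=(1, 0))
print(outer_i, outer_j)
```

In this toy case, placing the dimension that carries the dependency innermost means each value is consumed in the very next iteration, so far fewer elements are alive at once than when that dimension is outermost.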

Keywords

Memory optimization; Memory architecture exploration; High level synthesis; Code transformation; Multi-media

Abbreviations

DP: dependency part
LR: length ratio
DV: dependency vector
ND: nonspanning dimension
DVP: dependency vector polytope
SD: spanning dimension
ID: iteration domain
UB: upper bound
LB: lower bound

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Per Gunnar Kjeldsberg (1), corresponding author
  • Francky Catthoor (2, 3)
  • Sven Verdoolaege (3, 4)
  • Martin Palkovic (2)
  • Arnout Vandecappelle (2)
  • Qubo Hu (1)
  • Einar J. Aas (1)

  1. Norwegian University of Science and Technology, Trondheim, Norway
  2. IMEC, Leuven, Belgium
  3. Katholieke Universiteit Leuven, Leuven, Belgium
  4. Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
