
Loop Distribution and Fusion with Timing and Code Size Optimization

Abstract

In this paper, a technique that combines loop distribution with maximum direct loop fusion (LD_MDF) is proposed. The technique first performs maximum loop distribution and then maximum direct loop fusion, optimizing timing and code size simultaneously. Loop distribution theorems are proved that state the conditions under which any multi-level nested loop can be maximally distributed. It is proved that the statements involved in a dependence cycle can be fully distributed if the sum of the edge weights along the cycle satisfies a certain condition; otherwise, those statements must be placed in the same loop after distribution. Based on these theorems, algorithms are designed to perform maximum loop distribution. The maximum direct loop fusion problem is then mapped to a graph partitioning problem, and a polynomial-time graph partitioning algorithm is developed to compute the fusion partitions. It is proved that the proposed maximum direct loop fusion algorithm produces the fewest resultant loop nests without violating dependence constraints. It is also shown that the code size of the loops produced by LD_MDF is smaller than that of the original loops whenever the number of fused loops is less than the number of original loops. Simulation results are presented to validate the proposed technique.


Figures 1–17 (images not available in this preview)


Author information

Correspondence to Meilin Liu.

Additional information

This work is partially supported by NSF CCR-0309461, NSF IIS-0513669, WSU-666781, WSU-282025.


About this article

Cite this article

Liu, M., Sha, E.H.-M., Zhuge, Q. et al. Loop Distribution and Fusion with Timing and Code Size Optimization. J Sign Process Syst 62, 325–340 (2011). https://doi.org/10.1007/s11265-010-0465-x


Keywords

  • Loop distribution
  • Loop fusion
  • Graph partitioning
  • Code size
  • Embedded DSP