Aligned Scheduling: Cache-Efficient Instruction Scheduling for VLIW Processors

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 8664)

Abstract

The performance of statically scheduled VLIW processors is highly sensitive to the instruction scheduling performed by the compiler. In this work we identify a major deficiency in existing instruction scheduling for VLIW processors. Unlike most dynamically scheduled processors, a VLIW processor with no load-use hardware interlocks stalls completely when any of the operations scheduled to run in parallel suffers a cache miss: all other operations, in the same instruction word and in subsequent ones, must wait. However, when coupled with non-blocking caches, the VLIW processor can service multiple missing loads from the same instruction word simultaneously. Existing instruction scheduling algorithms do not optimize for this VLIW-specific behavior.
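To make these stall semantics concrete, here is a minimal Python sketch of a toy cost model; it is not the paper's simulator or evaluation setup. It assumes (hypothetically) that each instruction word issues in one cycle, that any miss in a word stalls the whole processor for a fixed penalty, and that a non-blocking cache overlaps all misses in the same word while a blocking cache serves them one after another. The names `total_cycles` and `MISS_PENALTY` and the 100-cycle latency are illustrative only.

```python
from typing import List

# Hypothetical miss latency in cycles (illustrative, not from the paper).
MISS_PENALTY = 100


def total_cycles(words: List[List[bool]], penalty: int = MISS_PENALTY,
                 non_blocking: bool = True) -> int:
    """Toy cycle count for a lockstep VLIW with no load-use interlocks.

    words[i] holds one flag per load in instruction word i: True if it misses.
    """
    cycles = 0
    for word in words:
        cycles += 1                 # one issue cycle per instruction word
        misses = sum(word)          # number of missing loads in this word
        if misses:
            # Any miss stalls the entire processor. A non-blocking cache
            # overlaps the misses of one word; a blocking cache serializes them.
            cycles += penalty if non_blocking else penalty * misses
    return cycles


split = [[True, False], [True, False]]    # two misses in different words
aligned = [[True, True], [False, False]]  # the same two misses in one word

print(total_cycles(split))                        # 202
print(total_cycles(aligned))                      # 102
print(total_cycles(aligned, non_blocking=False))  # 202
```

In this toy model, placing both misses in one word halves the stall time only when the cache is non-blocking; with a blocking cache the two layouts cost the same, which matches the abstract's point that the opportunity arises from combining VLIW stall semantics with non-blocking caches.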

We propose Aligned Scheduling, a novel instruction scheduling algorithm that improves the performance of VLIW processors with non-blocking caches by enabling them to better cope with unpredictable cache-memory latencies. Aligned Scheduling exploits the VLIW-specific cache-miss semantics to efficiently align cache misses on the same scheduling cycle, increasing the probability that they are serviced simultaneously. Our evaluation shows that Aligned Scheduling improves the performance of VLIW processors by up to 20% across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites.
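The general idea of aligning misses can be illustrated with a deliberately simplified packing heuristic. This is a hypothetical sketch, not the Aligned Scheduling algorithm from the paper: it assumes independent loads with no data dependences, per-load miss probabilities known in advance, statistically independent misses, a fixed issue width, and a stall model in which a word containing any miss costs one fixed penalty no matter how many of its loads miss. The functions `expected_stall`, `pack_in_order`, and `pack_aligned` are illustrative names.

```python
from math import prod
from typing import List, Sequence


def expected_stall(words: Sequence[Sequence[float]], penalty: int = 100) -> float:
    """Expected stall cycles: a word stalls once if at least one load misses."""
    # P(word stalls) = 1 - P(no load in the word misses), assuming independence.
    return sum(penalty * (1 - prod(1 - p for p in word)) for word in words)


def pack_in_order(miss_probs: List[float], width: int) -> List[List[float]]:
    """Baseline: fill instruction words in program order, ignoring misses."""
    return [miss_probs[i:i + width] for i in range(0, len(miss_probs), width)]


def pack_aligned(miss_probs: List[float], width: int) -> List[List[float]]:
    """Toy alignment: put the loads most likely to miss in the same words."""
    return pack_in_order(sorted(miss_probs, reverse=True), width)


probs = [0.9, 0.0, 0.9, 0.0]  # hypothetical per-load miss probabilities
print(pack_in_order(probs, 2))  # [[0.9, 0.0], [0.9, 0.0]]
print(pack_aligned(probs, 2))   # [[0.9, 0.9], [0.0, 0.0]]
print(round(expected_stall(pack_in_order(probs, 2)), 6))  # 180.0
print(round(expected_stall(pack_aligned(probs, 2)), 6))   # 99.0
```

Grouping the two likely misses into one word lowers the expected stall from about 180 to about 99 cycles in this toy example. A real scheduler must also respect data dependences, operation latencies, and functional-unit constraints, all of which this sketch ignores.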

This work was supported in part by the EC under grant ERA 249059 (FP7).

Author information

Correspondence to Vasileios Porpodas.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Porpodas, V., Cintra, M. (2014). Aligned Scheduling: Cache-Efficient Instruction Scheduling for VLIW Processors. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_16

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5

  • eBook Packages: Computer Science, Computer Science (R0)