Abstract
The performance of statically scheduled VLIW processors is highly sensitive to the instruction scheduling performed by the compiler. In this work we identify a major deficiency in existing instruction scheduling for VLIW processors. Unlike most dynamically scheduled processors, a VLIW processor with no load-use hardware interlocks stalls completely when any of the operations scheduled to run in parallel misses in the cache, so the other operations in the same and subsequent instruction words must also stall. However, when coupled with non-blocking caches, the VLIW processor can resolve multiple loads from the same instruction word simultaneously. Existing instruction scheduling algorithms do not optimize for this VLIW-specific problem.
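The stall behaviour described above can be illustrated with a small cost model. This is only a sketch under simplifying assumptions that are not taken from the paper: every miss has the same fixed latency (`MISS_LATENCY`, a hypothetical value), misses issued from the same instruction word overlap fully, and misses in different words are serviced one after another. The function name `stall_cycles` is likewise illustrative.

```python
from typing import List

MISS_LATENCY = 100  # assumed miss penalty in cycles (hypothetical value)


def stall_cycles(schedule: List[List[int]]) -> int:
    """Return the total stall cycles of a schedule.

    schedule[i] holds the miss latency of each load in instruction word i
    (0 for a hit). With a non-blocking cache the misses of one word overlap,
    so the word stalls only for its longest miss; stalls of different words
    add up because later words cannot issue while the processor is stalled.
    """
    return sum(max(word, default=0) for word in schedule)


# Two missing loads placed in different instruction words.
spread = [[MISS_LATENCY, 0], [MISS_LATENCY, 0]]
# The same two missing loads placed in the same instruction word.
aligned = [[MISS_LATENCY, MISS_LATENCY], [0, 0]]

print(stall_cycles(spread))   # 200
print(stall_cycles(aligned))  # 100
```

Under this model, placing both misses in one word halves the stall time, which is the effect a miss-aware scheduler would try to exploit.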
We propose Aligned Scheduling, a novel instruction scheduling algorithm that improves the performance of VLIW processors with non-blocking caches by enabling them to better cope with unpredictable cache-memory latencies. Aligned Scheduling exploits the VLIW-specific cache-miss semantics to efficiently align cache misses on the same scheduling cycle, increasing the probability that they are serviced simultaneously. Our evaluation shows that Aligned Scheduling improves the performance of VLIW processors by up to 20% across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites.
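The intuition behind aligning misses can be sketched with a toy packing heuristic. This is not the Aligned Scheduling algorithm itself, whose details are not given in the abstract. It assumes independent loads with no data dependences, a per-load miss probability that is known in advance, independent misses, and a fixed miss penalty. All names (`expected_stall`, `pack`, `pack_aligned`) and all numbers are hypothetical.

```python
from math import prod
from typing import List, Sequence

MISS_LATENCY = 100  # assumed miss penalty in cycles (hypothetical value)


def expected_stall(words: Sequence[Sequence[float]],
                   latency: int = MISS_LATENCY) -> float:
    """Expected stall cycles when each entry is a load's miss probability.

    A word stalls for `latency` cycles if at least one of its loads misses;
    misses are assumed independent and fully overlapped within a word.
    """
    return sum(latency * (1.0 - prod(1.0 - p for p in word)) for word in words)


def pack(probs: Sequence[float], width: int) -> List[List[float]]:
    """Fill instruction words with `width` loads each, in the given order."""
    return [list(probs[i:i + width]) for i in range(0, len(probs), width)]


def pack_aligned(probs: Sequence[float], width: int) -> List[List[float]]:
    """Toy heuristic: group the loads most likely to miss into the same words."""
    return pack(sorted(probs, reverse=True), width)


miss_probs = [0.9, 0.1, 0.9, 0.1]  # made-up per-load miss probabilities
print(round(expected_stall(pack(miss_probs, 2)), 1))
print(round(expected_stall(pack_aligned(miss_probs, 2)), 1))
```

With these made-up probabilities the expected stall drops from roughly 182 to roughly 118 cycles, because the two likely misses now overlap in one word instead of each stalling its own word. A real scheduler must also respect data dependences and resource constraints, which this sketch ignores.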
This work was supported in part by the EC under grant ERA 249059 (FP7).
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Porpodas, V., Cintra, M. (2014). Aligned Scheduling: Cache-Efficient Instruction Scheduling for VLIW Processors. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_16
DOI: https://doi.org/10.1007/978-3-319-09967-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09966-8
Online ISBN: 978-3-319-09967-5
eBook Packages: Computer Science (R0)