Abstract
Within the past five years, many manufactures have added hardware performance counters to their microprocessors to generate profile data cheaply. We show how to use Compaq’s DCPI tool to determine load latencies which are at a fine, instruction granularity and use them as fodder for improving instruction scheduling. We validate our heuristic for using DCPI latency data to classify loads as hits and misses against simulation numbers. We map our classification into the Multiflow compiler’s intermediate representation, and use a locality sensitive Balanced scheduling algorithm. Our experiments illustrate that our algorithm improves run times by 1% on average, but up to 10% on a Compaq Alpha.
This work is supported by EU Project 28198; NSF grants EIA-9726401, CDA-9502639, and a CAREER Award CCR-9624209; Darpa grant 5-21425; Compaq and by LTR Esprit project 24942 MHAOTEU. Any opinions, findings, or conclusions expressed are the authors’ and not necessarily the sponsors’.
Chapter PDF
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In Proceedings of the SIGPLAN’ 97 Conference on Programming Language Design and Implementation, pages 85–96, Las Vegas, NV, June 1997.
G. Ammons and J. R. Larus. Improving data-flow analysis with path profiles. In Proceedings of the SIGPLAN’ 98 Conference on Programming Language Design and Implementation, pages 72–84, Montreal, Canada, June 1998.
J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S. A. Leung, R. L. Sites, M. T. Vandervoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: Where have all the cycles gone?ACM Transactions on Computer Systems, 15(4):357–390, November 1997.
S. Carr. Combining optimization for cache and instruction-level parallelism. In The 1996 International Conference on Parallel Architectures and Compilation Techniques, Boston, MA, October 1996.
J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos. ProfileMe: Hardware support for instruction level profiling on out-of-order processors. In Proceedings of the 30th International Symposium on Micro architecture, Research Triangle Park, NC, December 1997.
Chen Ding, Steve Carr, and Phil Sweany. Modulo scheduling with cache reuse information. In Proceedings of EuroPar’ 97, pages 1079–1083, August 1997.
J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478–490, July 1981.
D. R. Kerns and S. Eggers. Balanced scheduling: Instruction scheduling when memory latency is uncertain. In Proceedings of the SIGPLAN’ 93 Conference on Programming Language Design and Implementation, pages 278–289, Albuquerque, NM, June 1993.
C. Liao, M. Martonosi, and D. W. Clark. Performance monitoring in a myrinet-connected shrimp cluster. In 1998 ACM Sigmetrics Symposium on Parallel and Distributed Tools, August 1998.
J. L. Lo and S. J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Proceedings of the SIGPLAN’ 95 Conference on Programming Language Design and Implementation, pages 151–162, San Diego, CA, June 1995.
P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O’Donnell, and J. C. Ruttenberg. The multiflow trace scheduling compiler. The Journal of Supercomputing, pages 51–143, 1993.
F. Jesus Sanchez and Antonio Gonzales. Cache sensitive modulo scheduling. In The 1997 International Conference on Parallel Architectures and Compilation Techniques, pages 261–271, November 1997.
A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the SIGPLAN’ 94 Conference on Programming Language Design and Implementation, pages 196–205, Orlando, FL, June 1994.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lindenmaier, G., McKinley, K.S., Temam, O. (2000). Load Scheduling with Profile Information. In: Bode, A., Ludwig, T., Karl, W., Wismüller, R. (eds) Euro-Par 2000 Parallel Processing. Euro-Par 2000. Lecture Notes in Computer Science, vol 1900. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44520-X_31
Download citation
DOI: https://doi.org/10.1007/3-540-44520-X_31
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67956-1
Online ISBN: 978-3-540-44520-3
eBook Packages: Springer Book Archive