Incremental performance contributions of hardware concurrency extraction techniques

  • Augustus K. Uht
Session 4B: Compilers And Restructuring Techniques II
Part of the Lecture Notes in Computer Science book series (LNCS, volume 297)


Recently, new techniques for the hardware extraction of low-level concurrency from sequential instruction streams have been proposed in the form of the CONDEL machine models. The initial technique increased the concurrency extracted by using reduced (minimal) semantic dependencies arising from branches (procedural dependencies). Another scheme additionally implements reduced data dependencies. A form of branch prediction is also used to further improve performance. Although it has been demonstrated that all of these and other techniques improve performance, the relative degree of improvement for each new technique has not been shown; nor has the new set of procedural dependencies been enumerated. These issues are addressed in this paper.
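The data dependencies referred to above are commonly classified by comparing the read and write sets of instruction pairs (Bernstein's conditions). As a minimal illustrative sketch, not the paper's actual hardware mechanism, the following hypothetical helper classifies the dependencies that force one instruction to wait for an earlier one:

```python
# Illustrative sketch (names are hypothetical, not from the paper):
# classify the data dependencies between two instructions by
# intersecting their read/write sets, per Bernstein's conditions.

def data_dependency(writes1, reads1, writes2, reads2):
    """Return the kinds of data dependency that force instruction 2
    to wait for instruction 1, given each instruction's register sets."""
    deps = []
    if writes1 & reads2:
        deps.append("flow")    # true dependency: 2 reads what 1 wrote
    if reads1 & writes2:
        deps.append("anti")    # 2 overwrites a value 1 still reads
    if writes1 & writes2:
        deps.append("output")  # both write the same location
    return deps

# r1 = r2 + r3  followed by  r4 = r1 * r5  gives a flow dependency:
print(data_dependency({"r1"}, {"r2", "r3"}, {"r4"}, {"r1", "r5"}))
```

Reducing these dependencies to a minimal set, as the CONDEL work does for procedural dependencies, directly enlarges the pool of instructions eligible to execute concurrently.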

Data dependencies are defined and discussed. A minimal set of procedural dependencies is presented and described, with examples. CONDEL and other hardware concurrent machine models are compared in terms of the degrees of concurrency realized. Specific performance data for the CONDEL machine models are obtained and analyzed, and the experimental method, based on simulations, is described. Simulation results for both a general-purpose set and a scientific set of benchmarks are presented and analyzed, showing that the performance improvements from the concurrency extraction techniques are orthogonal. Most of the techniques demonstrate significant incremental performance contributions; the exceptions are the technique giving reduced data dependencies and the technique providing enhanced handling of special types of branches, such as calls and returns. It must be noted, though, that the reduced data dependency technique should still be used, since it is a crucial part of the branch prediction technique, which does give a large performance improvement. In summary, all of the concurrency extraction techniques are useful in improving performance.
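Why fewer dependencies mean more extracted concurrency can be seen with a small scheduling sketch. This is an illustrative greedy model, not the CONDEL hardware algorithm: each instruction is assigned the earliest cycle permitted by its dependency predecessors, so shrinking the dependency set shortens the schedule.

```python
# Illustrative sketch: greedy earliest-cycle scheduling under a
# dependency relation. `deps` maps instruction index -> set of
# predecessor indices that must complete first. Hypothetical example,
# not the paper's machine model.

def schedule(deps, n):
    """Assign each of n instructions (in program order) the earliest
    cycle after all of its dependency predecessors."""
    cycle = {}
    for i in range(n):
        preds = deps.get(i, set())
        cycle[i] = 1 + max((cycle[p] for p in preds), default=-1)
    return cycle

# A conservative (fully serial) dependency chain over 4 instructions...
serial = schedule({1: {0}, 2: {1}, 3: {2}}, 4)
# ...versus reduced dependencies letting instructions 1 and 2 overlap.
reduced = schedule({1: {0}, 2: {0}, 3: {1, 2}}, 4)
print(max(serial.values()) + 1, max(reduced.values()) + 1)  # 4 vs 3 cycles
```

The same four instructions finish in three cycles instead of four once the spurious dependency is removed, which is the incremental-contribution effect the paper measures per technique.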







Copyright information

© Springer-Verlag Berlin Heidelberg 1988

Authors and Affiliations

  • Augustus K. Uht
  1. Dept. of Computer Science and Engineering, C-014, University of California, San Diego, La Jolla, U.S.A.
