Instruction-level parallel processing: History, overview, and perspective

The Journal of Supercomputing

Abstract

Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.


References

  • Acosta, R.D., Kjelstrup, J., and Torng, H.C. 1986. An instruction issuing approach to enhancing performance in multiple function unit processors.IEEE Trans. Comps., C-35, 9 (Sept.): 815–828.

    Google Scholar 

  • Adam, T.L., Chandy, K.M., and Dickson, J.R. 1974. A comparison of list schedules for parallel processing systems.CACM, 17, 12 (Dec.): 685–690.

    Google Scholar 

  • Advanced Micro Devices. 1989.Am29000 Users Manual. Pub. no. 10620B, Advanced Micro Devices, Sunnyvale, Calif.

    Google Scholar 

  • Agerwala, T. 1976. Microprogram optimization: A survey.IEEE Trans. Comps., C-25, 10 (Oct.): 962–973.

    Google Scholar 

  • Agerwala, T., and Cocke, J. 1987. High performance reduced instruction set processors. Tech. rept. RC12434 (#55845), IBM Thomas J. Watson Research Center, Yorktown Heights, N.Y.

    Google Scholar 

  • Aho, A.V., and Johnson, S.C. 1976. Optimal code generation for expression trees.JACM, 23 3 (July): 488–501.

    Google Scholar 

  • Aho, A.V., Johnson, S.C., and Ullman, J.D. 1977a. Code generation for expressions with common subexpressions.JACM, 24, 1 (Jan.): 146–160.

    Google Scholar 

  • Aho, A.V., Johnson, S.C., and Ullman, J.D. 1977b. Code generation for machines with multiregister operations. InProc., Fourth ACM Symp. on Principles of Programming Languages, pp. 21–28.

  • Aiken, A., and Nicolau, A. 1988a. Optimal loop parallelization. InProc., SIGPLAN'88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 308–317.

  • Aiken, A., and Nicolau, A. 1988b. Perfect pipelining: A new loop parallelization technique. InProc., 1988 European Symp. on Programming, Springer Verlag, New York, pp. 221–235.

    Google Scholar 

  • Aiken, A., and Nicolau, A. 1991. A realistic resource-constrained software pipelining algorithm. InAdvances in Languages and Compilers for Parallel Processing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, London, pp. 274–290.

    Google Scholar 

  • Allen, J.R., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. InProc., Tenth Annual ACM Symp. on Principles of Programming Languages (Jan.): pp. 177–189.

    Google Scholar 

  • Anderson, D.W., Sparacio, F.J., and Tomasulo, R.M. 1967. The System/360 Model 91: Machine philosophy and instruction handling.IBM J. Res. and Dev., 11, 1 (Jan.): 8–24.

    Google Scholar 

  • Apollo Computer. 1988.The Series 10000 Personal Supercomputer: Inside a New Architecture. Publication no. 002402-007 2-88, Apollo Computer, Inc., Chelmsford, Mass.

    Google Scholar 

  • Arvind and Gostelow, K. 1982. The U-interpreter.Computer, 15, 2 (Feb.): 12–49.

    Google Scholar 

  • Arvind and Kathail, V. 1981. A multiple processor dataflow machine that supports generalised procedures. InProc., Eighth Annual Symp. on Computer Architecture (May): pp. 291–302.

    Google Scholar 

  • Auslander, M., and Hopkins, M. 1982. An overview of the PL.8 compiler. InProc., ACM SIGPLAN Symp. on Compiler Construction (Boston, June), pp. 22–31.

  • Bahr, R., Ciavaglia, S., Flahive, B., Kline, M., Mageau, P., and Nickel, D. 1991. The DN10000TX: A new high-performance PRISM processor. InProc., COMPCON '91, pp. 90–95.

  • Baker, K.R. 1974.Introduction to Sequencing and Scheduling. John Wiley, New York.

    Google Scholar 

  • Beck, G.R., Yen, D.W.L., and Anderson, T.L. 1993. The Cydra 5 minisupercomputer: Architecture and implementation.The J. Supercomputing, 7, 1/2: 143–180.

    Google Scholar 

  • Bell, C.G., and Newell, A. 1971.Computer Structures: Readings and Examples. McGraw-Hill, New York.

    Google Scholar 

  • Bernstein, D., and Rodeh, M. 1991. Global instruction scheduling for superscalar machines. InProc., SIGPLAN '91 Conf. on Programming Language Design and Implementation (June), pp. 241–255.

    Google Scholar 

  • Bernstein, D., Cohen, D., and Krawczyk, H. 1991. Code duplication: An assist for global instruction scheduling. InProc., 24th Annual Internat. Symp. on Microarchitecture (Albuquerque, N.Mex.), pp. 103–113.

  • Blanck, G., and Krueger, S. 1992. The SuperSPARCTM microprocessor. InProc., COMPCON '92, pp. 136–141.

  • Bloch, E. 1959. The engineering design of the STRETCH computer. InProc., Eastern Joint Computer Conf, pp. 48–59.

  • Bruno, J.L., and Sethi, R. 1976. Code generation for a one-register machine.JACM, 23, 3 (July): 502–510.

    Google Scholar 

  • Buchholz, W., ed. 1962.Planning a Computer System: Project Stretch. McGraw-Hill, New York.

    Google Scholar 

  • Butler, M., Yeh, T., Patt., Y., Alsup, M., Scales, H., and Shebanow, M. 1991. Single instruction stream parallelism is greater than two. InProc., Eighteenth Annual Internat. Symp. on Computer Architecture (Toronto), pp. 276–286.

  • Callahan, D., and Koblenz, B. 1991. Register allocation via hierarchical graph coloring. InProc., SIGPLAN '91 Conf. on Programming Language Design and Implementation (Toronto, June), pp. 192–203.

  • Callahan, D., Carr, S., and Kennedy, K. 1990. Improving register allocation for subscripted variables. InProc., ACM SIGPLAN '90 Conf. on Programming Language Design and Implementation, (White Plains, N.Y, June), pp. 53–65.

  • Carpenter, B.E., and Doran, R.W., eds. 1986.A.M. Turing's ACE Report of 1946 and Other Papers. MIT Press, Cambridge, Mass.

    Google Scholar 

  • Chaitin, G.J. 1982. Register allocation and spilling via graph coloring. InProc., ACM SIGPLAN Symp. on Compiler Construction (Boston, June), pp. 98–105.

  • Chang, P.P., and Hwu, W.W. 1988. Trace selection for compiling large C application programs to microcode. InProc., 21st Annual Workshop on Microprogramming and Microarchitectures (San Diego, Nov.), pp. 21–29.

  • Chang, P.P., and Hwu, W.W. 1992. Profile-guided automatic inline expansion for C programs.Software—Practice and Experience, 22, 5 (May): 349–376.

    Google Scholar 

  • Chang, P.P., Lavery, D.M., and Hwu, W.W. 1991. The importance of prepass code scheduling for superscalar and superpipelined processors. Tech. Rept. no. CRHC-91-18, Center for Reliable and High-Performance Computing, Univ. of Ill, Urbana-Champaign, Ill.

    Google Scholar 

  • Chang, P.P., Mahlke, S.A., Chen, W.Y., Warter, N.J., and Hwu, W.W. 1991. IMPACT: An architectural framework for multiple-instruction-issue processors. InProc., 18th Annual Internat. Symp. on Computer Architecture (Toronto, May), pp. 266–275.

  • Charlesworth, A.E. 1981. An approach to scientific array processing: The architectural design of the AP-120B/ FPS-164 family.Computer, 14, 9: 18–27.

    Google Scholar 

  • Chen, T.C. 1971. Parallelism, pipelining, and computer efficiency.Computer Design, 10, 1 (Jan.): 69–74,

    Google Scholar 

  • Chen, T.C. 1975. Overlap and pipeline processing. InIntroduction to Computer Architecture (H.S. Stone, ed.), Science Research Associates, Chicago, pp. 375–431.

    Google Scholar 

  • Chow, F., and Hennessy, J. 1984. Register allocation by priority-based coloring. InProc., ACM SIGPLAN Symp. on Compiler Construction (Montreal, June), pp. 222–232.

  • Chow, F.C., and Hennessy, J.L. 1990. The priority-based coloring approach to register allocation.ACM Trans. Programming Languages and Systems, 12 (Oct.): 501–536.

    Google Scholar 

  • Coffman, J.R., ed. 1976.Computer and Job-Shop Scheduling Theory. John Wiley, New York.

    Google Scholar 

  • Coffman, E.G., and Graham, R.L. 1972. Optimal scheduling for two processor systems.Acta Informatica, 1, 3: 200–213.

    Google Scholar 

  • Cohen, D. 1978. A methodology for programming a pipeline array processor. InProc., 11th Annual Microprogramming Workshop (Asilomar, Calif., Nov.), pp. 82–89.

  • Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., and Rodman, P.K. 1988. A VLIW architecture for a trace scheduling compiler.IEEE Trans. Comps., C-37, 8 (Aug.): 967–979.

    Google Scholar 

  • Colwell, R.P., Hall, W.E., Joshi, C.S., Papworth, D.B., Rodman, P.K., and Tornes, J.E. 1990. Architecture and implementation of a VLIW supercomputer. InProc., Supercomputing '90 (Nov.), pp. 910–919.

    Google Scholar 

  • Cotten, L.W. 1965. Circuit implementation of high-speed pipeline systems. InProc., AFIPS Fall Joint Computing Conf., pp. 489–504.

  • Cotten, L.W. 1969. Maximum-rate pipeline systems. InProc., AFIPS Spring Joint Computing Conf., 581–586.

  • Danelutto, M., and Vanneschi, M. 1990. VLIW in-the-large: A model for fine grain parallelism exploitation of distributed memory multiprocessors. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Nov.), pp. 7–16.

    Google Scholar 

  • Dasgupta, S., and Tartar, J. 1976. The identification of maximal parallelism in straight-line microprograms.IEEE Trans. Comps., C-25, 10 (Oct.): 986–991.

    Google Scholar 

  • Davidson, E.S. 1971. The design and control of pipelined function generators. InProc., 1971 Internat. IEEE Conf. on Systems, Networks, and Computers (Oaxtepec, Mexico, Jan.), pp. 19–21.

  • Davidson, E.S. 1974. Scheduling for pipelined processors. InProc., 7th Hawaii Conf. on Systems Sciences, pp. 58–60.

  • Davidson, S., Landskov, D., Shriver, B.D., and Mallett, P.W. 1981. Some experiments in local microcode compaction for horizontal machines.IEEE Trans. Comps., C-30, 7: 460–477.

    Google Scholar 

  • Davidson, E.S., Shar, L.E., Thomas, A.T., and Patel, J.H. 1975. Effective control for pipelined computers. InProc., COMPCON '90 (San Francisco, Feb.), pp. 181–184.

  • Dehnert, J.C., and Towle, R.A. 1993. Compiling for the Cydra 5.The J. Supercomputing, 7, 1/2: 181–227.

    Google Scholar 

  • Dehnert, J.C., Hsu, P.Y.-T., and Bratt, J.P. 1989. Overlapped loop support in the Cydra 5. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 26–38.

  • DeLano, E., Walker, W., Yetter, J., and Forsyth, M. 1992. A high speed superscalar PA-RISC processor.In Proc., COMPCON '92 (Feb.), pp. 116–121.

  • DeWitt, DJ. 1975. A control word model for detecting conflicts between microprograms. InProc., 8th Annual Workshop on Microprogramming (Chicago, Sept.), pp. 6–12.

  • Diefendorff, K., and Allen, M. 1992. Organization of the Motorola 88110 superscalar RISC microprocessor.IEEE Micro, 12, 2 (Apr.): 40–63.

    Google Scholar 

  • Dongarra, J.J. 1986, A survey of high performance computers. InProc., COMPCON '86 (Mar.), pp. 8–11.

  • Dwyer, H., and Torng, H.C. 1992. An out-of-order superscalar processor with speculative execution and fast, precise interrupts. InProc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec.), pp. 272–281.

  • Ebcioglu, K. 1988. Some design ideas for a VLIW architecture for sequential-natured software. InParallel Processing (Proc., IFIP WG 10.3 Working Conf. on Parallel Processing, Pisa, Italy) (M. Cosnard, M.H. Barton, and M. Vanneschi, eds.), North-Holland, pp. 3–21.

  • Ebcioglu, K., and Nakatani, T. 1989. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. InLanguages and Compilers for Parallel Computing (D. Gelernter, A. Nicolau, and D. Padua, eds.), Pitman/MIT Press, London, pp. 213–229.

    Google Scholar 

  • Ebcioglu, K., and Nicolau, A. 1989. Aglobal resource-constrained parallelization technique. InProc., 3rd Internat. Conf. on Supercomputing (Crete, Greece, June), pp. 154–163.

  • Eckert, J.P., Chu, J.C., Tonik, A.B., and Schmitt, W.F. 1959. Design of UNIVAC-LARC System: I. InProc., Eastern Joint Computer Conf., pp. 59–65.

  • Ellis, J.R. 1986.Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, Mass.

    Google Scholar 

  • Fawcett, B.K. 1975. Maximal clocking rates for pipelined digital systems. M.S. thesis, Univ. of Ill., Urbana-Champaign, Ill.

    Google Scholar 

  • Fernandez, E.B., and Bussel, B. 1973. Bounds on the number of processors and time for multiprocessor optimal schedule.IEEE Trans. Comps., C-22, 8 (Aug.): 745–751.

    Google Scholar 

  • Fisher, J.A. 1979. The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources, Ph.D. thesis, New York Univ., New York.

    Google Scholar 

  • Fisher, J.A. 1980. 2N-way jump microinstruction hardware and an effective instruction binding method. InProc., 13th Annual Workshop on Microprogramming (Colorado Springs, Colo., Nov.), pp. 64–75.

  • Fisher, J.A. 1981. Trace scheduling: A technique for global microcode compaction.IEEE Trans. Comps., C-30, 7 (July): 478–490.

    Google Scholar 

  • Fisher, J.A. 1983. Very long instruction word architectures and the ELI-512. InProc., Tenth Annual Internat. Symp. on Computer Architecture (Stockholm, June), pp. 140–150.

  • Fisher, J.A. 1992. Trace Scheduling-2, an extension of trace scheduling. Tech. rept., Hewlett-Packard Laboratories.

  • Fisher, J.A., and Freudenberger, S.M. 1992. Predicting conditional jump directions from previous runs of a program. InProc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 85–95.

  • Fisher, J.A., Landskov, D., and Shriver, B.D. 1981. Microcode compaction: Looking backward and looking forward. InProc., 1981 Nat. Computer Conf., pp. 95–102.

  • Fisher, J.A., Ellis, J.R., Ruttenberg, J.C., and Nicolau, A. 1984. Parallel processing: A smart compiler and a dumb machine. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 37–47.

  • Floating Point Systems. 1979.FPS AP-120B Processor Handbook. Floating Point Systems, Inc., Beaverton, Ore.

    Google Scholar 

  • Foster, C.C., and Riseman, E.M. 1972. Percolation of code to enhance parallel dispatching and execution.IEEE Trans. Comps., C-21, 12 (Dec): 1411–1415.

    Google Scholar 

  • Franklin, M., and Sohi, G.S. 1992. The expandable split window paradigm for exploiting fine-grain parallelism. InProc. 19th Annual International Symp. on Computer Architecture (Gold Coast, Australia, May), pp. 58–67.

  • Freudenberger, S.M., and Ruttenberg, J.C. 1992. Phase ordering of register allocation and instruction scheduling. InCode Generation—Concepts, Tools, Techniques: Proc., Internat. Workshop on Code Generation, May 1991 (R. Giegerich, and S.L. Graham, eds.), Springer-Verlag, London, pp. 146–172.

    Google Scholar 

  • Gasperoni, F. 1989. Compilation techniques for VLIW architectures. Tech. rept. RC 14915, IBM Research Div., T.J. Watson Research Center, Yorktown Heights, N.Y.

    Google Scholar 

  • Gibbons, P.B., and Muchnick, S.S. 1986. Efficient instruction scheduling for a pipelined architecture. InProc., ACM SIGPLAN '86 Symp. on Compiler Construction (Palo Alto, Calif., July), pp. 11–16.

  • Golumbic, M.C., and Rainish, V. 1990. Instruction schedulig beyond basic blocks.IBM J. Res. and Dev., 34, 1 (Jan.): 93–97.

    Google Scholar 

  • Gonzalez, M.J. 1977. Deterministic processor scheduling.ACM Computer Surveys, 9, 3 (Sept.): 173–204.

    Google Scholar 

  • Goodman, J.R., and Hsu, W.-C. 1988. Code scheduling and register allocation in large basic blocks. InProc., 1988 Internat. Conf. on Supercomputing (St. Malo, France, July), pp. 442–452.

  • Grishman, R., and Su, B. 1983. A preliminary evaluation of trace scheduling for global microcode compaction.IEEE Trans. Comps., C-32, 12 (Dec): 1191–1194.

    Google Scholar 

  • Gross, T.R., and Hennessy, J.L. 1982. Optimizing delayed branches. InProc., 15th Annual Workshop on Micro-programming (Oct.), pp. 114–120.

  • Gross, T., and Ward, M. 1990. The suppression of compensation code. InAdvances in Languages and Compilers for Parallel Computing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, London, pp. 260–273.

    Google Scholar 

  • Gurd, J., Kirkham, C.C., and Watson, I. 1985. The Manchester prototype dataflow computer.CACM, 28, 1(Jan.): 34–52.

    Google Scholar 

  • Hallin, T.G., and Flynn, M.J. 1972. Pipelining of arithmetic functions.IEEE Trans. Comps., C-21, 8 (Aug.): 880–886.

    Google Scholar 

  • Hendren, L.J., Gao, G.R., Altman, E.R., and Mukerji, C. 1992. Register allocation using cyclic interval graphs: A new approach to an old problem. ACAPS Tech. Memo 33, Advanced Computer Architecture and Program Structures Group, McGill Univ., Montreal.

    Google Scholar 

  • Hennessy, J.L., and Gross, T. 1983. Post-pass code optimization of pipelined constraints.ACM Trans. Programming Languages and Systems, 5, 3 (July): 422–448.

    Google Scholar 

  • Hennessy, J., Jouppi, N., Baskett, F., Gross, T., and Gill, J. 1982. Hardware/software tradeoffs for increased performance. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.) pp. 2–11.

  • Hennessy, J. Jouppi, N., Przybylski, S., Rowen, C., Gross, T., Baskett, F., and Gill, J. 1982. MIPS: A microprocessor architecture. InProc., 15th Annual Workshop on Microprogramming (Palo Alto, Calif., Oct.), pp. 17–22.

  • Hintz, R.G., and Tate, D.P. 1972. Control Data STAR-100 processor design. InProc., COMPCON '72 (Sept.), pp. 1–4.

  • Hsu, P.Y.T. 1986. Highly concurrent scalar processing. Ph.D. thesis, Univ. of Ill., Urbana-Champaign, Ill.

    Google Scholar 

  • Hsu, P.Y.T., and Davidson, E.S. 1986. Highly concurrent scalar processing. InProc., Thirteenth Annual Internat. Symp. on Computer Architecture, pp. 386–395.

  • Hsu, W.-C. 1987. Register allocation and code scheduling for load/store architectures. Comp. Sci. Tech. Rept. no. 722, Univ. of Wisc., Madison.

    Google Scholar 

  • Hu, T.C. 1961. Parallel sequencing and assembly line problems.Operations Research, 9, 6: 841–848.

    Google Scholar 

  • Hwu, W.W., and Chang, P.P. 1988. Exploiting parallel microprocessor microarchitectures with a compiler code generator. In Proc.,15th Annual Internat. Symp. on Computer Architecture (Honolulu, May), pp. 45–53.

  • Hwu, W.W., and Patt, Y.N. 1986. HPSm, a high performance restricted data flow architecture having minimal functionality. InProc., 13th Annual Internat. Symp. on Computer Architecture (Tokyo, June), pp. 297–306.

  • Hwu, W.W., and Patt, Y.N. 1987. Checkpoint repair for out-of-order execution machines.IEEE Trans. Comps., C-36, 12 (Dec): 1496–1514.

    Google Scholar 

  • Hwu, W.W., Conte, T.M., and Chang, P.P. 1989. Comparing software and hardware schemes for reducing the cost of branches. InProc., 16th Annual Internat. Symp. on Computer Architecture (May), pp. 224–233.

  • Hwu, W.W., Mahlke, S.A., Chen, W.Y., Chang, P.P., Waiter, N.J., Bringmann, R.A., Ouellette, R.G., Hank, R.E., Kiyohara, T., Haab, G.E., Holm, J.G., and Lavery, D.M. 1993. The superblock: An effective technique for VLIW and superscalar compilation.The J. Supercomputing, 7, 1/2: 229–248.

    Google Scholar 

  • IBM. 1967.IBM J. Res. and Dev., 11, 1 (Jan.). Special issue on the System/360 Model 91.

  • IBM. 1976.IBM 3838 Array Processor Functional Characteristics. Pub. no. 6A24-3639-0, file no. S370-08, IBM Corp., Endicott, N.Y.

    Google Scholar 

  • IBM. 1990.IBM J. Res. and Dev., 34, 1 (Jan.). Special issue on the IBM RISC System/6000 processor.

  • Intel. 1989a.i860 64-Bit Microprocessor Programmer's Reference Manual. Pub. no. 240329-001, Intel Corp., Santa Clara, Calif.

    Google Scholar 

  • Intel. 1989b.80960CA User's Manual. Pub. no. 270710-001, Intel Corp., Santa Clara, Calif.

    Google Scholar 

  • Jain, S. 1991. Circular scheduling: A new technique to perform software pipelining. InProc., ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation (June), pp. 219–228.

  • Johnson, M. 1991.Superscalar Microprocessor Design. Prentice-Hall, Englewood Cliffs, N.J.

    Google Scholar 

  • Jouppi, N.P. 1989. The nonuniform distribution of instruction-level and machine parallelism and its effect on performance.IEEE Trans. Comps., C-38, 12 (Dec): 1645–1658.

    Google Scholar 

  • Jouppi, N.P., and Wall, D. 1989. Available instruction level parallelism for superscalar and superpipelined machines. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 272–282.

  • Kasahara, H., and Narita, S. 1984. Practical multiprocessor scheduling algorithms for efficient parallel processing.IEEE Trans. Comps., C-33, 11 (Nov.): 1023–1029.

    Google Scholar 

  • Keller, R.M. 1975. Look-ahead processors.Computing Surveys 7, 4 (Dec): 177–196.

    Google Scholar 

  • Kleir, R.L. 1974. A representation for the analysis of microprogram operation. InProc., 7th Annual Workshop on Microprogramming (Sept.), pp. 107–118.

  • Kleir, R.L., and Ramamoorthy, C.V. 1971. Optimization strategies for microprograms.IEEE Trans. Comps., C-20, 7 (July): 783–794.

    Google Scholar 

  • Kogge, P.M. 1973. Maximal rate pipelined solutions to recurrence programs. InProc., First Annual Symp. on Computer Architecture (Univ. of Fla., Gainesville, Dec), pp. 71–76.

    Google Scholar 

  • Kogge, P.M. 1974. Parallel solution of recurrence problems.IBM J. Res. and Dev., 18, 2 (Mar.): 138–148.

    Google Scholar 

  • Kogge, P.M. 1977a. Algorithm development for pipelined processors. InProc., 1977 Internat. Conf. on Parallel Processing (Aug.), p. 217.

  • Kogge, P.M. 1977b. The microprogramming of pipelined processors. InProc., 4th Annual Symp. on Computer Architecture (Mar.), pp. 63–69.

  • Kogge, P.M. 1981.The Architecture of Pipelined Computers. McGraw-Hill, New York.

    Google Scholar 

  • Kogge, P.M., and Stone, H.S. 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Trans. Comps., C-22, 8 (Aug.): 786–793.

    Google Scholar 

  • Kohler, W.H. 1975. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems.IEEE Trans. Comps., C-24, 12 (Dec): 1235–1238.

    Google Scholar 

  • Kohn, L., and Margulis, N. 1989. Introducing the Intel i860 64-bit microprocessor.IEEE Micro, 9, 4 (Aug.): 15–30.

    Google Scholar 

  • Kunkel, S.R., and Smith, J.E. 1986. Optimal pipelining in supercomputers. InProc., 13th Annual Internat. Symp. on Computer Architecture (Tokyo, June), pp. 404–411.

  • Labrousse, J., and Slavenburg, G.A. 1988. CREATE-LIFE: A design system for high performance VLSI circuits. InProc., Internat. Conf. on Circuits and Devices, pp. 365–360.

  • Labrousse, J., and Slavenburg, G.A. 1990a. A 50 MHz microprocessor with a VLIW architecture. InProc., ISSCC '90 (San Francisco), pp. 44–45.

  • Labrousse, J., and Slavenburg, G.A. 1990b. CREATE-LIFE: A modular design approach for high performance ASICs. InProc., COMPCON '90 (San Francisco), pp. 427–433.

  • Lam, M.S.-L. 1987. A systolic array optimizing compiler. Ph.D. thesis, Carnegie Mellon Univ., Pittsburgh.

    Google Scholar 

  • Lam. M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. InProc., ACM SIGPLAN '88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 318–327.

  • Lam, M.S., and Wilson, R.P. 1992. Limits of control flow on parallelism. InProc., Nineteenth Internat. Symp. on Computer Architecture (Gold Coast, Australia, May), pp. 46–57.

  • Landskov, D., Davidson, S., Shriver, B., and Mallett, P.W. 1980. Local microcode compaction techniques.ACM Computer Surveys, 12, 3 (Sept.): 261–294.

    Google Scholar 

  • Lee, J.K.F., and Smith, A.J. 1984. Branch prediction strategies and branch target buffer design.Computer, 17, 1 (Jan.): 6–22.

    Google Scholar 

  • Lee, M., Tirumalai, P.P., and Ngai, T.-F. 1993. Software pipelining and superblock scheduling: Compilation techniques for VLIW machines. InProc., 26th Annual Hawaii Internat. Conf. on System Sciences (Hawaii, Jan.), vol. 1, pp. 202–213.

    Google Scholar 

  • Linn, J.L. 1988. Horizontal microcode compaction. InMicroprogramming and Firmware Engineering Methods (S. Habib, ed.), Van Nostrand Reinhold, New York, pp. 381–431.

    Google Scholar 

  • Lowney, P.G., Freudenberger, S.M., Karzes, T.J., Lichtenstein, W.D., Nix, R.P., O'Donnell, J.S., and Ruttenburg, J.C. 1993. The Multiflow trace scheduling compiler.The J. Supercomputing, 7, 1/2: 51–142.

    Google Scholar 

  • Mahlke, S.A., Chen, W.Y., Hwu, W.W., Rau, B.R., and Schlansker, M.S. 1992. Sentinel scheduling for VLIW and superscalar processors. InProc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 238–247.

  • Mahlke, S.A., Lin, D.C., Chen, W.Y., Hank, R.E., and Bringmann, R.A. 1992. Effective compiler support for predicated execution using the hyperblock. InProc., 25th Annual Internat. Symp. on Microarchitecture (Dec), pp. 45–54.

  • Mallett, P.W. 1978. Methods of compacting microprograms. Ph.D. thesis, Univ. of Southwestern La., Lafayette, La.

    Google Scholar 

  • Mangione-Smith, W., Abraham, S.G., and Davidson, E.S. 1992. Register requirements of pipelined processors. InProc., Internat. Conf. on Supercomputing (Washington, D.C., July).

  • McFarling, S., and Hennessy, J. 1986. Reducing the cost of branches. InProc., Thirteenth Internat. Symp. on Computer Architecture (Tokyo, June), pp. 396–403.

  • Moon, S.-M., Ebcioglu, K. 1992. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. InProc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec), pp. 55–71.

  • Nakatani, T., and Ebcioglu, K. 1990. Using a lookahead window in a compaction-based parallelizing compiler. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Fla., Nov.), pp. 57–68.

  • Nicolau, A. 1984. Parallelism, memory anti-aliasing and correctness for trace scheduling compilers. Ph.D. thesis, Yale Univ., New Haven, Conn.

    Google Scholar 

  • Nicolau, A. 1985a. Percolation scheduling: A parallel compilation technique. Tech. Rept. TR 85-678, Dept. of Comp. Sci., Cornell, Ithaca, N.Y.

    Google Scholar 

  • Nicolau, A. 1985b. Uniform parallelism exploitation in ordinary programs. InProc., Internat. Conf. on Parallel Processing (Aug.), pp. 614–618.

  • Nicolau, A., and Fisher, J.A. 1981. Using an oracle to measure parallelism in single instruction stream programs. InProc., Fourteenth Annual Microprogramming Workshop (Oct.), pp. 171–182.

  • Nicolau, A., and Fisher, J.A. 1984. Measuring the parallelism available for very long instruction word architectures.IEEE Trans. Comps., C-33, 11 (Nov.): 968–976.

    Google Scholar 

  • Nicolau, A., and Potasman, R. 1990. Realistic scheduling: Compaction for pipelined architectures. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Fla., Nov.), pp. 69–79.

  • Oehler, R.R., and Blasgen, M.W. 1991. IBM RISC System/6000: Architecture and performance.IEEE Micro, 11, 3 (June): 14.

    Google Scholar 

  • Papadopoulos, G.M., and Culler, D.E. 1990. Monsoon: An explicit token store architecture. InProc., Seventeenth Internat. Symp. on Computer Architecture (Seattle, May), pp. 82–91.

  • Park, J.C.H., and Schlansker, M.S. 1991. On predicated execution. Tech. Rept. HPL-91-58, Hewlett Packard Laboratories.

  • Patel, J.H. 1976. Improving the throughput of pipelines with delays and buffers. Ph.D. thesis, Univ. of Ill., Urbana-Champaign, Ill.

    Google Scholar 

  • Patel, J.H., and Davidson, E.S. 1976. Improving the throughput of a pipeline by insertion of delays. InProc., 3rd Annual Symp. on Computer Architecture (Jan.), pp. 159–164.

  • Patterson, D.A., and Sequin, C.H. 1981. RISC I: A reduced instruction set VLSI computer. InProc., 8th Annual Symp. on Computer Architecture (Minneapolis, May), pp. 443–450.

  • Peterson, C., Sutton, J., and Wiley, P., 1991. iWarp: A 100-MOPS, LIW microprocessor for multicomputers.IEEE Micro, 11, 3 (June): 26.

    Google Scholar 

  • Popescu, V., Schultz, M., Spracklen, J., Gibson, G., Lightner, B., and Isaman, D. 1991. The Metaflow architecture.IEEE Micro, 11, 3 (June): 10.

    Google Scholar 

  • Radin, G. 1982. The 801 minicomputer. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 39–47.

  • Ramakrishnan, S. 1992. Software pipelining in PA-RISC compilers.Hewlett-Packard J. (July): 39–45.

  • Ramamoorthy, C.V., and Gonzalez, M.J. 1969. A survey of techniques for recognizing parallel processable streams in computer programs. InProc., AFIPS Fall Joint Computing Conf., pp. 1–15.

  • Ramamoorthy, C.V., and Tsuchiya, M. 1974. A high level language for horizontal microprogramming.IEEE Trans. Comps., C-23: 791–802.

    Google Scholar 

  • Ramamoorthy, C.V, Chandy, K.M., and Gonzalez, M.J. 1972. Optimal scheduling strategies in a multiprocessor system.IEEE Trans. Comps., C-21, 2 (Feb.): 137–146.

    Google Scholar 

  • Rau, B.R. 1988. Cydra 5 Directed Dataflow architecture. InProc., COMPCON '88 (San Francisco, Mar.), pp. 106–113.

  • Rau, B.R. 1992. Data flow and dependence analysis for instruction level parallelism. InFourth Internat. Workshop on Languages and Compilers for Parallel Computing (U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds.), Springer-Verlag, pp. 236–250.

  • Rau, B.R., and Glaeser, CD. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. InProc., Fourteenth Annual Workshop on Microprogramming (Oct.), pp. 183–198.

  • Rau, B.R., Glaeser, C.D., and Greenawalt, E.M. 1982. Architectural support for the efficient generation of code for horizontal architectures. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 96–99.

  • Rau, B.R., Glaeser, CD., and Picard, R.L. 1982. Efficient code generation for horizontal architectures: Compiler techniques and architectural support. InProc., Ninth Annual Internat. Symp. on Computer Architecture (Apr.), pp. 131–139.

  • Rau, B.R., Lee, M., Tirumalai, P., and Schlansker, M.S. 1992. Register allocation for software pipelined loops. InProc., SIGPLAN '92 Conf. on Programming Language Design and Implementation (San Francisco, June 17–19), pp. 283–299.

  • Rau, B.R., Yen, D.W.L., Yen, W., and Towle, R.A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions and trade-offs.Computer, 22, 1 (Jan.): 12–34.

    Google Scholar 

  • Riseman, E.M., and Foster, C.C. 1972. The inhibition of potential parallelism by conditional jumps.IEEE Trans. Comps., C-21, 12 (Dec): 1405–1411.

    Google Scholar 

  • Ruggiero, J.F., and Coryell, D. A. 1969. An auxiliary processing system for array calculations.IBM Systems J., 8, 2: 118–135.

  • Russell, R.M. 1978. The CRAY-1 computer system. CACM, 21: 63–72.

  • Rymarczyk, J. 1982. Coding guidelines for pipelined processors. In Proc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 12–19.

  • Schmidt, U., and Caesar, K. 1991. Datawave: A single-chip multiprocessor for video applications. IEEE Micro, 11, 3 (June): 22.

  • Schneck, P.B. 1987. Supercomputer Architecture. Kluwer Academic, Norwell, Mass.

  • Schuette, M.A., and Shen, J.P. 1993. Instruction-level experimental evaluation of the Multiflow TRACE 14/300 VLIW computer. The J. Supercomputing, 7, 1/2: 249–271.

  • Sethi, R. 1975. Complete register allocation problems. SIAM J. Computing, 4, 3: 226–248.

  • Sethi, R., and Ullman, J.D. 1970. The generation of optimal code for arithmetic expressions. JACM, 17, 4 (Oct.): 715–728.

  • Sites, R.L. 1978. Instruction ordering for the CRAY-1 computer. Tech. rept. 78-CS-023, Univ. of Calif., San Diego.

  • Smith, J.E. 1981. A study of branch prediction strategies. In Proc., Eighth Annual Internat. Symp. on Computer Architecture (May), pp. 135–148.

  • Smith, J.E. 1982. Decoupled access/execute architectures. In Proc., Ninth Annual Internat. Symp. on Computer Architecture (Apr.), pp. 112–119.

  • Smith, J.E. 1989. Dynamic instruction scheduling and the Astronautics ZS-1. Computer, 22, 1 (Jan.): 21–35.

  • Smith, J.E., and Pleszkun, A.R. 1988. Implementing precise interrupts in pipelined processors. IEEE Trans. Comps., C-37, 5 (May): 562–573.

  • Smith, J.E., Dermer, G.E., Vanderwarn, B.D., Klinger, S.D., Roszewski, C.M., Fowler, D.L., Scidmore, K.R., and Laudon, J.P. 1987. The ZS-1 central processor. In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 199–204.

  • Smith, M.D., Horowitz, M., and Lam, M. 1992. Efficient superscalar performance through boosting. In Proc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 248–259.

  • Smith, M.D., Lam, M.S., and Horowitz, M.A. 1990. Boosting beyond static scheduling in a superscalar processor. In Proc., Seventeenth Internat. Symp. on Computer Architecture (June), pp. 344–354.

  • Smotherman, M., Krishnamurthy, S., Aravind, P.S., and Hunnicutt, D. 1991. Efficient DAG construction and heuristic calculation for instruction scheduling. In Proc., 24th Annual Internat. Workshop on Microarchitecture (Albuquerque, N.M., Nov.), pp. 93–102.

  • Sohi, G.S., and Vajapeyam, S. 1987. Instruction issue logic for high-performance, interruptable pipelined processors. In Proc., 14th Annual Symp. on Computer Architecture (Pittsburgh, June), pp. 27–36.

  • Su, B., and Ding, S. 1985. Some experiments in global microcode compaction. In Proc., 18th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 175–180.

  • Su, B., and Wang, J. 1991a. GURPR*: A new global software pipelining algorithm. In Proc., 24th Annual Internat. Symp. on Microarchitecture (Albuquerque, N.M., Nov.), pp. 212–216.

  • Su, B., and Wang, J. 1991b. Loop-carried dependence and the general URPR software pipelining approach. In Proc., 24th Annual Hawaii Internat. Conf. on System Sciences (Hawaii, Jan.).

  • Su, B., Ding, S., and Jin, L. 1984. An improvement of trace scheduling for global microcode compaction. In Proc., 17th Annual Workshop on Microprogramming (New Orleans, Oct.), pp. 78–85.

  • Su, B., Ding, S., and Xia, J. 1986. URPR—An extension of URCR for software pipelining. In Proc., 19th Annual Workshop on Microprogramming (New York, Oct.), pp. 104–108.

  • Su, B., Ding, S., Wang, J., and Xia, J. 1987. GURPR—A method for global software pipelining. In Proc., 20th Annual Workshop on Microprogramming (Colorado Springs, Colo., Dec.), pp. 88–96.

  • Thistle, M.R., and Smith, B.J. 1988. A processor architecture for Horizon. In Proc., Supercomputing '88 (Orlando, Fla., Nov.), pp. 35–41.

  • Thomas, A.T., and Davidson, E.S. 1974. Scheduling of multiconfigurable pipelines. In Proc., 12th Annual Allerton Conf. on Circuits and Systems Theory (Allerton, Ill.), pp. 658–669.

  • Thornton, J.E. 1964. Parallel operation in the Control Data 6600. In Proc., AFIPS Fall Joint Computer Conf., pp. 33–40.

  • Thornton, J.E. 1970. Design of a Computer—The Control Data 6600. Scott, Foresman, Glenview, Ill.

  • Tirumalai, P., Lee, M., and Schlansker, M.S. 1990. Parallelization of loops with exits on pipelined architectures. In Proc., Supercomputing '90 (Nov.), pp. 200–212.

  • Tjaden, G.S., and Flynn, M.J. 1970. Detection and parallel execution of parallel instructions. IEEE Trans. Comps., C-19, 10 (Oct.): 889–895.

  • Tjaden, G.S., and Flynn, M.J. 1973. Representation of concurrency with ordering matrices. IEEE Trans. Comps., C-22, 8 (Aug.): 752–761.

  • Tokoro, M., Tamura, E., and Takizuka, T. 1981. Optimization of microprograms. IEEE Trans. Comps., C-30, 7 (July): 491–504.

  • Tokoro, M., Takizuka, T., Tamura, E., and Yamaura, I. 1978. A technique of global optimization of microprograms. In Proc., 11th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 41–50.

  • Tokoro, M., Tamura, E., Takase, K., and Tamaru, K. 1977. An approach to microprogram optimization considering resource occupancy and instruction formats. In Proc., 10th Annual Workshop on Microprogramming (Niagara Falls, N.Y., Nov.), pp. 92–108.

  • Tomasulo, R.M. 1967. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. and Dev., 11, 1 (Jan.): 25–33.

  • Touzeau, R.F. 1984. A FORTRAN compiler for the FPS-164 scientific computer. In Proc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal), pp. 48–57.

  • Tsuchiya, M., and Gonzalez, M.J. 1974. An approach to optimization of horizontal microprograms. In Proc., Seventh Annual Workshop on Microprogramming (Palo Alto, Calif.), pp. 85–90.

  • Tsuchiya, M., and Gonzalez, M.J. 1976. Toward optimization of horizontal microprograms. IEEE Trans. Comps., C-25, 10 (Oct.): 992–999.

  • Uht, A.K. 1986. An efficient hardware algorithm to extract concurrency from general-purpose code. In Proc., Nineteenth Annual Hawaii Conf. on System Sciences (Jan.), pp. 41–50.

  • Wall, D.W. 1991. Limits of instruction-level parallelism. In Proc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 176–188.

  • Warren, H.S. 1990. Instruction scheduling for the IBM RISC System/6000 processor. IBM J. Res. and Dev., 34, 1 (Jan.): 85–92.

  • Warter, N.J., Bockhaus, J.W., Haab, G.E., and Subramanian, K. 1992. Enhanced modulo scheduling for loops with conditional branches. In Proc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec.), pp. 170–179.

  • Watson, W.J. 1972. The TI ASC—A highly modular and flexible super computer architecture. In Proc., AFIPS Fall Joint Computer Conf., pp. 221–228.

  • Wedig, R.G. 1982. Detection of concurrency in directly executed language instruction streams. Ph.D. thesis, Stanford Univ., Stanford, Calif.

  • Weiss, S., and Smith, J.E. 1984. Instruction issue logic for pipelined supercomputers. In Proc., 11th Annual Internat. Symp. on Computer Architecture, pp. 110–118.

  • Weiss, S., and Smith, J.E. 1987. A study of scalar compilation techniques for pipelined supercomputers. In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 105–109.

  • Wilkes, M.V. 1951. The best way to design an automatic calculating machine. In Proc., Manchester Univ. Comp. Inaugural Conf. (Manchester, England, July), pp. 16–18.

  • Wilkes, M.V., and Stringer, J.B. 1953. Microprogramming and the design of the control circuits in an electronic digital computer. In Proc., The Cambridge Philosophical Society, Part 2 (Apr.), pp. 230–238.

  • Wolfe, A., and Shen, J.P. 1991. A variable instruction stream extension to the VLIW architecture. In Proc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 2–14.

  • Wood, G. 1978. On the packing of micro-operations into micro-instruction words. In Proc., 11th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 51–55.

  • Wood, G. 1979. Global optimization of microprograms through modular control constructs. In Proc., 12th Annual Workshop on Microprogramming (Hershey, Penn.), pp. 1–6.

  • Yau, S.S., Schowe, A.C., and Tsuchiya, M. 1974. On storage optimization of horizontal microprograms. In Proc., Seventh Annual Workshop on Microprogramming (Palo Alto, Calif.), pp. 98–106.

  • Yeh, T.Y., and Patt, Y.N. 1992. Alternative implementations of two-level adaptive branch prediction. In Proc., Nineteenth Internat. Symp. on Comp. Architecture (Gold Coast, Australia, May), pp. 124–134.

  • Zima, H., and Chapman, B. 1990. Supercompilers for Parallel and Vector Computers. Addison-Wesley, Reading, Mass.

Cite this article

Rau, B.R., Fisher, J.A. Instruction-level parallel processing: History, overview, and perspective. J Supercomput 7, 9–50 (1993). https://doi.org/10.1007/BF01205181

  • DOI: https://doi.org/10.1007/BF01205181
