Abstract
Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest-performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
References
Acosta, R.D., Kjelstrup, J., and Torng, H.C. 1986. An instruction issuing approach to enhancing performance in multiple function unit processors.IEEE Trans. Comps., C-35, 9 (Sept.): 815–828.
Adam, T.L., Chandy, K.M., and Dickson, J.R. 1974. A comparison of list schedules for parallel processing systems.CACM, 17, 12 (Dec.): 685–690.
Advanced Micro Devices. 1989.Am29000 Users Manual. Pub. no. 10620B, Advanced Micro Devices, Sunnyvale, Calif.
Agerwala, T. 1976. Microprogram optimization: A survey.IEEE Trans. Comps., C-25, 10 (Oct.): 962–973.
Agerwala, T., and Cocke, J. 1987. High performance reduced instruction set processors. Tech. rept. RC12434 (#55845), IBM Thomas J. Watson Research Center, Yorktown Heights, N.Y.
Aho, A.V., and Johnson, S.C. 1976. Optimal code generation for expression trees.JACM, 23 3 (July): 488–501.
Aho, A.V., Johnson, S.C., and Ullman, J.D. 1977a. Code generation for expressions with common subexpressions.JACM, 24, 1 (Jan.): 146–160.
Aho, A.V., Johnson, S.C., and Ullman, J.D. 1977b. Code generation for machines with multiregister operations. InProc., Fourth ACM Symp. on Principles of Programming Languages, pp. 21–28.
Aiken, A., and Nicolau, A. 1988a. Optimal loop parallelization. InProc., SIGPLAN'88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 308–317.
Aiken, A., and Nicolau, A. 1988b. Perfect pipelining: A new loop parallelization technique. InProc., 1988 European Symp. on Programming, Springer Verlag, New York, pp. 221–235.
Aiken, A., and Nicolau, A. 1991. A realistic resource-constrained software pipelining algorithm. InAdvances in Languages and Compilers for Parallel Processing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, London, pp. 274–290.
Allen, J.R., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. InProc., Tenth Annual ACM Symp. on Principles of Programming Languages (Jan.): pp. 177–189.
Anderson, D.W., Sparacio, F.J., and Tomasulo, R.M. 1967. The System/360 Model 91: Machine philosophy and instruction handling.IBM J. Res. and Dev., 11, 1 (Jan.): 8–24.
Apollo Computer. 1988.The Series 10000 Personal Supercomputer: Inside a New Architecture. Publication no. 002402-007 2-88, Apollo Computer, Inc., Chelmsford, Mass.
Arvind and Gostelow, K. 1982. The U-interpreter.Computer, 15, 2 (Feb.): 12–49.
Arvind and Kathail, V. 1981. A multiple processor dataflow machine that supports generalised procedures. InProc., Eighth Annual Symp. on Computer Architecture (May): pp. 291–302.
Auslander, M., and Hopkins, M. 1982. An overview of the PL.8 compiler. InProc., ACM SIGPLAN Symp. on Compiler Construction (Boston, June), pp. 22–31.
Bahr, R., Ciavaglia, S., Flahive, B., Kline, M., Mageau, P., and Nickel, D. 1991. The DN10000TX: A new high-performance PRISM processor. InProc., COMPCON '91, pp. 90–95.
Baker, K.R. 1974.Introduction to Sequencing and Scheduling. John Wiley, New York.
Beck, G.R., Yen, D.W.L., and Anderson, T.L. 1993. The Cydra 5 minisupercomputer: Architecture and implementation.The J. Supercomputing, 7, 1/2: 143–180.
Bell, C.G., and Newell, A. 1971.Computer Structures: Readings and Examples. McGraw-Hill, New York.
Bernstein, D., and Rodeh, M. 1991. Global instruction scheduling for superscalar machines. InProc., SIGPLAN '91 Conf. on Programming Language Design and Implementation (June), pp. 241–255.
Bernstein, D., Cohen, D., and Krawczyk, H. 1991. Code duplication: An assist for global instruction scheduling. InProc., 24th Annual Internat. Symp. on Microarchitecture (Albuquerque, N.Mex.), pp. 103–113.
Blanck, G., and Krueger, S. 1992. The SuperSPARCTM microprocessor. InProc., COMPCON '92, pp. 136–141.
Bloch, E. 1959. The engineering design of the STRETCH computer. InProc., Eastern Joint Computer Conf, pp. 48–59.
Bruno, J.L., and Sethi, R. 1976. Code generation for a one-register machine.JACM, 23, 3 (July): 502–510.
Buchholz, W., ed. 1962.Planning a Computer System: Project Stretch. McGraw-Hill, New York.
Butler, M., Yeh, T., Patt., Y., Alsup, M., Scales, H., and Shebanow, M. 1991. Single instruction stream parallelism is greater than two. InProc., Eighteenth Annual Internat. Symp. on Computer Architecture (Toronto), pp. 276–286.
Callahan, D., and Koblenz, B. 1991. Register allocation via hierarchical graph coloring. InProc., SIGPLAN '91 Conf. on Programming Language Design and Implementation (Toronto, June), pp. 192–203.
Callahan, D., Carr, S., and Kennedy, K. 1990. Improving register allocation for subscripted variables. InProc., ACM SIGPLAN '90 Conf. on Programming Language Design and Implementation, (White Plains, N.Y, June), pp. 53–65.
Carpenter, B.E., and Doran, R.W., eds. 1986.A.M. Turing's ACE Report of 1946 and Other Papers. MIT Press, Cambridge, Mass.
Chaitin, G.J. 1982. Register allocation and spilling via graph coloring. InProc., ACM SIGPLAN Symp. on Compiler Construction (Boston, June), pp. 98–105.
Chang, P.P., and Hwu, W.W. 1988. Trace selection for compiling large C application programs to microcode. InProc., 21st Annual Workshop on Microprogramming and Microarchitectures (San Diego, Nov.), pp. 21–29.
Chang, P.P., and Hwu, W.W. 1992. Profile-guided automatic inline expansion for C programs.Software—Practice and Experience, 22, 5 (May): 349–376.
Chang, P.P., Lavery, D.M., and Hwu, W.W. 1991. The importance of prepass code scheduling for superscalar and superpipelined processors. Tech. Rept. no. CRHC-91-18, Center for Reliable and High-Performance Computing, Univ. of Ill, Urbana-Champaign, Ill.
Chang, P.P., Mahlke, S.A., Chen, W.Y., Warter, N.J., and Hwu, W.W. 1991. IMPACT: An architectural framework for multiple-instruction-issue processors. InProc., 18th Annual Internat. Symp. on Computer Architecture (Toronto, May), pp. 266–275.
Charlesworth, A.E. 1981. An approach to scientific array processing: The architectural design of the AP-120B/ FPS-164 family.Computer, 14, 9: 18–27.
Chen, T.C. 1971. Parallelism, pipelining, and computer efficiency.Computer Design, 10, 1 (Jan.): 69–74,
Chen, T.C. 1975. Overlap and pipeline processing. InIntroduction to Computer Architecture (H.S. Stone, ed.), Science Research Associates, Chicago, pp. 375–431.
Chow, F., and Hennessy, J. 1984. Register allocation by priority-based coloring. InProc., ACM SIGPLAN Symp. on Compiler Construction (Montreal, June), pp. 222–232.
Chow, F.C., and Hennessy, J.L. 1990. The priority-based coloring approach to register allocation.ACM Trans. Programming Languages and Systems, 12 (Oct.): 501–536.
Coffman, J.R., ed. 1976.Computer and Job-Shop Scheduling Theory. John Wiley, New York.
Coffman, E.G., and Graham, R.L. 1972. Optimal scheduling for two processor systems.Acta Informatica, 1, 3: 200–213.
Cohen, D. 1978. A methodology for programming a pipeline array processor. InProc., 11th Annual Microprogramming Workshop (Asilomar, Calif., Nov.), pp. 82–89.
Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., and Rodman, P.K. 1988. A VLIW architecture for a trace scheduling compiler.IEEE Trans. Comps., C-37, 8 (Aug.): 967–979.
Colwell, R.P., Hall, W.E., Joshi, C.S., Papworth, D.B., Rodman, P.K., and Tornes, J.E. 1990. Architecture and implementation of a VLIW supercomputer. InProc., Supercomputing '90 (Nov.), pp. 910–919.
Cotten, L.W. 1965. Circuit implementation of high-speed pipeline systems. InProc., AFIPS Fall Joint Computing Conf., pp. 489–504.
Cotten, L.W. 1969. Maximum-rate pipeline systems. InProc., AFIPS Spring Joint Computing Conf., 581–586.
Danelutto, M., and Vanneschi, M. 1990. VLIW in-the-large: A model for fine grain parallelism exploitation of distributed memory multiprocessors. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Nov.), pp. 7–16.
Dasgupta, S., and Tartar, J. 1976. The identification of maximal parallelism in straight-line microprograms.IEEE Trans. Comps., C-25, 10 (Oct.): 986–991.
Davidson, E.S. 1971. The design and control of pipelined function generators. InProc., 1971 Internat. IEEE Conf. on Systems, Networks, and Computers (Oaxtepec, Mexico, Jan.), pp. 19–21.
Davidson, E.S. 1974. Scheduling for pipelined processors. InProc., 7th Hawaii Conf. on Systems Sciences, pp. 58–60.
Davidson, S., Landskov, D., Shriver, B.D., and Mallett, P.W. 1981. Some experiments in local microcode compaction for horizontal machines.IEEE Trans. Comps., C-30, 7: 460–477.
Davidson, E.S., Shar, L.E., Thomas, A.T., and Patel, J.H. 1975. Effective control for pipelined computers. InProc., COMPCON '90 (San Francisco, Feb.), pp. 181–184.
Dehnert, J.C., and Towle, R.A. 1993. Compiling for the Cydra 5.The J. Supercomputing, 7, 1/2: 181–227.
Dehnert, J.C., Hsu, P.Y.-T., and Bratt, J.P. 1989. Overlapped loop support in the Cydra 5. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 26–38.
DeLano, E., Walker, W., Yetter, J., and Forsyth, M. 1992. A high speed superscalar PA-RISC processor.In Proc., COMPCON '92 (Feb.), pp. 116–121.
DeWitt, DJ. 1975. A control word model for detecting conflicts between microprograms. InProc., 8th Annual Workshop on Microprogramming (Chicago, Sept.), pp. 6–12.
Diefendorff, K., and Allen, M. 1992. Organization of the Motorola 88110 superscalar RISC microprocessor.IEEE Micro, 12, 2 (Apr.): 40–63.
Dongarra, J.J. 1986, A survey of high performance computers. InProc., COMPCON '86 (Mar.), pp. 8–11.
Dwyer, H., and Torng, H.C. 1992. An out-of-order superscalar processor with speculative execution and fast, precise interrupts. InProc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec.), pp. 272–281.
Ebcioglu, K. 1988. Some design ideas for a VLIW architecture for sequential-natured software. InParallel Processing (Proc., IFIP WG 10.3 Working Conf. on Parallel Processing, Pisa, Italy) (M. Cosnard, M.H. Barton, and M. Vanneschi, eds.), North-Holland, pp. 3–21.
Ebcioglu, K., and Nakatani, T. 1989. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. InLanguages and Compilers for Parallel Computing (D. Gelernter, A. Nicolau, and D. Padua, eds.), Pitman/MIT Press, London, pp. 213–229.
Ebcioglu, K., and Nicolau, A. 1989. Aglobal resource-constrained parallelization technique. InProc., 3rd Internat. Conf. on Supercomputing (Crete, Greece, June), pp. 154–163.
Eckert, J.P., Chu, J.C., Tonik, A.B., and Schmitt, W.F. 1959. Design of UNIVAC-LARC System: I. InProc., Eastern Joint Computer Conf., pp. 59–65.
Ellis, J.R. 1986.Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, Mass.
Fawcett, B.K. 1975. Maximal clocking rates for pipelined digital systems. M.S. thesis, Univ. of Ill., Urbana-Champaign, Ill.
Fernandez, E.B., and Bussel, B. 1973. Bounds on the number of processors and time for multiprocessor optimal schedule.IEEE Trans. Comps., C-22, 8 (Aug.): 745–751.
Fisher, J.A. 1979. The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources, Ph.D. thesis, New York Univ., New York.
Fisher, J.A. 1980. 2N-way jump microinstruction hardware and an effective instruction binding method. InProc., 13th Annual Workshop on Microprogramming (Colorado Springs, Colo., Nov.), pp. 64–75.
Fisher, J.A. 1981. Trace scheduling: A technique for global microcode compaction.IEEE Trans. Comps., C-30, 7 (July): 478–490.
Fisher, J.A. 1983. Very long instruction word architectures and the ELI-512. InProc., Tenth Annual Internat. Symp. on Computer Architecture (Stockholm, June), pp. 140–150.
Fisher, J.A. 1992. Trace Scheduling-2, an extension of trace scheduling. Tech. rept., Hewlett-Packard Laboratories.
Fisher, J.A., and Freudenberger, S.M. 1992. Predicting conditional jump directions from previous runs of a program. InProc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 85–95.
Fisher, J.A., Landskov, D., and Shriver, B.D. 1981. Microcode compaction: Looking backward and looking forward. InProc., 1981 Nat. Computer Conf., pp. 95–102.
Fisher, J.A., Ellis, J.R., Ruttenberg, J.C., and Nicolau, A. 1984. Parallel processing: A smart compiler and a dumb machine. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 37–47.
Floating Point Systems. 1979.FPS AP-120B Processor Handbook. Floating Point Systems, Inc., Beaverton, Ore.
Foster, C.C., and Riseman, E.M. 1972. Percolation of code to enhance parallel dispatching and execution.IEEE Trans. Comps., C-21, 12 (Dec): 1411–1415.
Franklin, M., and Sohi, G.S. 1992. The expandable split window paradigm for exploiting fine-grain parallelism. InProc. 19th Annual International Symp. on Computer Architecture (Gold Coast, Australia, May), pp. 58–67.
Freudenberger, S.M., and Ruttenberg, J.C. 1992. Phase ordering of register allocation and instruction scheduling. InCode Generation—Concepts, Tools, Techniques: Proc., Internat. Workshop on Code Generation, May 1991 (R. Giegerich, and S.L. Graham, eds.), Springer-Verlag, London, pp. 146–172.
Gasperoni, F. 1989. Compilation techniques for VLIW architectures. Tech. rept. RC 14915, IBM Research Div., T.J. Watson Research Center, Yorktown Heights, N.Y.
Gibbons, P.B., and Muchnick, S.S. 1986. Efficient instruction scheduling for a pipelined architecture. InProc., ACM SIGPLAN '86 Symp. on Compiler Construction (Palo Alto, Calif., July), pp. 11–16.
Golumbic, M.C., and Rainish, V. 1990. Instruction schedulig beyond basic blocks.IBM J. Res. and Dev., 34, 1 (Jan.): 93–97.
Gonzalez, M.J. 1977. Deterministic processor scheduling.ACM Computer Surveys, 9, 3 (Sept.): 173–204.
Goodman, J.R., and Hsu, W.-C. 1988. Code scheduling and register allocation in large basic blocks. InProc., 1988 Internat. Conf. on Supercomputing (St. Malo, France, July), pp. 442–452.
Grishman, R., and Su, B. 1983. A preliminary evaluation of trace scheduling for global microcode compaction.IEEE Trans. Comps., C-32, 12 (Dec): 1191–1194.
Gross, T.R., and Hennessy, J.L. 1982. Optimizing delayed branches. InProc., 15th Annual Workshop on Micro-programming (Oct.), pp. 114–120.
Gross, T., and Ward, M. 1990. The suppression of compensation code. InAdvances in Languages and Compilers for Parallel Computing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, London, pp. 260–273.
Gurd, J., Kirkham, C.C., and Watson, I. 1985. The Manchester prototype dataflow computer.CACM, 28, 1(Jan.): 34–52.
Hallin, T.G., and Flynn, M.J. 1972. Pipelining of arithmetic functions.IEEE Trans. Comps., C-21, 8 (Aug.): 880–886.
Hendren, L.J., Gao, G.R., Altman, E.R., and Mukerji, C. 1992. Register allocation using cyclic interval graphs: A new approach to an old problem. ACAPS Tech. Memo 33, Advanced Computer Architecture and Program Structures Group, McGill Univ., Montreal.
Hennessy, J.L., and Gross, T. 1983. Post-pass code optimization of pipelined constraints.ACM Trans. Programming Languages and Systems, 5, 3 (July): 422–448.
Hennessy, J., Jouppi, N., Baskett, F., Gross, T., and Gill, J. 1982. Hardware/software tradeoffs for increased performance. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.) pp. 2–11.
Hennessy, J. Jouppi, N., Przybylski, S., Rowen, C., Gross, T., Baskett, F., and Gill, J. 1982. MIPS: A microprocessor architecture. InProc., 15th Annual Workshop on Microprogramming (Palo Alto, Calif., Oct.), pp. 17–22.
Hintz, R.G., and Tate, D.P. 1972. Control Data STAR-100 processor design. InProc., COMPCON '72 (Sept.), pp. 1–4.
Hsu, P.Y.T. 1986. Highly concurrent scalar processing. Ph.D. thesis, Univ. of Ill., Urbana-Champaign, Ill.
Hsu, P.Y.T., and Davidson, E.S. 1986. Highly concurrent scalar processing. InProc., Thirteenth Annual Internat. Symp. on Computer Architecture, pp. 386–395.
Hsu, W.-C. 1987. Register allocation and code scheduling for load/store architectures. Comp. Sci. Tech. Rept. no. 722, Univ. of Wisc., Madison.
Hu, T.C. 1961. Parallel sequencing and assembly line problems.Operations Research, 9, 6: 841–848.
Hwu, W.W., and Chang, P.P. 1988. Exploiting parallel microprocessor microarchitectures with a compiler code generator. In Proc.,15th Annual Internat. Symp. on Computer Architecture (Honolulu, May), pp. 45–53.
Hwu, W.W., and Patt, Y.N. 1986. HPSm, a high performance restricted data flow architecture having minimal functionality. InProc., 13th Annual Internat. Symp. on Computer Architecture (Tokyo, June), pp. 297–306.
Hwu, W.W., and Patt, Y.N. 1987. Checkpoint repair for out-of-order execution machines.IEEE Trans. Comps., C-36, 12 (Dec): 1496–1514.
Hwu, W.W., Conte, T.M., and Chang, P.P. 1989. Comparing software and hardware schemes for reducing the cost of branches. InProc., 16th Annual Internat. Symp. on Computer Architecture (May), pp. 224–233.
Hwu, W.W., Mahlke, S.A., Chen, W.Y., Chang, P.P., Waiter, N.J., Bringmann, R.A., Ouellette, R.G., Hank, R.E., Kiyohara, T., Haab, G.E., Holm, J.G., and Lavery, D.M. 1993. The superblock: An effective technique for VLIW and superscalar compilation.The J. Supercomputing, 7, 1/2: 229–248.
IBM. 1967.IBM J. Res. and Dev., 11, 1 (Jan.). Special issue on the System/360 Model 91.
IBM. 1976.IBM 3838 Array Processor Functional Characteristics. Pub. no. 6A24-3639-0, file no. S370-08, IBM Corp., Endicott, N.Y.
IBM. 1990.IBM J. Res. and Dev., 34, 1 (Jan.). Special issue on the IBM RISC System/6000 processor.
Intel. 1989a.i860 64-Bit Microprocessor Programmer's Reference Manual. Pub. no. 240329-001, Intel Corp., Santa Clara, Calif.
Intel. 1989b.80960CA User's Manual. Pub. no. 270710-001, Intel Corp., Santa Clara, Calif.
Jain, S. 1991. Circular scheduling: A new technique to perform software pipelining. InProc., ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation (June), pp. 219–228.
Johnson, M. 1991.Superscalar Microprocessor Design. Prentice-Hall, Englewood Cliffs, N.J.
Jouppi, N.P. 1989. The nonuniform distribution of instruction-level and machine parallelism and its effect on performance.IEEE Trans. Comps., C-38, 12 (Dec): 1645–1658.
Jouppi, N.P., and Wall, D. 1989. Available instruction level parallelism for superscalar and superpipelined machines. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 272–282.
Kasahara, H., and Narita, S. 1984. Practical multiprocessor scheduling algorithms for efficient parallel processing.IEEE Trans. Comps., C-33, 11 (Nov.): 1023–1029.
Keller, R.M. 1975. Look-ahead processors.Computing Surveys 7, 4 (Dec): 177–196.
Kleir, R.L. 1974. A representation for the analysis of microprogram operation. InProc., 7th Annual Workshop on Microprogramming (Sept.), pp. 107–118.
Kleir, R.L., and Ramamoorthy, C.V. 1971. Optimization strategies for microprograms.IEEE Trans. Comps., C-20, 7 (July): 783–794.
Kogge, P.M. 1973. Maximal rate pipelined solutions to recurrence programs. InProc., First Annual Symp. on Computer Architecture (Univ. of Fla., Gainesville, Dec), pp. 71–76.
Kogge, P.M. 1974. Parallel solution of recurrence problems.IBM J. Res. and Dev., 18, 2 (Mar.): 138–148.
Kogge, P.M. 1977a. Algorithm development for pipelined processors. InProc., 1977 Internat. Conf. on Parallel Processing (Aug.), p. 217.
Kogge, P.M. 1977b. The microprogramming of pipelined processors. InProc., 4th Annual Symp. on Computer Architecture (Mar.), pp. 63–69.
Kogge, P.M. 1981.The Architecture of Pipelined Computers. McGraw-Hill, New York.
Kogge, P.M., and Stone, H.S. 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Trans. Comps., C-22, 8 (Aug.): 786–793.
Kohler, W.H. 1975. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems.IEEE Trans. Comps., C-24, 12 (Dec): 1235–1238.
Kohn, L., and Margulis, N. 1989. Introducing the Intel i860 64-bit microprocessor.IEEE Micro, 9, 4 (Aug.): 15–30.
Kunkel, S.R., and Smith, J.E. 1986. Optimal pipelining in supercomputers. InProc., 13th Annual Internat. Symp. on Computer Architecture (Tokyo, June), pp. 404–411.
Labrousse, J., and Slavenburg, G.A. 1988. CREATE-LIFE: A design system for high performance VLSI circuits. InProc., Internat. Conf. on Circuits and Devices, pp. 365–360.
Labrousse, J., and Slavenburg, G.A. 1990a. A 50 MHz microprocessor with a VLIW architecture. InProc., ISSCC '90 (San Francisco), pp. 44–45.
Labrousse, J., and Slavenburg, G.A. 1990b. CREATE-LIFE: A modular design approach for high performance ASICs. InProc., COMPCON '90 (San Francisco), pp. 427–433.
Lam, M.S.-L. 1987. A systolic array optimizing compiler. Ph.D. thesis, Carnegie Mellon Univ., Pittsburgh.
Lam. M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. InProc., ACM SIGPLAN '88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 318–327.
Lam, M.S., and Wilson, R.P. 1992. Limits of control flow on parallelism. InProc., Nineteenth Internat. Symp. on Computer Architecture (Gold Coast, Australia, May), pp. 46–57.
Landskov, D., Davidson, S., Shriver, B., and Mallett, P.W. 1980. Local microcode compaction techniques.ACM Computer Surveys, 12, 3 (Sept.): 261–294.
Lee, J.K.F., and Smith, A.J. 1984. Branch prediction strategies and branch target buffer design.Computer, 17, 1 (Jan.): 6–22.
Lee, M., Tirumalai, P.P., and Ngai, T.-F. 1993. Software pipelining and superblock scheduling: Compilation techniques for VLIW machines. InProc., 26th Annual Hawaii Internat. Conf. on System Sciences (Hawaii, Jan.), vol. 1, pp. 202–213.
Linn, J.L. 1988. Horizontal microcode compaction. InMicroprogramming and Firmware Engineering Methods (S. Habib, ed.), Van Nostrand Reinhold, New York, pp. 381–431.
Lowney, P.G., Freudenberger, S.M., Karzes, T.J., Lichtenstein, W.D., Nix, R.P., O'Donnell, J.S., and Ruttenburg, J.C. 1993. The Multiflow trace scheduling compiler.The J. Supercomputing, 7, 1/2: 51–142.
Mahlke, S.A., Chen, W.Y., Hwu, W.W., Rau, B.R., and Schlansker, M.S. 1992. Sentinel scheduling for VLIW and superscalar processors. InProc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 238–247.
Mahlke, S.A., Lin, D.C., Chen, W.Y., Hank, R.E., and Bringmann, R.A. 1992. Effective compiler support for predicated execution using the hyperblock. InProc., 25th Annual Internat. Symp. on Microarchitecture (Dec), pp. 45–54.
Mallett, P.W. 1978. Methods of compacting microprograms. Ph.D. thesis, Univ. of Southwestern La., Lafayette, La.
Mangione-Smith, W., Abraham, S.G., and Davidson, E.S. 1992. Register requirements of pipelined processors. InProc., Internat. Conf. on Supercomputing (Washington, D.C., July).
McFarling, S., and Hennessy, J. 1986. Reducing the cost of branches. InProc., Thirteenth Internat. Symp. on Computer Architecture (Tokyo, June), pp. 396–403.
Moon, S.-M., Ebcioglu, K. 1992. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. InProc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec), pp. 55–71.
Nakatani, T., and Ebcioglu, K. 1990. Using a lookahead window in a compaction-based parallelizing compiler. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Fla., Nov.), pp. 57–68.
Nicolau, A. 1984. Parallelism, memory anti-aliasing and correctness for trace scheduling compilers. Ph.D. thesis, Yale Univ., New Haven, Conn.
Nicolau, A. 1985a. Percolation scheduling: A parallel compilation technique. Tech. Rept. TR 85-678, Dept. of Comp. Sci., Cornell, Ithaca, N.Y.
Nicolau, A. 1985b. Uniform parallelism exploitation in ordinary programs. InProc., Internat. Conf. on Parallel Processing (Aug.), pp. 614–618.
Nicolau, A., and Fisher, J.A. 1981. Using an oracle to measure parallelism in single instruction stream programs. InProc., Fourteenth Annual Microprogramming Workshop (Oct.), pp. 171–182.
Nicolau, A., and Fisher, J.A. 1984. Measuring the parallelism available for very long instruction word architectures.IEEE Trans. Comps., C-33, 11 (Nov.): 968–976.
Nicolau, A., and Potasman, R. 1990. Realistic scheduling: Compaction for pipelined architectures. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Fla., Nov.), pp. 69–79.
Oehler, R.R., and Blasgen, M.W. 1991. IBM RISC System/6000: Architecture and performance.IEEE Micro, 11, 3 (June): 14.
Papadopoulos, G.M., and Culler, D.E. 1990. Monsoon: An explicit token store architecture. InProc., Seventeenth Internat. Symp. on Computer Architecture (Seattle, May), pp. 82–91.
Park, J.C.H., and Schlansker, M.S. 1991. On predicated execution. Tech. Rept. HPL-91-58, Hewlett Packard Laboratories.
Patel, J.H. 1976. Improving the throughput of pipelines with delays and buffers. Ph.D. thesis, Univ. of Ill., Urbana-Champaign, Ill.
Patel, J.H., and Davidson, E.S. 1976. Improving the throughput of a pipeline by insertion of delays. InProc., 3rd Annual Symp. on Computer Architecture (Jan.), pp. 159–164.
Patterson, D.A., and Sequin, C.H. 1981. RISC I: A reduced instruction set VLSI computer. InProc., 8th Annual Symp. on Computer Architecture (Minneapolis, May), pp. 443–450.
Peterson, C., Sutton, J., and Wiley, P., 1991. iWarp: A 100-MOPS, LIW microprocessor for multicomputers.IEEE Micro, 11, 3 (June): 26.
Popescu, V., Schultz, M., Spracklen, J., Gibson, G., Lightner, B., and Isaman, D. 1991. The Metaflow architecture.IEEE Micro, 11, 3 (June): 10.
Radin, G. 1982. The 801 minicomputer. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 39–47.
Ramakrishnan, S. 1992. Software pipelining in PA-RISC compilers.Hewlett-Packard J. (July): 39–45.
Ramamoorthy, C.V., and Gonzalez, M.J. 1969. A survey of techniques for recognizing parallel processable streams in computer programs. InProc., AFIPS Fall Joint Computing Conf., pp. 1–15.
Ramamoorthy, C.V., and Tsuchiya, M. 1974. A high level language for horizontal microprogramming.IEEE Trans. Comps., C-23: 791–802.
Ramamoorthy, C.V, Chandy, K.M., and Gonzalez, M.J. 1972. Optimal scheduling strategies in a multiprocessor system.IEEE Trans. Comps., C-21, 2 (Feb.): 137–146.
Rau, B.R. 1988. Cydra 5 Directed Dataflow architecture. InProc., COMPCON '88 (San Francisco, Mar.), pp. 106–113.
Rau, B.R. 1992. Data flow and dependence analysis for instruction level parallelism. InFourth Internat. Workshop on Languages and Compilers for Parallel Computing (U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds.), Springer-Verlag, pp. 236–250.
Rau, B.R., and Glaeser, CD. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. InProc., Fourteenth Annual Workshop on Microprogramming (Oct.), pp. 183–198.
Rau, B.R., Glaeser, C.D., and Greenawalt, E.M. 1982. Architectural support for the efficient generation of code for horizontal architectures. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 96–99.
Rau, B.R., Glaeser, CD., and Picard, R.L. 1982. Efficient code generation for horizontal architectures: Compiler techniques and architectural support. InProc., Ninth Annual Internat. Symp. on Computer Architecture (Apr.), pp. 131–139.
Rau, B.R., Lee, M., Tirumalai, P., and Schlansker, M.S. 1992. Register allocation for software pipelined loops. InProc., SIGPLAN '92 Conf. on Programming Language Design and Implementation (San Francisco, June 17–19), pp. 283–299.
Rau, B.R., Yen, D.W.L., Yen, W., and Towle, R.A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions and trade-offs.Computer, 22, 1 (Jan.): 12–34.
Riseman, E.M., and Foster, C.C. 1972. The inhibition of potential parallelism by conditional jumps.IEEE Trans. Comps., C-21, 12 (Dec): 1405–1411.
Ruggiero, J.F., and Coryell, D. A. 1969. An auxiliary processing system for array calculations.IBM Systems J., 8, 2: 118–135.
Russell, R.M. 1978. The CRAY-1 computer system.CACM, 21: 63–72.
Rymarczyk, J. 1982. Coding guidelines for pipelined processors. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp, 12–19.
Schmidt, U., and Caesar, K. 1991. Datawave: A single-chip multiprocessor for video applications.IEEE Micro, 11, 3 (June): 22.
Schneck, P.B. 1987.Supercomputer Architecture. Kluwer Academic, Norwell, Mass.
Schuette, M.A., and Shen, J.P. 1993. Instruction-level experimental evaluation of the Multiflow TRACE 14/300 VLIW computer.The J. Supercomputing, 7, 1/2: 249–271.
Sethi, R. 1975. Complete register allocation problems.SIAM J. Computing, 4, 3: 226–248.
Sethi, R., and Ullman, J.D. 1970. The generation of optimal code for arithmetic expressions,JACM, 17, 4 (Oct.): 715–728.
Sites, R.L. 1978. Instruction ordering for the CRAY-1 computer. Tech. rept. 78-CS-023, Univ. of Calif., San Diego.
Smith, J.E. 1981. A study of branch prediction strategies. InProc., Eighth Annual Internat. Symp. on Computer Architecture (May), pp. 135–148.
Smith, J.E. 1982. Decoupled access/execute architectures. InProc., Ninth Annual Internat. Symp. on Computer Architecture (Apr.), pp. 112–119.
Smith, J.E. 1989. Dynamic instruction scheduling and the Astronautics ZS-1.Computer, 22, 1 (Jan.): 21–35.
Smith, J.E., and Pleszkun, A.R. 1988. Implementing precise interrupts in pipelined processors.IEEE Trans. Comps., C-37, 5 (May): 562–573.
Smith, J.E., Dermer, G.E., Vanderwarn, B.D., Klinger, S.D., Roszewski, CM., Fowler, D.L., Scidmore, K.R., and Laudon, J.P. 1987. The ZS-1 central processor.In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 199–204.
Smith, M.D., Horowitz, M., and Lam, M. 1992. Efficient superscalar performance through boosting. In Proc.,Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 248–259.
Smith, M.D., Lam, M.S., and Horowitz, M.A. 1990. Boosting beyond static scheduling in a superscalar processor. InProc., Seventeenth Internat. Symp. on Computer Architecture (June), pp. 344–354.
Smotherman, M., Krishnamurthy, S., Aravind, P.S., and Hunnicutt, D. 1991. Efficient DAG construction and heuristic calculation for instruction scheduling. InProc., 24th Annual Internat. Workshop on Microarchitecture (Albuquerque, N.M., Nov.), pp. 93–102.
Sohi, G.S., and Vajapayem, S. 1987. Instruction issue logic for high-performance, interruptable pipelined processors. InProc., 14th Annual Symp. on Computer Architecture (Pittsburgh, June), pp. 27–36.
Su, B., and Ding, S. 1985. Some experiments in global microcode compaction. In Proc., 18th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 175–180.
Su, B., and Wang, J. 1991a. GURPR*: A new global software pipelining algorithm. In Proc., 24th Annual Internat. Symp. on Microarchitecture (Albuquerque, N.M., Nov.), pp. 212–216.
Su, B., and Wang, J. 1991b. Loop-carried dependence and the general URPR software pipelining approach. In Proc., 24th Annual Hawaii Internat. Conf. on System Sciences (Hawaii, Jan.).
Su, B., Ding, S., and Jin, L. 1984. An improvement of trace scheduling for global microcode compaction. In Proc., 17th Annual Workshop on Microprogramming (New Orleans, Oct.), pp. 78–85.
Su, B., Ding, S., and Xia, J. 1986. URPR—An extension of URCR for software pipelining. In Proc., 19th Annual Workshop on Microprogramming (New York, Oct.), pp. 104–108.
Su, B., Ding, S., Wang, J., and Xia, J. 1987. GURPR—A method for global software pipelining. In Proc., 20th Annual Workshop on Microprogramming (Colorado Springs, Colo., Dec.), pp. 88–96.
Thistle, M.R., and Smith, B.J. 1988. A processor architecture for Horizon. In Proc., Supercomputing '88 (Orlando, Fla., Nov.), pp. 35–41.
Thomas, A.T., and Davidson, E.S. 1974. Scheduling of multiconfigurable pipelines. In Proc., 12th Annual Allerton Conf. on Circuits and Systems Theory (Allerton, Ill.), pp. 658–669.
Thornton, J.E. 1964. Parallel operation in the Control Data 6600. In Proc., AFIPS Fall Joint Computer Conf., pp. 33–40.
Thornton, J.E. 1970. Design of a Computer—The Control Data 6600. Scott, Foresman, Glenview, Ill.
Tirumalai, P., Lee, M., and Schlansker, M.S. 1990. Parallelization of loops with exits on pipelined architectures. In Proc., Supercomputing '90 (Nov.), pp. 200–212.
Tjaden, G.S., and Flynn, M.J. 1970. Detection and parallel execution of parallel instructions. IEEE Trans. Comps., C-19, 10 (Oct.): 889–895.
Tjaden, G.S., and Flynn, M.J. 1973. Representation of concurrency with ordering matrices. IEEE Trans. Comps., C-22, 8 (Aug.): 752–761.
Tokoro, M., Tamura, E., and Takizuka, T. 1981. Optimization of microprograms. IEEE Trans. Comps., C-30, 7 (July): 491–504.
Tokoro, M., Takizuka, T., Tamura, E., and Yamaura, I. 1978. A technique of global optimization of microprograms. In Proc., 11th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 41–50.
Tokoro, M., Tamura, E., Takase, K., and Tamaru, K. 1977. An approach to microprogram optimization considering resource occupancy and instruction formats. In Proc., 10th Annual Workshop on Microprogramming (Niagara Falls, N.Y., Nov.), pp. 92–108.
Tomasulo, R.M. 1967. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. and Dev., 11, 1 (Jan.): 25–33.
Touzeau, R.F. 1984. A FORTRAN compiler for the FPS-164 scientific computer. In Proc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal), pp. 48–57.
Tsuchiya, M., and Gonzalez, M.J. 1974. An approach to optimization of horizontal microprograms. In Proc., Seventh Annual Workshop on Microprogramming (Palo Alto, Calif.), pp. 85–90.
Tsuchiya, M., and Gonzalez, M.J. 1976. Toward optimization of horizontal microprograms. IEEE Trans. Comps., C-25, 10 (Oct.): 992–999.
Uht, A.K. 1986. An efficient hardware algorithm to extract concurrency from general-purpose code. In Proc., Nineteenth Annual Hawaii Conf. on System Sciences (Jan.), pp. 41–50.
Wall, D.W. 1991. Limits of instruction-level parallelism. In Proc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 176–188.
Warren, H.S. 1990. Instruction scheduling for the IBM RISC System/6000 processor. IBM J. Res. and Dev., 34, 1 (Jan.): 85–92.
Warter, N.J., Bockhaus, J.W., Haab, G.E., and Subramanian, K. 1992. Enhanced modulo scheduling for loops with conditional branches. In Proc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec.), pp. 170–179.
Watson, W.J. 1972. The TI ASC—A highly modular and flexible super computer architecture. In Proc., AFIPS Fall Joint Computer Conf., pp. 221–228.
Wedig, R.G. 1982. Detection of concurrency in directly executed language instruction streams. Ph.D. thesis, Stanford Univ., Stanford, Calif.
Weiss, S., and Smith, J.E. 1984. Instruction issue logic for pipelined supercomputers. In Proc., 11th Annual Internat. Symp. on Computer Architecture, pp. 110–118.
Weiss, S., and Smith, J.E. 1987. A study of scalar compilation techniques for pipelined supercomputers. In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 105–109.
Wilkes, M.V. 1951. The best way to design an automatic calculating machine. In Proc., Manchester Univ. Comp. Inaugural Conf. (Manchester, England, July), pp. 16–18.
Wilkes, M.V., and Stringer, J.B. 1953. Microprogramming and the design of the control circuits in an electronic digital computer. In Proc., The Cambridge Philosophical Society, Part 2 (Apr.), pp. 230–238.
Wolfe, A., and Shen, J.P. 1991. A variable instruction stream extension to the VLIW architecture. In Proc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 2–14.
Wood, G. 1978. On the packing of micro-operations into micro-instruction words. In Proc., 11th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 51–55.
Wood, G. 1979. Global optimization of microprograms through modular control constructs. In Proc., 12th Annual Workshop on Microprogramming (Hershey, Penn.), pp. 1–6.
Yau, S.S., Schowe, A.C., and Tsuchiya, M. 1974. On storage optimization of horizontal microprograms. In Proc., Seventh Annual Workshop on Microprogramming (Palo Alto, Calif.), pp. 98–106.
Yeh, T.Y., and Patt, Y.N. 1992. Alternative implementations of two-level adaptive branch prediction. In Proc., Nineteenth Internat. Symp. on Comp. Architecture (Gold Coast, Australia, May), pp. 124–134.
Zima, H., and Chapman, B. 1990. Supercompilers for Parallel and Vector Computers. Addison-Wesley, Reading, Mass.
Cite this article
Rau, B.R., Fisher, J.A. Instruction-level parallel processing: History, overview, and perspective. J Supercomput 7, 9–50 (1993). https://doi.org/10.1007/BF01205181
DOI: https://doi.org/10.1007/BF01205181