Instruction-level parallel processing: History, overview, and perspective

Rau, B. Ramakrishna; Fisher, Joseph A.

doi:10.1007/BF01205181

Published: May 1993

The Journal of Supercomputing, Volume 7, pages 9–50 (1993)

Abstract

Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.

References

Acosta, R.D., Kjelstrup, J., and Torng, H.C. 1986. An instruction issuing approach to enhancing performance in multiple function unit processors.IEEE Trans. Comps., C-35, 9 (Sept.): 815–828.
Google Scholar
Adam, T.L., Chandy, K.M., and Dickson, J.R. 1974. A comparison of list schedules for parallel processing systems.CACM, 17, 12 (Dec.): 685–690.
Google Scholar
Advanced Micro Devices. 1989.Am29000 Users Manual. Pub. no. 10620B, Advanced Micro Devices, Sunnyvale, Calif.
Google Scholar
Agerwala, T. 1976. Microprogram optimization: A survey.IEEE Trans. Comps., C-25, 10 (Oct.): 962–973.
Google Scholar
Agerwala, T., and Cocke, J. 1987. High performance reduced instruction set processors. Tech. rept. RC12434 (#55845), IBM Thomas J. Watson Research Center, Yorktown Heights, N.Y.
Google Scholar
Aho, A.V., and Johnson, S.C. 1976. Optimal code generation for expression trees.JACM, 23 3 (July): 488–501.
Google Scholar
Aho, A.V., Johnson, S.C., and Ullman, J.D. 1977a. Code generation for expressions with common subexpressions.JACM, 24, 1 (Jan.): 146–160.
Google Scholar
Aho, A.V., Johnson, S.C., and Ullman, J.D. 1977b. Code generation for machines with multiregister operations. InProc., Fourth ACM Symp. on Principles of Programming Languages, pp. 21–28.
Aiken, A., and Nicolau, A. 1988a. Optimal loop parallelization. InProc., SIGPLAN'88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 308–317.
Aiken, A., and Nicolau, A. 1988b. Perfect pipelining: A new loop parallelization technique. InProc., 1988 European Symp. on Programming, Springer Verlag, New York, pp. 221–235.
Google Scholar
Aiken, A., and Nicolau, A. 1991. A realistic resource-constrained software pipelining algorithm. InAdvances in Languages and Compilers for Parallel Processing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, London, pp. 274–290.
Google Scholar
Allen, J.R., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. InProc., Tenth Annual ACM Symp. on Principles of Programming Languages (Jan.): pp. 177–189.
Google Scholar
Anderson, D.W., Sparacio, F.J., and Tomasulo, R.M. 1967. The System/360 Model 91: Machine philosophy and instruction handling.IBM J. Res. and Dev., 11, 1 (Jan.): 8–24.
Google Scholar
Apollo Computer. 1988.The Series 10000 Personal Supercomputer: Inside a New Architecture. Publication no. 002402-007 2-88, Apollo Computer, Inc., Chelmsford, Mass.
Google Scholar
Arvind and Gostelow, K. 1982. The U-interpreter.Computer, 15, 2 (Feb.): 12–49.
Google Scholar
Arvind and Kathail, V. 1981. A multiple processor dataflow machine that supports generalised procedures. InProc., Eighth Annual Symp. on Computer Architecture (May): pp. 291–302.
Google Scholar
Auslander, M., and Hopkins, M. 1982. An overview of the PL.8 compiler. InProc., ACM SIGPLAN Symp. on Compiler Construction (Boston, June), pp. 22–31.
Bahr, R., Ciavaglia, S., Flahive, B., Kline, M., Mageau, P., and Nickel, D. 1991. The DN10000TX: A new high-performance PRISM processor. InProc., COMPCON '91, pp. 90–95.
Baker, K.R. 1974.Introduction to Sequencing and Scheduling. John Wiley, New York.
Google Scholar
Beck, G.R., Yen, D.W.L., and Anderson, T.L. 1993. The Cydra 5 minisupercomputer: Architecture and implementation.The J. Supercomputing, 7, 1/2: 143–180.
Google Scholar
Bell, C.G., and Newell, A. 1971.Computer Structures: Readings and Examples. McGraw-Hill, New York.
Google Scholar
Bernstein, D., and Rodeh, M. 1991. Global instruction scheduling for superscalar machines. InProc., SIGPLAN '91 Conf. on Programming Language Design and Implementation (June), pp. 241–255.
Google Scholar
Bernstein, D., Cohen, D., and Krawczyk, H. 1991. Code duplication: An assist for global instruction scheduling. InProc., 24th Annual Internat. Symp. on Microarchitecture (Albuquerque, N.Mex.), pp. 103–113.
Blanck, G., and Krueger, S. 1992. The SuperSPARCTM microprocessor. InProc., COMPCON '92, pp. 136–141.
Bloch, E. 1959. The engineering design of the STRETCH computer. InProc., Eastern Joint Computer Conf, pp. 48–59.
Bruno, J.L., and Sethi, R. 1976. Code generation for a one-register machine.JACM, 23, 3 (July): 502–510.
Google Scholar
Buchholz, W., ed. 1962.Planning a Computer System: Project Stretch. McGraw-Hill, New York.
Google Scholar
Butler, M., Yeh, T., Patt., Y., Alsup, M., Scales, H., and Shebanow, M. 1991. Single instruction stream parallelism is greater than two. InProc., Eighteenth Annual Internat. Symp. on Computer Architecture (Toronto), pp. 276–286.
Callahan, D., and Koblenz, B. 1991. Register allocation via hierarchical graph coloring. InProc., SIGPLAN '91 Conf. on Programming Language Design and Implementation (Toronto, June), pp. 192–203.
Callahan, D., Carr, S., and Kennedy, K. 1990. Improving register allocation for subscripted variables. InProc., ACM SIGPLAN '90 Conf. on Programming Language Design and Implementation, (White Plains, N.Y, June), pp. 53–65.
Carpenter, B.E., and Doran, R.W., eds. 1986.A.M. Turing's ACE Report of 1946 and Other Papers. MIT Press, Cambridge, Mass.
Google Scholar
Chaitin, G.J. 1982. Register allocation and spilling via graph coloring. InProc., ACM SIGPLAN Symp. on Compiler Construction (Boston, June), pp. 98–105.
Chang, P.P., and Hwu, W.W. 1988. Trace selection for compiling large C application programs to microcode. InProc., 21st Annual Workshop on Microprogramming and Microarchitectures (San Diego, Nov.), pp. 21–29.
Chang, P.P., and Hwu, W.W. 1992. Profile-guided automatic inline expansion for C programs.Software—Practice and Experience, 22, 5 (May): 349–376.
Google Scholar
Chang, P.P., Lavery, D.M., and Hwu, W.W. 1991. The importance of prepass code scheduling for superscalar and superpipelined processors. Tech. Rept. no. CRHC-91-18, Center for Reliable and High-Performance Computing, Univ. of Ill, Urbana-Champaign, Ill.
Google Scholar
Chang, P.P., Mahlke, S.A., Chen, W.Y., Warter, N.J., and Hwu, W.W. 1991. IMPACT: An architectural framework for multiple-instruction-issue processors. InProc., 18th Annual Internat. Symp. on Computer Architecture (Toronto, May), pp. 266–275.
Charlesworth, A.E. 1981. An approach to scientific array processing: The architectural design of the AP-120B/ FPS-164 family.Computer, 14, 9: 18–27.
Google Scholar
Chen, T.C. 1971. Parallelism, pipelining, and computer efficiency.Computer Design, 10, 1 (Jan.): 69–74,
Google Scholar
Chen, T.C. 1975. Overlap and pipeline processing. InIntroduction to Computer Architecture (H.S. Stone, ed.), Science Research Associates, Chicago, pp. 375–431.
Google Scholar
Chow, F., and Hennessy, J. 1984. Register allocation by priority-based coloring. InProc., ACM SIGPLAN Symp. on Compiler Construction (Montreal, June), pp. 222–232.
Chow, F.C., and Hennessy, J.L. 1990. The priority-based coloring approach to register allocation.ACM Trans. Programming Languages and Systems, 12 (Oct.): 501–536.
Google Scholar
Coffman, J.R., ed. 1976.Computer and Job-Shop Scheduling Theory. John Wiley, New York.
Google Scholar
Coffman, E.G., and Graham, R.L. 1972. Optimal scheduling for two processor systems.Acta Informatica, 1, 3: 200–213.
Google Scholar
Cohen, D. 1978. A methodology for programming a pipeline array processor. InProc., 11th Annual Microprogramming Workshop (Asilomar, Calif., Nov.), pp. 82–89.
Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., and Rodman, P.K. 1988. A VLIW architecture for a trace scheduling compiler.IEEE Trans. Comps., C-37, 8 (Aug.): 967–979.
Google Scholar
Colwell, R.P., Hall, W.E., Joshi, C.S., Papworth, D.B., Rodman, P.K., and Tornes, J.E. 1990. Architecture and implementation of a VLIW supercomputer. InProc., Supercomputing '90 (Nov.), pp. 910–919.
Google Scholar
Cotten, L.W. 1965. Circuit implementation of high-speed pipeline systems. InProc., AFIPS Fall Joint Computing Conf., pp. 489–504.
Cotten, L.W. 1969. Maximum-rate pipeline systems. InProc., AFIPS Spring Joint Computing Conf., 581–586.
Danelutto, M., and Vanneschi, M. 1990. VLIW in-the-large: A model for fine grain parallelism exploitation of distributed memory multiprocessors. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Nov.), pp. 7–16.
Google Scholar
Dasgupta, S., and Tartar, J. 1976. The identification of maximal parallelism in straight-line microprograms.IEEE Trans. Comps., C-25, 10 (Oct.): 986–991.
Google Scholar
Davidson, E.S. 1971. The design and control of pipelined function generators. InProc., 1971 Internat. IEEE Conf. on Systems, Networks, and Computers (Oaxtepec, Mexico, Jan.), pp. 19–21.
Davidson, E.S. 1974. Scheduling for pipelined processors. InProc., 7th Hawaii Conf. on Systems Sciences, pp. 58–60.
Davidson, S., Landskov, D., Shriver, B.D., and Mallett, P.W. 1981. Some experiments in local microcode compaction for horizontal machines.IEEE Trans. Comps., C-30, 7: 460–477.
Google Scholar
Davidson, E.S., Shar, L.E., Thomas, A.T., and Patel, J.H. 1975. Effective control for pipelined computers. InProc., COMPCON '90 (San Francisco, Feb.), pp. 181–184.
Dehnert, J.C., and Towle, R.A. 1993. Compiling for the Cydra 5.The J. Supercomputing, 7, 1/2: 181–227.
Google Scholar
Dehnert, J.C., Hsu, P.Y.-T., and Bratt, J.P. 1989. Overlapped loop support in the Cydra 5. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 26–38.
DeLano, E., Walker, W., Yetter, J., and Forsyth, M. 1992. A high speed superscalar PA-RISC processor.In Proc., COMPCON '92 (Feb.), pp. 116–121.
DeWitt, DJ. 1975. A control word model for detecting conflicts between microprograms. InProc., 8th Annual Workshop on Microprogramming (Chicago, Sept.), pp. 6–12.
Diefendorff, K., and Allen, M. 1992. Organization of the Motorola 88110 superscalar RISC microprocessor.IEEE Micro, 12, 2 (Apr.): 40–63.
Google Scholar
Dongarra, J.J. 1986, A survey of high performance computers. InProc., COMPCON '86 (Mar.), pp. 8–11.
Dwyer, H., and Torng, H.C. 1992. An out-of-order superscalar processor with speculative execution and fast, precise interrupts. InProc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec.), pp. 272–281.
Ebcioglu, K. 1988. Some design ideas for a VLIW architecture for sequential-natured software. InParallel Processing (Proc., IFIP WG 10.3 Working Conf. on Parallel Processing, Pisa, Italy) (M. Cosnard, M.H. Barton, and M. Vanneschi, eds.), North-Holland, pp. 3–21.
Ebcioglu, K., and Nakatani, T. 1989. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. InLanguages and Compilers for Parallel Computing (D. Gelernter, A. Nicolau, and D. Padua, eds.), Pitman/MIT Press, London, pp. 213–229.
Google Scholar
Ebcioglu, K., and Nicolau, A. 1989. Aglobal resource-constrained parallelization technique. InProc., 3rd Internat. Conf. on Supercomputing (Crete, Greece, June), pp. 154–163.
Eckert, J.P., Chu, J.C., Tonik, A.B., and Schmitt, W.F. 1959. Design of UNIVAC-LARC System: I. InProc., Eastern Joint Computer Conf., pp. 59–65.
Ellis, J.R. 1986.Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, Mass.
Google Scholar
Fawcett, B.K. 1975. Maximal clocking rates for pipelined digital systems. M.S. thesis, Univ. of Ill., Urbana-Champaign, Ill.
Google Scholar
Fernandez, E.B., and Bussel, B. 1973. Bounds on the number of processors and time for multiprocessor optimal schedule.IEEE Trans. Comps., C-22, 8 (Aug.): 745–751.
Google Scholar
Fisher, J.A. 1979. The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources, Ph.D. thesis, New York Univ., New York.
Google Scholar
Fisher, J.A. 1980. 2^N-way jump microinstruction hardware and an effective instruction binding method. InProc., 13th Annual Workshop on Microprogramming (Colorado Springs, Colo., Nov.), pp. 64–75.
Fisher, J.A. 1981. Trace scheduling: A technique for global microcode compaction.IEEE Trans. Comps., C-30, 7 (July): 478–490.
Google Scholar
Fisher, J.A. 1983. Very long instruction word architectures and the ELI-512. InProc., Tenth Annual Internat. Symp. on Computer Architecture (Stockholm, June), pp. 140–150.
Fisher, J.A. 1992. Trace Scheduling-2, an extension of trace scheduling. Tech. rept., Hewlett-Packard Laboratories.
Fisher, J.A., and Freudenberger, S.M. 1992. Predicting conditional jump directions from previous runs of a program. InProc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 85–95.
Fisher, J.A., Landskov, D., and Shriver, B.D. 1981. Microcode compaction: Looking backward and looking forward. InProc., 1981 Nat. Computer Conf., pp. 95–102.
Fisher, J.A., Ellis, J.R., Ruttenberg, J.C., and Nicolau, A. 1984. Parallel processing: A smart compiler and a dumb machine. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 37–47.
Floating Point Systems. 1979.FPS AP-120B Processor Handbook. Floating Point Systems, Inc., Beaverton, Ore.
Google Scholar
Foster, C.C., and Riseman, E.M. 1972. Percolation of code to enhance parallel dispatching and execution.IEEE Trans. Comps., C-21, 12 (Dec): 1411–1415.
Google Scholar
Franklin, M., and Sohi, G.S. 1992. The expandable split window paradigm for exploiting fine-grain parallelism. InProc. 19th Annual International Symp. on Computer Architecture (Gold Coast, Australia, May), pp. 58–67.
Freudenberger, S.M., and Ruttenberg, J.C. 1992. Phase ordering of register allocation and instruction scheduling. InCode Generation—Concepts, Tools, Techniques: Proc., Internat. Workshop on Code Generation, May 1991 (R. Giegerich, and S.L. Graham, eds.), Springer-Verlag, London, pp. 146–172.
Google Scholar
Gasperoni, F. 1989. Compilation techniques for VLIW architectures. Tech. rept. RC 14915, IBM Research Div., T.J. Watson Research Center, Yorktown Heights, N.Y.
Google Scholar
Gibbons, P.B., and Muchnick, S.S. 1986. Efficient instruction scheduling for a pipelined architecture. InProc., ACM SIGPLAN '86 Symp. on Compiler Construction (Palo Alto, Calif., July), pp. 11–16.
Golumbic, M.C., and Rainish, V. 1990. Instruction schedulig beyond basic blocks.IBM J. Res. and Dev., 34, 1 (Jan.): 93–97.
Google Scholar
Gonzalez, M.J. 1977. Deterministic processor scheduling.ACM Computer Surveys, 9, 3 (Sept.): 173–204.
Google Scholar
Goodman, J.R., and Hsu, W.-C. 1988. Code scheduling and register allocation in large basic blocks. InProc., 1988 Internat. Conf. on Supercomputing (St. Malo, France, July), pp. 442–452.
Grishman, R., and Su, B. 1983. A preliminary evaluation of trace scheduling for global microcode compaction.IEEE Trans. Comps., C-32, 12 (Dec): 1191–1194.
Google Scholar
Gross, T.R., and Hennessy, J.L. 1982. Optimizing delayed branches. InProc., 15th Annual Workshop on Micro-programming (Oct.), pp. 114–120.
Gross, T., and Ward, M. 1990. The suppression of compensation code. InAdvances in Languages and Compilers for Parallel Computing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, London, pp. 260–273.
Google Scholar
Gurd, J., Kirkham, C.C., and Watson, I. 1985. The Manchester prototype dataflow computer.CACM, 28, 1(Jan.): 34–52.
Google Scholar
Hallin, T.G., and Flynn, M.J. 1972. Pipelining of arithmetic functions.IEEE Trans. Comps., C-21, 8 (Aug.): 880–886.
Google Scholar
Hendren, L.J., Gao, G.R., Altman, E.R., and Mukerji, C. 1992. Register allocation using cyclic interval graphs: A new approach to an old problem. ACAPS Tech. Memo 33, Advanced Computer Architecture and Program Structures Group, McGill Univ., Montreal.
Google Scholar
Hennessy, J.L., and Gross, T. 1983. Post-pass code optimization of pipelined constraints.ACM Trans. Programming Languages and Systems, 5, 3 (July): 422–448.
Google Scholar
Hennessy, J., Jouppi, N., Baskett, F., Gross, T., and Gill, J. 1982. Hardware/software tradeoffs for increased performance. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.) pp. 2–11.
Hennessy, J. Jouppi, N., Przybylski, S., Rowen, C., Gross, T., Baskett, F., and Gill, J. 1982. MIPS: A microprocessor architecture. InProc., 15th Annual Workshop on Microprogramming (Palo Alto, Calif., Oct.), pp. 17–22.
Hintz, R.G., and Tate, D.P. 1972. Control Data STAR-100 processor design. InProc., COMPCON '72 (Sept.), pp. 1–4.
Hsu, P.Y.T. 1986. Highly concurrent scalar processing. Ph.D. thesis, Univ. of Ill., Urbana-Champaign, Ill.
Google Scholar
Hsu, P.Y.T., and Davidson, E.S. 1986. Highly concurrent scalar processing. InProc., Thirteenth Annual Internat. Symp. on Computer Architecture, pp. 386–395.
Hsu, W.-C. 1987. Register allocation and code scheduling for load/store architectures. Comp. Sci. Tech. Rept. no. 722, Univ. of Wisc., Madison.
Google Scholar
Hu, T.C. 1961. Parallel sequencing and assembly line problems.Operations Research, 9, 6: 841–848.
Google Scholar
Hwu, W.W., and Chang, P.P. 1988. Exploiting parallel microprocessor microarchitectures with a compiler code generator. In Proc.,15th Annual Internat. Symp. on Computer Architecture (Honolulu, May), pp. 45–53.
Hwu, W.W., and Patt, Y.N. 1986. HPSm, a high performance restricted data flow architecture having minimal functionality. InProc., 13th Annual Internat. Symp. on Computer Architecture (Tokyo, June), pp. 297–306.
Hwu, W.W., and Patt, Y.N. 1987. Checkpoint repair for out-of-order execution machines.IEEE Trans. Comps., C-36, 12 (Dec): 1496–1514.
Google Scholar
Hwu, W.W., Conte, T.M., and Chang, P.P. 1989. Comparing software and hardware schemes for reducing the cost of branches. InProc., 16th Annual Internat. Symp. on Computer Architecture (May), pp. 224–233.
Hwu, W.W., Mahlke, S.A., Chen, W.Y., Chang, P.P., Waiter, N.J., Bringmann, R.A., Ouellette, R.G., Hank, R.E., Kiyohara, T., Haab, G.E., Holm, J.G., and Lavery, D.M. 1993. The superblock: An effective technique for VLIW and superscalar compilation.The J. Supercomputing, 7, 1/2: 229–248.
Google Scholar
IBM. 1967.IBM J. Res. and Dev., 11, 1 (Jan.). Special issue on the System/360 Model 91.
IBM. 1976.IBM 3838 Array Processor Functional Characteristics. Pub. no. 6A24-3639-0, file no. S370-08, IBM Corp., Endicott, N.Y.
Google Scholar
IBM. 1990.IBM J. Res. and Dev., 34, 1 (Jan.). Special issue on the IBM RISC System/6000 processor.
Intel. 1989a.i860 64-Bit Microprocessor Programmer's Reference Manual. Pub. no. 240329-001, Intel Corp., Santa Clara, Calif.
Google Scholar
Intel. 1989b.80960CA User's Manual. Pub. no. 270710-001, Intel Corp., Santa Clara, Calif.
Google Scholar
Jain, S. 1991. Circular scheduling: A new technique to perform software pipelining. InProc., ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation (June), pp. 219–228.
Johnson, M. 1991.Superscalar Microprocessor Design. Prentice-Hall, Englewood Cliffs, N.J.
Google Scholar
Jouppi, N.P. 1989. The nonuniform distribution of instruction-level and machine parallelism and its effect on performance.IEEE Trans. Comps., C-38, 12 (Dec): 1645–1658.
Google Scholar
Jouppi, N.P., and Wall, D. 1989. Available instruction level parallelism for superscalar and superpipelined machines. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 272–282.
Kasahara, H., and Narita, S. 1984. Practical multiprocessor scheduling algorithms for efficient parallel processing.IEEE Trans. Comps., C-33, 11 (Nov.): 1023–1029.
Google Scholar
Keller, R.M. 1975. Look-ahead processors.Computing Surveys 7, 4 (Dec): 177–196.
Google Scholar
Kleir, R.L. 1974. A representation for the analysis of microprogram operation. InProc., 7th Annual Workshop on Microprogramming (Sept.), pp. 107–118.
Kleir, R.L., and Ramamoorthy, C.V. 1971. Optimization strategies for microprograms.IEEE Trans. Comps., C-20, 7 (July): 783–794.
Google Scholar
Kogge, P.M. 1973. Maximal rate pipelined solutions to recurrence programs. InProc., First Annual Symp. on Computer Architecture (Univ. of Fla., Gainesville, Dec), pp. 71–76.
Google Scholar
Kogge, P.M. 1974. Parallel solution of recurrence problems.IBM J. Res. and Dev., 18, 2 (Mar.): 138–148.
Google Scholar
Kogge, P.M. 1977a. Algorithm development for pipelined processors. InProc., 1977 Internat. Conf. on Parallel Processing (Aug.), p. 217.
Kogge, P.M. 1977b. The microprogramming of pipelined processors. InProc., 4th Annual Symp. on Computer Architecture (Mar.), pp. 63–69.
Kogge, P.M. 1981.The Architecture of Pipelined Computers. McGraw-Hill, New York.
Google Scholar
Kogge, P.M., and Stone, H.S. 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Trans. Comps., C-22, 8 (Aug.): 786–793.
Google Scholar
Kohler, W.H. 1975. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems.IEEE Trans. Comps., C-24, 12 (Dec): 1235–1238.
Google Scholar
Kohn, L., and Margulis, N. 1989. Introducing the Intel i860 64-bit microprocessor.IEEE Micro, 9, 4 (Aug.): 15–30.
Google Scholar
Kunkel, S.R., and Smith, J.E. 1986. Optimal pipelining in supercomputers. InProc., 13th Annual Internat. Symp. on Computer Architecture (Tokyo, June), pp. 404–411.
Labrousse, J., and Slavenburg, G.A. 1988. CREATE-LIFE: A design system for high performance VLSI circuits. InProc., Internat. Conf. on Circuits and Devices, pp. 365–360.
Labrousse, J., and Slavenburg, G.A. 1990a. A 50 MHz microprocessor with a VLIW architecture. InProc., ISSCC '90 (San Francisco), pp. 44–45.
Labrousse, J., and Slavenburg, G.A. 1990b. CREATE-LIFE: A modular design approach for high performance ASICs. InProc., COMPCON '90 (San Francisco), pp. 427–433.
Lam, M.S.-L. 1987. A systolic array optimizing compiler. Ph.D. thesis, Carnegie Mellon Univ., Pittsburgh.
Google Scholar
Lam. M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. InProc., ACM SIGPLAN '88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 318–327.
Lam, M.S., and Wilson, R.P. 1992. Limits of control flow on parallelism. InProc., Nineteenth Internat. Symp. on Computer Architecture (Gold Coast, Australia, May), pp. 46–57.
Landskov, D., Davidson, S., Shriver, B., and Mallett, P.W. 1980. Local microcode compaction techniques.ACM Computer Surveys, 12, 3 (Sept.): 261–294.
Google Scholar
Lee, J.K.F., and Smith, A.J. 1984. Branch prediction strategies and branch target buffer design.Computer, 17, 1 (Jan.): 6–22.
Google Scholar
Lee, M., Tirumalai, P.P., and Ngai, T.-F. 1993. Software pipelining and superblock scheduling: Compilation techniques for VLIW machines. InProc., 26th Annual Hawaii Internat. Conf. on System Sciences (Hawaii, Jan.), vol. 1, pp. 202–213.
Google Scholar
Linn, J.L. 1988. Horizontal microcode compaction. InMicroprogramming and Firmware Engineering Methods (S. Habib, ed.), Van Nostrand Reinhold, New York, pp. 381–431.
Google Scholar
Lowney, P.G., Freudenberger, S.M., Karzes, T.J., Lichtenstein, W.D., Nix, R.P., O'Donnell, J.S., and Ruttenburg, J.C. 1993. The Multiflow trace scheduling compiler.The J. Supercomputing, 7, 1/2: 51–142.
Google Scholar
Mahlke, S.A., Chen, W.Y., Hwu, W.W., Rau, B.R., and Schlansker, M.S. 1992. Sentinel scheduling for VLIW and superscalar processors. InProc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 238–247.
Mahlke, S.A., Lin, D.C., Chen, W.Y., Hank, R.E., and Bringmann, R.A. 1992. Effective compiler support for predicated execution using the hyperblock. InProc., 25th Annual Internat. Symp. on Microarchitecture (Dec), pp. 45–54.
Mallett, P.W. 1978. Methods of compacting microprograms. Ph.D. thesis, Univ. of Southwestern La., Lafayette, La.
Google Scholar
Mangione-Smith, W., Abraham, S.G., and Davidson, E.S. 1992. Register requirements of pipelined processors. InProc., Internat. Conf. on Supercomputing (Washington, D.C., July).
McFarling, S., and Hennessy, J. 1986. Reducing the cost of branches. InProc., Thirteenth Internat. Symp. on Computer Architecture (Tokyo, June), pp. 396–403.
Moon, S.-M., Ebcioglu, K. 1992. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. InProc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec), pp. 55–71.
Nakatani, T., and Ebcioglu, K. 1990. Using a lookahead window in a compaction-based parallelizing compiler. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Fla., Nov.), pp. 57–68.
Nicolau, A. 1984. Parallelism, memory anti-aliasing and correctness for trace scheduling compilers. Ph.D. thesis, Yale Univ., New Haven, Conn.
Google Scholar
Nicolau, A. 1985a. Percolation scheduling: A parallel compilation technique. Tech. Rept. TR 85-678, Dept. of Comp. Sci., Cornell, Ithaca, N.Y.
Google Scholar
Nicolau, A. 1985b. Uniform parallelism exploitation in ordinary programs. InProc., Internat. Conf. on Parallel Processing (Aug.), pp. 614–618.
Nicolau, A., and Fisher, J.A. 1981. Using an oracle to measure parallelism in single instruction stream programs. InProc., Fourteenth Annual Microprogramming Workshop (Oct.), pp. 171–182.
Nicolau, A., and Fisher, J.A. 1984. Measuring the parallelism available for very long instruction word architectures.IEEE Trans. Comps., C-33, 11 (Nov.): 968–976.
Google Scholar
Nicolau, A., and Potasman, R. 1990. Realistic scheduling: Compaction for pipelined architectures. InProc., 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Fla., Nov.), pp. 69–79.
Oehler, R.R., and Blasgen, M.W. 1991. IBM RISC System/6000: Architecture and performance.IEEE Micro, 11, 3 (June): 14.
Google Scholar
Papadopoulos, G.M., and Culler, D.E. 1990. Monsoon: An explicit token store architecture. InProc., Seventeenth Internat. Symp. on Computer Architecture (Seattle, May), pp. 82–91.
Park, J.C.H., and Schlansker, M.S. 1991. On predicated execution. Tech. Rept. HPL-91-58, Hewlett Packard Laboratories.
Patel, J.H. 1976. Improving the throughput of pipelines with delays and buffers. Ph.D. thesis, Univ. of Ill., Urbana-Champaign, Ill.
Google Scholar
Patel, J.H., and Davidson, E.S. 1976. Improving the throughput of a pipeline by insertion of delays. InProc., 3rd Annual Symp. on Computer Architecture (Jan.), pp. 159–164.
Patterson, D.A., and Sequin, C.H. 1981. RISC I: A reduced instruction set VLSI computer. InProc., 8th Annual Symp. on Computer Architecture (Minneapolis, May), pp. 443–450.
Peterson, C., Sutton, J., and Wiley, P., 1991. iWarp: A 100-MOPS, LIW microprocessor for multicomputers.IEEE Micro, 11, 3 (June): 26.
Google Scholar
Popescu, V., Schultz, M., Spracklen, J., Gibson, G., Lightner, B., and Isaman, D. 1991. The Metaflow architecture.IEEE Micro, 11, 3 (June): 10.
Google Scholar
Radin, G. 1982. The 801 minicomputer. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 39–47.
Ramakrishnan, S. 1992. Software pipelining in PA-RISC compilers.Hewlett-Packard J. (July): 39–45.
Ramamoorthy, C.V., and Gonzalez, M.J. 1969. A survey of techniques for recognizing parallel processable streams in computer programs. InProc., AFIPS Fall Joint Computing Conf., pp. 1–15.
Ramamoorthy, C.V., and Tsuchiya, M. 1974. A high level language for horizontal microprogramming.IEEE Trans. Comps., C-23: 791–802.
Google Scholar
Ramamoorthy, C.V, Chandy, K.M., and Gonzalez, M.J. 1972. Optimal scheduling strategies in a multiprocessor system.IEEE Trans. Comps., C-21, 2 (Feb.): 137–146.
Google Scholar
Rau, B.R. 1988. Cydra 5 Directed Dataflow architecture. InProc., COMPCON '88 (San Francisco, Mar.), pp. 106–113.
Rau, B.R. 1992. Data flow and dependence analysis for instruction level parallelism. InFourth Internat. Workshop on Languages and Compilers for Parallel Computing (U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds.), Springer-Verlag, pp. 236–250.
Rau, B.R., and Glaeser, CD. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. InProc., Fourteenth Annual Workshop on Microprogramming (Oct.), pp. 183–198.
Rau, B.R., Glaeser, C.D., and Greenawalt, E.M. 1982. Architectural support for the efficient generation of code for horizontal architectures. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp. 96–99.
Rau, B.R., Glaeser, CD., and Picard, R.L. 1982. Efficient code generation for horizontal architectures: Compiler techniques and architectural support. InProc., Ninth Annual Internat. Symp. on Computer Architecture (Apr.), pp. 131–139.
Rau, B.R., Lee, M., Tirumalai, P., and Schlansker, M.S. 1992. Register allocation for software pipelined loops. InProc., SIGPLAN '92 Conf. on Programming Language Design and Implementation (San Francisco, June 17–19), pp. 283–299.
Rau, B.R., Yen, D.W.L., Yen, W., and Towle, R.A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions and trade-offs.Computer, 22, 1 (Jan.): 12–34.
Google Scholar
Riseman, E.M., and Foster, C.C. 1972. The inhibition of potential parallelism by conditional jumps.IEEE Trans. Comps., C-21, 12 (Dec): 1405–1411.
Google Scholar
Ruggiero, J.F., and Coryell, D. A. 1969. An auxiliary processing system for array calculations.IBM Systems J., 8, 2: 118–135.
Google Scholar
Russell, R.M. 1978. The CRAY-1 computer system.CACM, 21: 63–72.
Google Scholar
Rymarczyk, J. 1982. Coding guidelines for pipelined processors. InProc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar.), pp, 12–19.
Schmidt, U., and Caesar, K. 1991. Datawave: A single-chip multiprocessor for video applications.IEEE Micro, 11, 3 (June): 22.
Google Scholar
Schneck, P.B. 1987.Supercomputer Architecture. Kluwer Academic, Norwell, Mass.
Google Scholar
Schuette, M.A., and Shen, J.P. 1993. Instruction-level experimental evaluation of the Multiflow TRACE 14/300 VLIW computer.The J. Supercomputing, 7, 1/2: 249–271.
Google Scholar
Sethi, R. 1975. Complete register allocation problems.SIAM J. Computing, 4, 3: 226–248.
Google Scholar
Sethi, R., and Ullman, J.D. 1970. The generation of optimal code for arithmetic expressions,JACM, 17, 4 (Oct.): 715–728.
Google Scholar
Sites, R.L. 1978. Instruction ordering for the CRAY-1 computer. Tech. rept. 78-CS-023, Univ. of Calif., San Diego.
Smith, J.E. 1981. A study of branch prediction strategies. In Proc., Eighth Annual Internat. Symp. on Computer Architecture (May), pp. 135–148.
Smith, J.E. 1982. Decoupled access/execute architectures. In Proc., Ninth Annual Internat. Symp. on Computer Architecture (Apr.), pp. 112–119.
Smith, J.E. 1989. Dynamic instruction scheduling and the Astronautics ZS-1. Computer, 22, 1 (Jan.): 21–35.
Smith, J.E., and Pleszkun, A.R. 1988. Implementing precise interrupts in pipelined processors. IEEE Trans. Comps., C-37, 5 (May): 562–573.
Smith, J.E., Dermer, G.E., Vanderwarn, B.D., Klinger, S.D., Roszewski, C.M., Fowler, D.L., Scidmore, K.R., and Laudon, J.P. 1987. The ZS-1 central processor. In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 199–204.
Smith, M.D., Horowitz, M., and Lam, M. 1992. Efficient superscalar performance through boosting. In Proc., Fifth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Oct.), pp. 248–259.
Smith, M.D., Lam, M.S., and Horowitz, M.A. 1990. Boosting beyond static scheduling in a superscalar processor. In Proc., Seventeenth Internat. Symp. on Computer Architecture (June), pp. 344–354.
Smotherman, M., Krishnamurthy, S., Aravind, P.S., and Hunnicutt, D. 1991. Efficient DAG construction and heuristic calculation for instruction scheduling. In Proc., 24th Annual Internat. Workshop on Microarchitecture (Albuquerque, N.M., Nov.), pp. 93–102.
Sohi, G.S., and Vajapeyam, S. 1987. Instruction issue logic for high-performance, interruptable pipelined processors. In Proc., 14th Annual Symp. on Computer Architecture (Pittsburgh, June), pp. 27–36.
Su, B., and Ding, S. 1985. Some experiments in global microcode compaction. In Proc., 18th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 175–180.
Su, B., and Wang, J. 1991a. GURPR*: A new global software pipelining algorithm. In Proc., 24th Annual Internat. Symp. on Microarchitecture (Albuquerque, N.M., Nov.), pp. 212–216.
Su, B., and Wang, J. 1991b. Loop-carried dependence and the general URPR software pipelining approach. In Proc., 24th Annual Hawaii Internat. Conf. on System Sciences (Hawaii, Jan.).
Su, B., Ding, S., and Jin, L. 1984. An improvement of trace scheduling for global microcode compaction. In Proc., 17th Annual Workshop on Microprogramming (New Orleans, Oct.), pp. 78–85.
Su, B., Ding, S., and Xia, J. 1986. URPR—An extension of URCR for software pipelining. In Proc., 19th Annual Workshop on Microprogramming (New York, Oct.), pp. 104–108.
Su, B., Ding, S., Wang, J., and Xia, J. 1987. GURPR—A method for global software pipelining. In Proc., 20th Annual Workshop on Microprogramming (Colorado Springs, Colo., Dec.), pp. 88–96.
Thistle, M.R., and Smith, B.J. 1988. A processor architecture for Horizon. In Proc., Supercomputing '88 (Orlando, Fla., Nov.), pp. 35–41.
Thomas, A.T., and Davidson, E.S. 1974. Scheduling of multiconfigurable pipelines. In Proc., 12th Annual Allerton Conf. on Circuits and Systems Theory (Allerton, Ill.), pp. 658–669.
Thornton, J.E. 1964. Parallel operation in the Control Data 6600. In Proc., AFIPS Fall Joint Computer Conf., pp. 33–40.
Thornton, J.E. 1970. Design of a Computer—The Control Data 6600. Scott, Foresman, Glenview, Ill.
Tirumalai, P., Lee, M., and Schlansker, M.S. 1990. Parallelization of loops with exits on pipelined architectures. In Proc., Supercomputing '90 (Nov.), pp. 200–212.
Tjaden, G.S., and Flynn, M.J. 1970. Detection and parallel execution of parallel instructions. IEEE Trans. Comps., C-19, 10 (Oct.): 889–895.
Tjaden, G.S., and Flynn, M.J. 1973. Representation of concurrency with ordering matrices. IEEE Trans. Comps., C-22, 8 (Aug.): 752–761.
Tokoro, M., Tamura, E., and Takizuka, T. 1981. Optimization of microprograms. IEEE Trans. Comps., C-30, 7 (July): 491–504.
Tokoro, M., Takizuka, T., Tamura, E., and Yamaura, I. 1978. A technique of global optimization of microprograms. In Proc., 11th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 41–50.
Tokoro, M., Tamura, E., Takase, K., and Tamaru, K. 1977. An approach to microprogram optimization considering resource occupancy and instruction formats. In Proc., 10th Annual Workshop on Microprogramming (Niagara Falls, N.Y., Nov.), pp. 92–108.
Tomasulo, R.M. 1967. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. and Dev., 11, 1 (Jan.): 25–33.
Touzeau, R.F. 1984. A FORTRAN compiler for the FPS-164 scientific computer. In Proc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal), pp. 48–57.
Tsuchiya, M., and Gonzalez, M.J. 1974. An approach to optimization of horizontal microprograms. In Proc., Seventh Annual Workshop on Microprogramming (Palo Alto, Calif.), pp. 85–90.
Tsuchiya, M., and Gonzalez, M.J. 1976. Toward optimization of horizontal microprograms. IEEE Trans. Comps., C-25, 10 (Oct.): 992–999.
Uht, A.K. 1986. An efficient hardware algorithm to extract concurrency from general-purpose code. In Proc., Nineteenth Annual Hawaii Conf. on System Sciences (Jan.), pp. 41–50.
Wall, D.W. 1991. Limits of instruction-level parallelism. In Proc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 176–188.
Warren, H.S. 1990. Instruction scheduling for the IBM RISC System/6000 processor. IBM J. Res. and Dev., 34, 1 (Jan.): 85–92.
Warter, N.J., Bockhaus, J.W., Haab, G.E., and Subramanian, K. 1992. Enhanced modulo scheduling for loops with conditional branches. In Proc., 25th Annual Internat. Symp. on Microarchitecture (Portland, Ore., Dec.), pp. 170–179.
Watson, W.J. 1972. The TI ASC—A highly modular and flexible super computer architecture. In Proc., AFIPS Fall Joint Computer Conf., pp. 221–228.
Wedig, R.G. 1982. Detection of concurrency in directly executed language instruction streams. Ph.D. thesis, Stanford Univ., Stanford, Calif.
Weiss, S., and Smith, J.E. 1984. Instruction issue logic for pipelined supercomputers. In Proc., 11th Annual Internat. Symp. on Computer Architecture, pp. 110–118.
Weiss, S., and Smith, J.E. 1987. A study of scalar compilation techniques for pipelined supercomputers. In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 105–109.
Wilkes, M.V. 1951. The best way to design an automatic calculating machine. In Proc., Manchester Univ. Comp. Inaugural Conf. (Manchester, England, July), pp. 16–18.
Wilkes, M.V., and Stringer, J.B. 1953. Microprogramming and the design of the control circuits in an electronic digital computer. In Proc., The Cambridge Philosophical Society, Part 2 (Apr.), pp. 230–238.
Wolfe, A., and Shen, J.P. 1991. A variable instruction stream extension to the VLIW architecture. In Proc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 2–14.
Wood, G. 1978. On the packing of micro-operations into micro-instruction words. In Proc., 11th Annual Workshop on Microprogramming (Asilomar, Calif., Nov.), pp. 51–55.
Wood, G. 1979. Global optimization of microprograms through modular control constructs. In Proc., 12th Annual Workshop on Microprogramming (Hershey, Penn.), pp. 1–6.
Yau, S.S., Schowe, A.C., and Tsuchiya, M. 1974. On storage optimization of horizontal microprograms. In Proc., Seventh Annual Workshop on Microprogramming (Palo Alto, Calif.), pp. 98–106.
Yeh, T.Y., and Patt, Y.N. 1992. Alternative implementations of two-level adaptive branch prediction. In Proc., Nineteenth Internat. Symp. on Comp. Architecture (Gold Coast, Australia, May), pp. 124–134.
Zima, H., and Chapman, B. 1990. Supercompilers for Parallel and Vector Computers. Addison-Wesley, Reading, Mass.

Author information

Authors and Affiliations

Hewlett-Packard Laboratories, 1501 Page Mill Road, Bldg. 3U, 94304, Palo Alto, CA
B. Ramakrishna Rau & Joseph A. Fisher

Authors

B. Ramakrishna Rau
Joseph A. Fisher

About this article

Cite this article

Rau, B.R., Fisher, J.A. Instruction-level parallel processing: History, overview, and perspective. J Supercomput 7, 9–50 (1993). https://doi.org/10.1007/BF01205181

Issue Date: May 1993
DOI: https://doi.org/10.1007/BF01205181
