Instruction-Level Parallel Processing: History, Overview, and Perspective

Abstract

Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.

Copyright information

© 1993 Springer Science+Business Media New York

About this chapter

Cite this chapter

Rau, B.R., Fisher, J.A. (1993). Instruction-Level Parallel Processing: History, Overview, and Perspective. In: Rau, B.R., Fisher, J.A. (eds) Instruction-Level Parallelism. The Springer International Series in Engineering and Computer Science, vol 235. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-3200-2_3

  • DOI: https://doi.org/10.1007/978-1-4615-3200-2_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-6404-7

  • Online ISBN: 978-1-4615-3200-2

  • eBook Packages: Springer Book Archive
