The MUL instruction and its operational equation, shown earlier in this chapter, are repeated below:
According to the equation above, the contents of the source registers, RS1 and RS2, are multiplied by a 32-bit fixed-point multiplier, and the 64-bit result is divided into two parts. The most significant 32 bits of the result are returned to RD2, and the least significant 32 bits to RD1.
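As a minimal sketch of the split described above, the following Python model forms the 64-bit product and returns its upper and lower halves. The function name mul_split and the treatment of both operands as unsigned values are assumptions made for illustration; the text does not state the signedness of the operands.

```python
MASK32 = 0xFFFFFFFF  # selects the low 32 bits of a value


def mul_split(rs1, rs2):
    """Model MUL: return (rd2, rd1), the upper and lower 32 bits of the product.

    Assumption: both operands are treated as unsigned 32-bit values.
    """
    product = (rs1 & MASK32) * (rs2 & MASK32)  # up to 64 bits wide
    rd1 = product & MASK32                     # least significant 32 bits
    rd2 = (product >> 32) & MASK32             # most significant 32 bits
    return rd2, rd1
```

For small operands the upper half is zero; for example, mul_split(3, 5) places 15 in RD1 and 0 in RD2.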
The iterative method suggests an algorithm that uses only SLI and ADD instructions. This algorithm does not require any additional hardware for the CPU; however, it takes many clock cycles to complete. The algorithm for a four-bit multiplication is shown in Figs. 6.212 and 6.213. The same method can be extended to 32 bits.
Fig. 6.212 An iterative fixed-point multiplication algorithm
Fig. 6.213 An iterative fixed-point multiplication algorithm
Assume that the quantities, r = {r3 r2 r1 r0} and c = {c3 c2 c1 c0}, represent a four-bit multiplier and a four-bit multiplicand, respectively. Also assume that the term, pp, corresponds to a partial product sum that adds a newly generated partial product to the old partial product in each iteration. The variable, i, represents the iteration index bounded by 0 and (n-1), where n signifies the number of multiplier and multiplicand bits.
The first iteration (i = 0) generates the partial product pp0. If r0 = 1, then pp0 = {c3 c2 c1 c0}; otherwise, pp0 = {0 0 0 0}, as shown in Fig. 6.212.
The second iteration is composed of three steps. The first step evaluates pp1 much like pp0. If r1 = 1, then pp1 = {c3 c2 c1 c0} else pp1 = {0 0 0 0}. In the second step, pp1 is shifted one bit to the left before adding it to pp0. This step produces pp1 = {c3 c2 c1 c0 0} if r1 = 1; otherwise, pp1 = {0 0 0 0 0}. The third step adds pp1 to the old partial product, pp0, and forms the partial product sum, pp = {s4 s3 s2 s1 s0}, as mentioned earlier.
The third and fourth iterations are also composed of three steps as shown in Fig. 6.213. The difference between them is that pp2 and pp3 are shifted to the left by two bits and three bits, respectively. This way, they will be in the correct bit position before they are added to the old partial product.
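The four iterations above can be traced with a short Python sketch. It is only an illustration of the generate, shift and add steps; the function name four_bit_partial_products and the example operands are assumptions and do not come from the figures.

```python
def four_bit_partial_products(r, c):
    """Trace the four iterations for a 4-bit multiplier r and multiplicand c.

    Returns the list of shifted partial products and the final sum.
    """
    r &= 0xF  # keep only the four multiplier bits r3..r0
    c &= 0xF  # keep only the four multiplicand bits c3..c0
    shifted = []
    pp_sum = 0
    for i in range(4):
        r_i = (r >> i) & 1            # multiplier bit examined in iteration i
        pp_i = c if r_i == 1 else 0   # step 1: generate the partial product
        pp_i = pp_i << i              # step 2: shift it left by i bit positions
        pp_sum = pp_sum + pp_i        # step 3: add it to the old partial product
        shifted.append(pp_i)
    return shifted, pp_sum


# Hypothetical example: r = 1011 (11) and c = 1101 (13)
shifted, total = four_bit_partial_products(0b1011, 0b1101)
```

With these operands the shifted partial products are 13, 26, 0 and 104, and their sum is 143, which equals 11 x 13.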
Finally, the generate, left-shift and add steps applied to the partial products result in the compact flow chart shown in Fig. 6.214. Note that each r-term in this figure is indexed by the variable i. Therefore, r = {r0 r1 r2 r3} is identical to {r[0] r[1] r[2] r[3]}. Similarly, each pp-term uses an indexed representation. Therefore, pp = {pp0, pp1, pp2, pp3} is identical to {pp[0], pp[1], pp[2], pp[3]}.
Fig. 6.214 The flow chart as a result of the iterative fixed-point multiplication algorithm
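The loop structure implied by the iterative method can be sketched for an n-bit operand as follows. This is one possible reading under stated assumptions, not a transcription of the flow chart: the name iterative_mul, the unsigned operands, and the choice to shift the multiplicand left by one bit per iteration (instead of shifting each partial product by i bits) are all assumptions made for illustration.

```python
def iterative_mul(r, c, n=32):
    """Multiply an n-bit multiplier r by an n-bit multiplicand c.

    Uses only a bit test, a one-bit left shift and an addition per iteration,
    in the spirit of a program built from SLI and ADD instructions.
    Assumption: both operands are unsigned.
    """
    mask = (1 << n) - 1
    r &= mask
    c &= mask
    pp = 0                    # running partial product sum
    for i in range(n):
        if (r >> i) & 1:      # test r[i]
            pp = pp + c       # add the multiplicand, already shifted i times
        c = c << 1            # shift left by one bit for the next iteration
    return pp                 # up to 2n bits wide
```

The loop always runs n times, which is why the method needs no extra hardware but takes many clock cycles for 32-bit operands.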
Draw the detailed ALU and the CPU schematic that executes these two instructions. Label all interconnections, bus widths and control signals.
Note: The reader should also attempt to implement the hardware that executes the integer multiply (MUL) instruction and superimpose it on top of the data-path that executes ADD and MADD instructions.
The incremental area is calculated by the flow chart given below.
- (a) Assuming Reg[R0] = 0, write a program using the instruction set given in Chapter 6. Make comments next to each instruction in the program.
- (b) Form an instruction chart for this program, executing in a five-stage CPU, and show all the data dependencies that require forwarding loops. Stall the pipeline using the NOP instruction if necessary. Consider the branch or jump delay penalty to be 1 cycle.
A is located at the data cache address 100. X needs to be stored at the address 200. All instructions take one cycle except multiply, which takes three cycles. The RF contains only R0 and R1. Reg[R0] = 0.
Make sure to have only 16-bit values in source registers, RS1 and RS2, in order to avoid the overflow condition in the destination register, RD, when the MUL instruction is used.
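The reason for the 16-bit restriction can be checked numerically. The sketch below assumes unsigned operands and a 32-bit destination register; the helper name product_fits_in_32_bits is hypothetical.

```python
def product_fits_in_32_bits(rs1, rs2):
    """Return True if the unsigned product rs1 * rs2 fits in a 32-bit register."""
    return rs1 * rs2 < (1 << 32)


# Largest unsigned 16-bit value
largest_16_bit = (1 << 16) - 1

# (2^16 - 1)^2 = 2^32 - 2^17 + 1, which is still below 2^32
worst_case_ok = product_fits_in_32_bits(largest_16_bit, largest_16_bit)

# Two 17-bit operands can already overflow: 2^16 * 2^16 = 2^32
overflow_case_ok = product_fits_in_32_bits(1 << 16, 1 << 16)
```

Here worst_case_ok evaluates to True and overflow_case_ok to False, so limiting both source registers to 16 bits is sufficient to keep the product within 32 bits.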
- (a) Write assembly code to compute and store the value of X. Make sure to write comments next to each instruction to keep track of the register values.
- (b) Rewrite the assembly code with an instruction chart. Indicate all stalls caused by NOP instructions and forwarding loops on this chart.
The replacement policy on a cache miss is as follows:
- (i) An entire block of data is transferred between the CPU and the cache
- (ii) The block with the fewest references is replaced
- (iii) The least significant block is replaced if all the memory references are the same in a set
The CPU transactions and the contents of the main memory before these transactions are shown below:
- (a) Draw the block diagram of the cache and tag memories. Show the field format of the CPU address in terms of tag, index and block offset.
- (b) Show the cache and tag memory contents after the eighth, tenth and twelfth transactions by individually drawing the cache and tag contents. Update the main memory contents if there is any change.
6. A 32-bit, five-stage RISC CPU organized in Little Endian format executes the flow chart below. The CPU contains an integer RF with 32 registers where Reg[R0] = 0. The integer values, such as SUM = 0, are stored at the data memory address 100, i = 1 is stored at 101, and the compare value of 100 for i is stored at 102. The final SUM value needs to be stored in the data memory address of 200.
- (a) Write an assembly program using the following instruction set. Accompany each instruction in the program with register data and comments.
A is located at the memory address 100.
B is located at the memory address 101.
Y needs to be stored at the memory address 102.
Reg[R0] = 0.
- (a) Write a program to compute Y.
- (b) This program executes in a six-stage CPU. Two clock cycles are required to access data memory for a LOAD operation. Rewrite the program to accommodate this requirement. Show all forwarding loops and include all the necessary NOPs in the instruction chart.
- (c) Indicate the minimum number of clock cycles to execute the program in part (b).
The instruction set and the bit-field format for each instruction are shown below.
The CPU maintains the following rules:
- (i) Every instruction is executed in a different number of clock cycles
- (ii) No NOP instruction is allowed
- (iii) LOAD does not have an ALU cycle but requires two data memory cycles
- (iv) INVERT does not have a data memory cycle but requires one ALU cycle
- (v) MUL does not have a data memory cycle but requires three ALU cycles
- (vi) ADD does not have a data memory cycle but requires two ALU cycles
- (vii) STORE does not have an ALU cycle but requires one data memory cycle
Construct the instruction chart to execute the flow chart above. Show all the necessary forwarding loops and possible data hazards. Show the cases in which there may be structural hazards and indicate how to prevent them.
9. The following instruction set needs to be executed in a 32-bit RISC CPU organized in Little Endian format. The CPU has three pipeline stages where the ALU and write-back stages are combined. The CPU is capable of executing the integer (ADDI, SLI and SRI) and floating-point (ADDF and MULF) instructions. The CPU stores the fixed and floating-point numbers in two separate register files, each containing 32 registers.
In the instruction set below, RS and RD are defined as the source and destination addresses for the integer registers, and FS1, FS2 and FD are the source and destination addresses for the floating-point registers, respectively.
Show a detailed data-path of this CPU, indicating all internal bus widths and port names. Include only the necessary functional units.
Projects
1. Implement a 32-bit four-stage RISC CPU that executes only the ADD instruction using Verilog. On a timing diagram, trace through the data and control signals at the output ports of the instruction memory, RF, ALU and write-back stages.
2. Implement ADD, SUB, AND, NAND, OR, NOR, XOR, XNOR, SL and SR instructions in a 32-bit four-stage RISC CPU, and perform complete verification using Verilog.
3. Implement a 32-bit five-stage RISC CPU that executes LOAD, STORE, MOVE and MOVEI instructions using Verilog. Trace through the data and control signals at the output ports of the instruction memory, RF, ALU, data memory and write-back stages in a timing diagram.
4. Implement a 32-bit four-stage RISC CPU that executes only the BRA instruction using Verilog. Trace through the data and control signals at the output ports of the instruction memory and RF stages on a timing diagram.
5. Implement and verify the 32-bit floating-point adder using Verilog. Verify the validity of data at the outputs of every major stage using timing diagrams and perform functional verification for the entire adder.
6. Implement and verify the 32-bit floating-point multiplier using Verilog. Verify the validity of data at the outputs of every major stage using timing diagrams, and perform functional verification for the entire multiplier. Use behavioral Verilog to mimic the exponent adder and the integer multiplier.