winky 0 Light Poster

I am currently studying the impact of microarchitectural techniques on performance. I have been looking at a code sequence, how to insert stalls into it correctly, and how to make it more efficient. I have tried several different methods and measured the cycles per loop iteration for each.

I was wondering if you could look below at what I did and then let me know if I am stalling correctly and reordering correctly. It would be awesome if you guys could give me any suggestions or any feedback :).

Here is the code I am working with, as well as the latencies beyond a single cycle (these are extra cycles on top of the first one, so an instruction listed as +N takes N+1 cycles in total). Also note that the branch is always taken and the branch delay slot is one cycle.

Sorry for the indenting. I wanted to separate everything, but it somehow got messed up in Word :(.

Latencies beyond single cycle:
Memory LD            +3
Memory SD            +1
Integer ADD, SUB     +0
Branches             +1
ADDD                 +2
MULTD                +4
DIVD                 +10
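
To make the +N convention concrete, here is a minimal Python sketch of the table. The names EXTRA_LATENCY and total_cycles are just ones I made up for illustration.

```python
# Extra cycles beyond the first, copied from the latency table above.
EXTRA_LATENCY = {
    "LD": 3, "SD": 1, "ADD": 0, "SUB": 0, "BRANCH": 1,
    "ADDD": 2, "MULTD": 4, "DIVD": 10,
}

def total_cycles(op):
    """Total cycles for one instruction: the first cycle plus the extra latency."""
    return 1 + EXTRA_LATENCY[op]

# Print the table in "+N -> N+1 cycles" form.
for op in EXTRA_LATENCY:
    print(f"{op:6s} +{EXTRA_LATENCY[op]:<2d} -> {total_cycles(op)} cycles")
```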

Loop:   LD      F2, 0(Rx)
I0:     MULTD   F2, F0, F2
I1:     DIVD    F8, F2, F0
I2:     LD      F4, 0(Ry)
I3:     ADDD    F4, F0, F4
I4:     ADDD    F10, F8, F2
I5:     SD      F4, 0(Ry)
I6:     ADDI    Rx, Rx, #8
I7:     ADDI    Ry, Ry, #8
I8:     SUB     R20, R4, Rx
I9:     BNZ     R20, Loop
Branch Delay Slot

The first method I used was stalling only where there was a true data dependency, instead of stalling on every single instruction.

Loop:   LD      F2, 0(Rx)
<stall> x 3
I0:     MULTD   F2, F0, F2
<stall> x 4
I1:     DIVD    F8, F2, F0
I2:     LD      F4, 0(Ry)
<stall> x 3
I3:     ADDD    F4, F0, F4
I4:     ADDD    F10, F8, F2
I5:     SD      F4, 0(Ry)
I6:     ADDI    Rx, Rx, #8
I7:     ADDI    Ry, Ry, #8
I8:     SUB     R20, R4, Rx
I9:     BNZ     R20, Loop
Branch Delay Slot

This left me with 48 cycles per iteration.
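
As a way to cross-check hand-placed stalls, here is a rough Python sketch of an in-order, single-issue pipeline that stalls only on true (read-after-write) dependences. It rests on several assumptions of mine: full forwarding, so a consumer may issue N+1 cycles after a +N producer; ADDI treated as an integer ADD; an empty (NOP) delay slot; and only register results tracked, so the branch's own +1 is not charged. A different model will give a different total, so treat the output as a sanity check rather than the answer.

```python
# Extra latency beyond the first cycle, from the table above.
# Assumption: ADDI behaves like an integer ADD, BNZ like a branch, NOP costs +0.
LATENCY = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "BNZ": 1,
           "ADDD": 2, "MULTD": 4, "DIVD": 10, "NOP": 0}

# (label, opcode, destination register or None, source registers)
LOOP = [
    ("Loop", "LD",    "F2",  ["Rx"]),
    ("I0",   "MULTD", "F2",  ["F0", "F2"]),
    ("I1",   "DIVD",  "F8",  ["F2", "F0"]),
    ("I2",   "LD",    "F4",  ["Ry"]),
    ("I3",   "ADDD",  "F4",  ["F0", "F4"]),
    ("I4",   "ADDD",  "F10", ["F8", "F2"]),
    ("I5",   "SD",    None,  ["F4", "Ry"]),
    ("I6",   "ADDI",  "Rx",  ["Rx"]),
    ("I7",   "ADDI",  "Ry",  ["Ry"]),
    ("I8",   "SUB",   "R20", ["R4", "Rx"]),
    ("I9",   "BNZ",   None,  ["R20"]),
    ("slot", "NOP",   None,  []),  # branch delay slot, assumed empty
]

def single_issue(program):
    """Return [(label, issue_cycle, stalls_before)] for an in-order, single-issue pipe."""
    ready = {}        # register -> first cycle in which a consumer may issue
    schedule = []
    prev_issue = 0
    for label, op, dest, srcs in program:
        operands_ready = max((ready.get(r, 0) for r in srcs), default=0)
        issue = max(prev_issue + 1, operands_ready)
        schedule.append((label, issue, issue - prev_issue - 1))
        if dest is not None:
            ready[dest] = issue + 1 + LATENCY[op]
        prev_issue = issue
    return schedule

for label, issue, stalls in single_issue(LOOP):
    print(f"{label:5s} issues in cycle {issue:2d} after {stalls} stall(s)")
```

The issue cycle of the last line is the per-iteration count under these assumptions, and the stall column can be compared line by line with the stalls I inserted by hand.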

Next, I used a multiple-issue design where results can be immediately forwarded from one unit to another or itself. It should only stall to observe a true data dependence.

1st Pipeline                        2nd Pipeline
Loop:   LD      F2, 0(Rx)           I0:     MULTD   F2, F0, F2
I1:     DIVD    F8, F2, F0          I2:     LD      F4, 0(Ry)
I3:     ADDD    F4, F0, F4          <stall> x 6 (waiting for F8)
                                    I4:     ADDD    F10, F8, F2
I5:     SD      F4, 0(Ry)           I6:     ADDI    Rx, Rx, #8
I7:     ADDI    Ry, Ry, #8          I8:     SUB     R20, R4, Rx
I9:     BNZ     R20, Loop           Branch Delay Slot

This gave me 23 cycles per loop iteration.
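
For the multiple-issue case, here is a rough Python sketch of an in-order scheduler that issues up to `width` instructions per cycle and stalls only on true data dependences. The assumptions are mine, not part of the problem statement: either pipeline can run any instruction (no structural hazards), results are forwarded so a consumer may issue N+1 cycles after a +N producer, two dependent instructions never issue in the same cycle, ADDI counts as an integer ADD, and the delay slot holds a NOP.

```python
# Extra latency beyond the first cycle, from the table above.
# Assumption: ADDI behaves like an integer ADD, BNZ like a branch, NOP costs +0.
LATENCY = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "BNZ": 1,
           "ADDD": 2, "MULTD": 4, "DIVD": 10, "NOP": 0}

# (opcode, destination register or None, source registers), in program order.
LOOP = [
    ("LD",    "F2",  ["Rx"]),        # Loop
    ("MULTD", "F2",  ["F0", "F2"]),  # I0
    ("DIVD",  "F8",  ["F2", "F0"]),  # I1
    ("LD",    "F4",  ["Ry"]),        # I2
    ("ADDD",  "F4",  ["F0", "F4"]),  # I3
    ("ADDD",  "F10", ["F8", "F2"]),  # I4
    ("SD",    None,  ["F4", "Ry"]),  # I5
    ("ADDI",  "Rx",  ["Rx"]),        # I6
    ("ADDI",  "Ry",  ["Ry"]),        # I7
    ("SUB",   "R20", ["R4", "Rx"]),  # I8
    ("BNZ",   None,  ["R20"]),       # I9
    ("NOP",   None,  []),            # branch delay slot, assumed empty
]

def schedule(program, width):
    """Return the issue cycle of each instruction for an in-order, `width`-issue pipe."""
    ready = {}            # register -> first cycle in which a consumer may issue
    cycles = []
    cycle, used = 1, 0    # current issue cycle and slots already filled in it
    for op, dest, srcs in program:
        earliest = max((ready.get(r, 0) for r in srcs), default=0)
        if earliest > cycle:          # operands not ready yet: stall until they are
            cycle, used = earliest, 0
        elif used == width:           # all issue slots of this cycle are taken
            cycle, used = cycle + 1, 0
        cycles.append(cycle)
        used += 1
        if dest is not None:
            ready[dest] = cycle + 1 + LATENCY[op]
    return cycles

print("single issue:", schedule(LOOP, 1))
print("dual issue:  ", schedule(LOOP, 2))
```

The last entry of each list is the cycle in which the delay slot issues, which is the per-iteration count under these assumptions. A stricter model (for example, only one pipeline allowed to access memory) would add cycles.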


The final thing I did was use the multiple-issue design and reorder the code to improve the performance.

1st Pipeline                        2nd Pipeline
Loop:   LD      F2, 0(Rx)           I0:     MULTD   F2, F0, F2
I2:     LD      F4, 0(Ry)           I1:     DIVD    F8, F2, F0
I3:     ADDD    F4, F0, F4          I8:     SUB     R20, R4, Rx
I5:     SD      F4, 0(Ry)           I6:     ADDI    Rx, Rx, #8
I4:     ADDD    F10, F8, F2         I7:     ADDI    Ry, Ry, #8
I9:     BNZ     R20, Loop           Branch Delay Slot

This gave me 18 cycles per loop iteration.
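
A reordering is only safe if every pair of dependent instructions keeps its original relative order. Here is a minimal Python sketch of such a check. It is deliberately conservative (it compares every pair of instructions that touch the same register), it only looks at registers and not at memory aliasing between the LD and SD, and it assumes the reordered table is read row by row with the 1st pipeline before the 2nd. The function names are my own.

```python
# (label, destination register or None, source registers), in original program order.
ORIGINAL = [
    ("Loop", "F2",  ["Rx"]),
    ("I0",   "F2",  ["F0", "F2"]),
    ("I1",   "F8",  ["F2", "F0"]),
    ("I2",   "F4",  ["Ry"]),
    ("I3",   "F4",  ["F0", "F4"]),
    ("I4",   "F10", ["F8", "F2"]),
    ("I5",   None,  ["F4", "Ry"]),
    ("I6",   "Rx",  ["Rx"]),
    ("I7",   "Ry",  ["Ry"]),
    ("I8",   "R20", ["R4", "Rx"]),
    ("I9",   None,  ["R20"]),
]

# Assumed reading of the reordered table: row by row, 1st pipeline before 2nd.
REORDERED = ["Loop", "I0", "I2", "I1", "I3", "I8", "I5", "I6", "I4", "I7", "I9"]

def dependences(program):
    """Yield (earlier, later, kind, register) for every RAW, WAR and WAW pair."""
    for i, (a, a_dest, a_srcs) in enumerate(program):
        for b, b_dest, b_srcs in program[i + 1:]:
            if a_dest is not None and a_dest in b_srcs:
                yield (a, b, "RAW", a_dest)   # b reads what a wrote
            if b_dest is not None and b_dest in a_srcs:
                yield (a, b, "WAR", b_dest)   # b overwrites what a read
            if a_dest is not None and a_dest == b_dest:
                yield (a, b, "WAW", a_dest)   # both write the same register

def violations(program, new_order):
    """Return the dependences whose two instructions were swapped by new_order."""
    position = {label: i for i, label in enumerate(new_order)}
    return [d for d in dependences(program) if position[d[0]] > position[d[1]]]

for earlier, later, kind, reg in violations(ORIGINAL, REORDERED):
    print(f"{later} moved ahead of {earlier}: {kind} on {reg}")
```

If nothing is printed, the new order respects every register dependence under this check. Anything it does print is a pair worth looking at again before trusting the cycle count.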

Thanks again in advance for any responses :)