I am currently studying the impact of microarchitectural techniques on performance. I have been looking at a piece of code, how to insert stalls into it correctly, and how to reorder it to make it more efficient. I tried several different methods and measured the cycles per iteration for each.
I was wondering if you could look below at what I did and then let me know if I am stalling correctly and reordering correctly. It would be awesome if you guys could give me any suggestions or any feedback :).
Here is the code I am working with, as well as the latencies beyond a single cycle (these are extra cycles, so an instruction listed as +N actually takes N+1 cycles). Also note that the branch is always taken and the branch delay slot is one cycle.
Sorry for the indentation. I wanted to separate everything, but it somehow got messed up in Word :(.
Latencies beyond a single cycle:
Memory LD +3
Memory SD +1
Integer ADD, SUB +0
Branches +1
ADDD +2
MULTD +4
DIVD +10
Loop: LD F2, 0(Rx)
I0: MULTD F2, F0, F2
I1: DIVD F8, F2, F0
I2: LD F4, 0(Ry)
I3: ADDD F4, F0, F4
I4: ADDD F10, F8, F2
I5: SD F4, 0(Ry)
I6: ADDI Rx, Rx, #8
I7: ADDI Ry, Ry, #8
I8: SUB R20, R4, Rx
I9: BNZ R20, Loop
Branch Delay Slot
The first method I used was stalling only where there are true data dependencies, instead of stalling on every single instruction.
Loop: LD F2, 0(Rx)
<stall> x 3
I0: MULTD F2, F0, F2
<stall> x 4
I1: DIVD F8, F2, F0
I2: LD F4, 0(Ry)
<stall> x 3
I3: ADDD F4, F0, F4
I4: ADDD F10, F8, F2
I5: SD F4, 0(Ry)
I6: ADDI Rx, Rx, #8
I7: ADDI Ry, Ry, #8
I8: SUB R20, R4, Rx
I9: BNZ R20, Loop
Branch Delay Slot
This left me with 48 cycles per iteration.
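A minimal sketch of how a count like this could be cross-checked, assuming a single-issue, in-order pipeline with full forwarding, where an instruction listed as +N makes its result available to a consumer N+1 cycles after it issues. The register lists, the `issue_cycles` helper and the no-op in the branch delay slot are my own assumptions for illustration. The sketch charges a stall for every true dependence it finds, including the wait of I4 on F8 from the DIVD, so it is meant as a sanity check rather than a definitive answer.

```python
# Extra latency beyond the first cycle, as listed above (+N means N+1 cycles).
EXTRA = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "BNZ": 1,
         "ADDD": 2, "MULTD": 4, "DIVD": 10, "NOP": 0}

# (label, opcode, destination register or None, source registers)
PROGRAM = [
    ("Loop", "LD",    "F2",  ["Rx"]),
    ("I0",   "MULTD", "F2",  ["F0", "F2"]),
    ("I1",   "DIVD",  "F8",  ["F2", "F0"]),
    ("I2",   "LD",    "F4",  ["Ry"]),
    ("I3",   "ADDD",  "F4",  ["F0", "F4"]),
    ("I4",   "ADDD",  "F10", ["F8", "F2"]),
    ("I5",   "SD",    None,  ["F4", "Ry"]),
    ("I6",   "ADDI",  "Rx",  ["Rx"]),
    ("I7",   "ADDI",  "Ry",  ["Ry"]),
    ("I8",   "SUB",   "R20", ["R4", "Rx"]),
    ("I9",   "BNZ",   None,  ["R20"]),
    ("slot", "NOP",   None,  []),  # assumption: the delay slot holds a no-op
]

def issue_cycles(program):
    """Return [(label, issue cycle)] for in-order, single-issue execution."""
    ready = {}   # register -> first cycle in which a consumer may issue
    cycle = 0    # issue cycle of the previous instruction
    schedule = []
    for label, op, dest, srcs in program:
        earliest = cycle + 1                      # at most one issue per cycle
        for reg in srcs:                          # wait for true dependences
            earliest = max(earliest, ready.get(reg, 0))
        cycle = earliest
        schedule.append((label, cycle))
        if dest is not None:
            ready[dest] = cycle + 1 + EXTRA[op]   # N+1 cycles until usable
    return schedule

schedule = issue_cycles(PROGRAM)
total = schedule[-1][1]            # cycle in which the delay slot issues
stalls = total - len(PROGRAM)      # cycles in which nothing issued

for label, cycle in schedule:
    print(f"{label:>4} issues in cycle {cycle}")
print(f"cycles per iteration: {total} ({stalls} stall cycles)")
```

Changing the `EXTRA` table or the order of `PROGRAM` shows how each assumption moves the total.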
Next, I used a multiple-issue design where results can be immediately forwarded from one unit to another or to itself. It should only stall to observe a true data dependence.
1st Pipeline                 2nd Pipeline
Loop: LD F2, 0(Rx)           I0: MULTD F2, F0, F2
I1: DIVD F8, F2, F0          I2: LD F4, 0(Ry)
I3: ADDD F4, F0, F4          <stall> x 6 (waiting for F8)
I4: ADDD F10, F8, F2
I5: SD F4, 0(Ry)             I6: ADDI Rx, Rx, #8
I7: ADDI Ry, Ry, #8          I8: SUB R20, R4, Rx
I9: BNZ R20, Loop            Branch Delay Slot
This gave me 23 cycles per loop iteration.
The final thing I did was use the multiple-issue design and reorder the code to improve the performance.
1st Pipeline                 2nd Pipeline
Loop: LD F2, 0(Rx)           I0: MULTD F2, F0, F2
I2: LD F4, 0(Ry)             I1: DIVD F8, F2, F0
I3: ADDD F4, F0, F4          I8: SUB R20, R4, Rx
I5: SD F4, 0(Ry)             I6: ADDI Rx, Rx, #8
I4: ADDD F10, F8, F2         I7: ADDI Ry, Ry, #8
I9: BNZ R20, Loop            Branch Delay Slot
This gave me 18 cycles per loop iteration.
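One way to sanity-check a reordering like this is to compare the true (read-after-write) dependences before and after the move. The sketch below assumes the reordered schedule is read row by row, 1st pipeline before 2nd, looks at a single iteration only, and ignores memory dependences through 0(Rx) and 0(Ry). The `raw_producers` helper and the register table are hypothetical names introduced for illustration.

```python
# label -> (destination register or None, source registers)
REGS = {
    "Loop": ("F2",  ["Rx"]),
    "I0":   ("F2",  ["F0", "F2"]),
    "I1":   ("F8",  ["F2", "F0"]),
    "I2":   ("F4",  ["Ry"]),
    "I3":   ("F4",  ["F0", "F4"]),
    "I4":   ("F10", ["F8", "F2"]),
    "I5":   (None,  ["F4", "Ry"]),
    "I6":   ("Rx",  ["Rx"]),
    "I7":   ("Ry",  ["Ry"]),
    "I8":   ("R20", ["R4", "Rx"]),
    "I9":   (None,  ["R20"]),
}

ORIGINAL = ["Loop", "I0", "I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9"]
# Reordered schedule, flattened row by row (1st pipeline, then 2nd pipeline).
REORDERED = ["Loop", "I0", "I2", "I1", "I3", "I8", "I5", "I6", "I4", "I7", "I9"]

def raw_producers(order):
    """Map (consumer, register) -> label of the latest earlier writer, or None
    if the value comes from before the loop body (live-in)."""
    last_writer = {}
    producers = {}
    for label in order:
        dest, srcs = REGS[label]
        for reg in srcs:
            producers[(label, reg)] = last_writer.get(reg)
        if dest is not None:
            last_writer[dest] = label
    return producers

before = raw_producers(ORIGINAL)
after = raw_producers(REORDERED)

# Reads whose producing instruction differs between the two orders.
changed = sorted(key for key in before if before[key] != after[key])
for consumer, reg in changed:
    print(f"{consumer} reads {reg}: producer was {before[(consumer, reg)]}, "
          f"now {after[(consumer, reg)]}")
```

With the two orders as transcribed here, the only pair it reports is I8 reading Rx: in the original order the value comes from I6 (the incremented Rx), while in the reordered schedule I8 sits ahead of I6 and reads the old Rx. An empty `changed` list would mean the reordering kept every true dependence within the iteration.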
Thanks again in advance for any responses :)