AVX Scheduling and Parallelism

Hello,

After generating AVX code for a large number of iterations, I came to realize that it still uses only two registers, zmm0 and zmm1, even when the loop unroll factor is 1024.

I wonder whether this register allocation allows operations to execute in parallel?

Also, I know that all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel as well? For example, are two vmov instructions with different registers executed in parallel? It seems they could be, since each core has an AVX unit. Does the compiler exploit this?

Secondly, I am generating assembly for Intel, and there are some offsets, like the rip register or a constant added to the memory index. Why is that?

Example 1:

vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]

and

Example 2:

mov rax, -393216
.p2align 4, 0x90
.LBB0_1: # %vector.body
         # =>This Inner Loop Header: Depth=1

vmovdqu32 zmm1, zmmword ptr [rax + c+401344] ; load [c+401344] into zmm1
vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load [c+401280] into zmm0
vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1 <- zmm1 + [b+401344]
vmovdqu32 zmmword ptr [rax + a+401344], zmm1 ; store zmm1 to [a+401344]

vmovdqu32 zmm1, zmmword ptr [rax + c+401216] ; load [c+401216] into zmm1
vpaddd zmm0, zmm0, zmmword ptr [rax + b+401280] ; zmm0 <- zmm0 + [b+401280]
vmovdqu32 zmmword ptr [rax + a+401280], zmm0 ; store zmm0 to [a+401280]
vmovdqu32 zmm0, zmmword ptr [rax + c+401152] ; load [c+401152] into zmm0

… and in the remaining instructions, too, only zmm0 and zmm1 are used.

As you can see in the above examples, multiple registers could have been used. I also wonder whether the repeating groups of instructions in example 2 are executed in parallel. Why reuse zmm0 and zmm1 — couldn't the compiler use more zmm registers, so that all the independent groups run in parallel? In the listing above, the zmm1 instructions have dependencies among themselves, and the zmm0 instructions likewise, so neither chain can run internally in parallel; but the zmm1 chain and the zmm0 chain could execute in parallel with each other, couldn't they?

Please correct me if I am wrong.

It is possible that scheduling is being constrained by pointer-aliasing assumptions. Could you share the source for the loop in question?

RIP-relative indexing, as I recall, is a feature of position-independent code. Based on what’s below, it might cause problems by making the instruction encodings large. cc’ing some Intel folks for further comments.

 -Hal

int a[100351], b[100351], c[100351];

void foo(void) {
  int i;
  for (i = 0; i < 100351; i++) {
    a[i] = b[i] + c[i];
  }
}

I’ll attempt to answer some of the questions.

Using only a few registers isn't, by itself, going to limit performance. Modern processors internally implement what's known as register renaming. Each time a register is written, the processor allocates one of nearly 200 internal registers to hold the result of the computation. Each time a register like xmm0 is read by an instruction, the processor looks up in a table which internal register was last assigned to xmm0, and the instruction reads that internal register to get the value of xmm0.

So if there are a lot of independent calculations, it's perfectly fine to use only a few architectural registers, because the processor will remap them to a much larger set internally. Of those ~200 internal registers, only a few have a name (xmm0, xmm1, etc.) accessible to software at any point in the code. Having more named registers allows you to keep more values "live" at a particular point in a program.

Although each core has an AVX unit, a single "thread" of a program can only use one core. Using the other cores requires the developer to write special code to create additional threads and give them their own work to do. The cost of moving data between cores is quite high, so these multithreaded tasks need to be fairly independent.

Within a thread, a processor can theoretically execute up to about six instructions simultaneously. But those six instructions are handled by units of the CPU designed for different tasks: a couple of those units handle only loads and stores, while others handle only arithmetic instructions. So to execute the maximum number of instructions in parallel, you need the right mix of operations. And even the units that handle arithmetic instructions aren't created equal; for example, there is only one unit that can handle a divide, and a divide keeps that unit busy for tens of cycles.

rip-relative addressing isn't specific to position-independent code. In fact, I believe x86-64 code is effectively always position independent because it always uses rip-relative addressing. There are only a few instructions that can take a full 64-bit constant address, so rip-relative addressing is used for things like global variables and constant tables. The assumption is that the code and data are within 2 GB of each other in the address space and always in the same layout, so you can use a fixed offset from the current instruction pointer (the "ip" in rip stands for instruction pointer) to access the data. This results in a smaller encoding than if you needed a full 64-bit offset.

Hi Ahmed,

From what can be seen in the code snippet you provided, the reuse of ZMM0 and ZMM1 across loop-unroll instances does not inhibit instruction-level parallelism.

Modern x86 processors use register renaming, which eliminates the false dependencies in the instruction stream. In the example you provided, the processor should be able to identify the two-vload + vadd + vstore sequences as independent and pipeline their execution.

Thanks, Zvi

Hi, Zvi,

I agree. In the context of targeting the KNL, however, I’m a bit concerned about the addressing, and specifically, the size of the resulting encoding:

           vmovdqu32 zmm0, zmmword ptr [rax + c+401280]     ; load [c+401280] into zmm0

           vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]  ; zmm1 <- zmm1 + [b+401344]

The KNL can only deliver 16 bytes per cycle from the icache to the decoder. Essentially all of the instructions in the loop, as we seem to generate it, have 10-byte encodings:

 10:   62 f1 7e 48 6f 80 00    vmovdqu32 0x0(%rax),%zmm0
 17:   00 00 00
         16: R_X86_64_32S   c+0x61f00

 38:   62 f1 7d 48 fe 80 00    vpaddd 0x0(%rax),%zmm0,%zmm0
 3f:   00 00 00
         3e: R_X86_64_32S   b+0x61f00

and since this seems like a generic feature of how we generate code, it seems like we can end up decoder limited (it might even be decoder limited for this loop). We might want to be less aggressive in generating complex addressing modes for the KNL. It seems like it would be better to materialize the base array addresses into registers to make the encodings shorter.

 -Hal

Hello,

I have defined 8 registers in my registerinfo.td file in the following order:

R_0, R_1, R_2, R_3, R_4, R_5, R_6, R_7

But the generated assembly code only uses two of them. How can I enable it to use all eight? Also, can I control the allocation order (e.g., use R_5 right after R_0) without changes to registerinfo.td?

What changes are required here, in the scheduling phase or in the register allocation phase?

P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+2048]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+2048]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+2048], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+4096]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+4096]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+4096], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+6144]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+6144]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+6144], R_0

Please help. I am stuck here.

Thank You