My hardware has 32 lanes, each with 8 registers, and the emitted assembly code should reflect this layout.
I defined the registers in the .td file in the following order:
L_0_R_0,
L_0_R_1,
L_0_R_2,
L_0_R_3,
L_0_R_4,
L_0_R_5,
L_0_R_6,
L_0_R_7,
L_1_R_0,
L_1_R_1,
L_1_R_2,
L_1_R_3,
L_1_R_4,
L_1_R_5,
L_1_R_6,
L_1_R_7,
…
L_31_R_0,
L_31_R_1,
L_31_R_2,
L_31_R_3,
L_31_R_4,
L_31_R_5,
L_31_R_6,
L_31_R_7,
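To make the ordering explicit, here is a minimal Python sketch that enumerates the names in the same lane-major order (all 8 registers of lane 0, then lane 1, and so on). It is not part of the backend; the helper names `register_names` and `lane_of` are made up for illustration.

```python
NUM_LANES = 32      # lanes in the hardware, as described above
REGS_PER_LANE = 8   # registers per lane, as described above


def register_names(num_lanes=NUM_LANES, regs_per_lane=REGS_PER_LANE):
    """Return the register names in lane-major order: L_0_R_0 ... L_31_R_7."""
    return [
        f"L_{lane}_R_{reg}"
        for lane in range(num_lanes)
        for reg in range(regs_per_lane)
    ]


def lane_of(index, regs_per_lane=REGS_PER_LANE):
    """Map a position in the definition order back to its lane number."""
    return index // regs_per_lane


names = register_names()
print(len(names), names[0], names[8], names[-1])
```

With this order, the first 8 entries all belong to lane 0, and lane 1 only starts at position 8.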
Now, when I generate assembly for the vec sum code using the instructions I implemented and the default x86 scheduling and register allocation, only lane L_0 is used. It should use all the lanes. How can I achieve this?
Currently it emits the following:
P_2048B_LOAD_DWORD L_0_R_0, Pword ptr [rip + b]
P_2048B_LOAD_DWORD L_0_R_1, Pword ptr [rip + c]
P_2048B_VADD L_0_R_0, L_0_R_1, L_0_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a], L_0_R_0
P_2048B_LOAD_DWORD L_0_R_0, Pword ptr [rip + b+2048]
P_2048B_LOAD_DWORD L_0_R_1, Pword ptr [rip + c+2048]
P_2048B_VADD L_0_R_0, L_0_R_1, L_0_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+2048], L_0_R_0
P_2048B_LOAD_DWORD L_0_R_0, Pword ptr [rip + b+4096]
P_2048B_LOAD_DWORD L_0_R_1, Pword ptr [rip + c+4096]
P_2048B_VADD L_0_R_0, L_0_R_1, L_0_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+4096], L_0_R_0
P_2048B_LOAD_DWORD L_0_R_0, Pword ptr [rip + b+6144]
P_2048B_LOAD_DWORD L_0_R_1, Pword ptr [rip + c+6144]
P_2048B_VADD L_0_R_0, L_0_R_1, L_0_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+6144], L_0_R_0
Instead, it should emit the following:
P_2048B_LOAD_DWORD L_0_R_0, Pword ptr [rip + b]
P_2048B_LOAD_DWORD L_0_R_1, Pword ptr [rip + c]
P_2048B_VADD L_0_R_0, L_0_R_1, L_0_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a], L_0_R_0
P_2048B_LOAD_DWORD L_1_R_0, Pword ptr [rip + b+2048]
P_2048B_LOAD_DWORD L_1_R_1, Pword ptr [rip + c+2048]
P_2048B_VADD L_1_R_0, L_1_R_1, L_1_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+2048], L_1_R_0
P_2048B_LOAD_DWORD L_2_R_0, Pword ptr [rip + b+4096]
P_2048B_LOAD_DWORD L_2_R_1, Pword ptr [rip + c+4096]
P_2048B_VADD L_2_R_0, L_2_R_1, L_2_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+4096], L_2_R_0
P_2048B_LOAD_DWORD L_3_R_0, Pword ptr [rip + b+6144]
P_2048B_LOAD_DWORD L_3_R_1, Pword ptr [rip + c+6144]
P_2048B_VADD L_3_R_0, L_3_R_1, L_3_R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+6144], L_3_R_0
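The pattern in the expected output above can be stated as a rule: the i-th 2048-byte chunk uses registers R_0 and R_1 of lane i. The following Python sketch only reproduces that expected text so the rule is unambiguous; it is not a compiler pass. The function name `expected_asm` is hypothetical, and wrapping back to lane 0 after lane 31 is an assumption, since the listing only shows the first four chunks.

```python
CHUNK_BYTES = 2048  # offset step between consecutive chunks in the listing
NUM_LANES = 32      # lanes in the hardware


def expected_asm(num_chunks, num_lanes=NUM_LANES):
    """Build the desired assembly text, one lane per 2048-byte chunk."""
    lines = []
    for i in range(num_chunks):
        lane = i % num_lanes  # assumption: wrap around after the last lane
        offset = i * CHUNK_BYTES
        suffix = f"+{offset}" if offset else ""
        r0 = f"L_{lane}_R_0"
        r1 = f"L_{lane}_R_1"
        lines += [
            f"P_2048B_LOAD_DWORD {r0}, Pword ptr [rip + b{suffix}]",
            f"P_2048B_LOAD_DWORD {r1}, Pword ptr [rip + c{suffix}]",
            f"P_2048B_VADD {r0}, {r1}, {r0}",
            f"P_2048B_STORE_DWORD Pword ptr [rip + a{suffix}], {r0}",
        ]
    return lines


print("\n".join(expected_asm(4)))
```

Calling `expected_asm(4)` yields the 16 lines listed above, with lanes L_0 through L_3.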
Does this involve changing the register live intervals, or the scheduling?
Please help. I have been trying hard but am unable to solve this.