Issue
The instruction-size limit of 15 bytes still applies to APX instructions.
(APX SPEC: https://cdrdv2.intel.com/v1/dl/getContent/784266)
Note that it is possible for an EVEX-encoded legacy instruction to reach the 15-byte instruction length limit: 4
bytes of EVEX prefix + 1 byte of opcode + 1 byte of ModRM + 1 byte of SIB + 4 bytes of displacement + 4
bytes of immediate = 15 bytes in total, e.g.
addq $184, -96, %rax # encoding: [0x62,0xf4,0xfc,0x18,0x81,0x04,0x25,0xa0,0xff,0xff,0xff,0xb8,0x00,0x00,0x00]
If we added a segment prefix like fs, the length would be 16.
In such a case, no additional prefix (ADSIZE or segment override) can be used. The following instructions can reach this limit:
- OP32mi_ND, OP32mi_NF, OP32mi_NF_ND, OP32mi_EVEX
- OP64mi32_ND, OP64mi32_NF, OP64mi32_NF_ND, OP64mi32_EVEX
- CCMP32mi, CTEST32mi, CCMP64mi, CTEST64mi32
- IMUL32rmi_NF, IMUL32rmi_EVEX, IMUL64rmi32_NF, IMUL64rmi32_EVEX
where OP can be AND, OR, XOR, ADD, SUB, ADC and SBB.
(See llvm/lib/Target/X86/X86InstrArithmetic.td on the main branch of llvm/llvm-project for the definitions.)
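The byte accounting above can be sketched as a small calculation. This is only an illustration: `encoded_length` and its parameters are hypothetical names, not LLVM APIs, and the model just sums the encoding components listed in the example.

```python
X86_MAX_INSTRUCTION_BYTES = 15  # architectural limit quoted above

def encoded_length(prefix_bytes, opcode_bytes=1, modrm=True, sib=False,
                   disp_bytes=0, imm_bytes=0, legacy_prefixes=0):
    """Sum the encoding components of one instruction (hypothetical helper).

    prefix_bytes    -- size of the EVEX/REX2/REX prefix (4 for EVEX)
    legacy_prefixes -- extra one-byte prefixes such as a segment override
    """
    return (legacy_prefixes + prefix_bytes + opcode_bytes
            + int(modrm) + int(sib) + disp_bytes + imm_bytes)

# addq $184, -96, %rax: EVEX + opcode + ModRM + SIB + disp32 + imm32
example_bytes = [0x62, 0xf4, 0xfc, 0x18, 0x81, 0x04, 0x25,
                 0xa0, 0xff, 0xff, 0xff, 0xb8, 0x00, 0x00, 0x00]
base = encoded_length(prefix_bytes=4, sib=True, disp_bytes=4, imm_bytes=4)

# The same instruction with an fs segment override prepended.
with_fs = encoded_length(prefix_bytes=4, sib=True, disp_bytes=4,
                         imm_bytes=4, legacy_prefixes=1)

print(base, len(example_bytes), with_fs)
print("exceeds limit:", with_fs > X86_MAX_INSTRUCTION_BYTES)
```

The first result matches the length of the encoding shown in the example, and the second is one byte over the limit.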
Let’s rule out some unlikely situations first.
- APX is supported only in 64-bit mode, where the compiler only uses 64-bit memory addressing (i.e. base and index registers are 64-bit, so no ADSIZE prefix is needed).
- The _EVEX variants of the above instructions are redundant encodings, and the compiler never emits them.
I believe we don’t need to handle these two cases in LLVM. But the segment override prefix is a real issue for instruction size, because it is likely to be used when we access a TLS variable, etc. We encountered this issue when building Node.js with APX.
Proposal
To resolve the X86 instruction-size limit issue, I propose splitting the “long” instruction into 2 instructions. This split should be done after (not necessarily immediately after) register allocation, because whether a SIB byte is required depends on the base register.
The split is: first load the memory operand into a register, and then apply the desired prefix(es) to the shorter ri version of the same instruction class, e.g.
subq $184, %fs:257(%rbx, %rcx), %rax
->
movq %fs:257(%rbx, %rcx),%rax
subq $184, %rax
This is preferred over
movq $184, %rax
subq %rax, %fs:257(%rbx, %rcx), %rax
because
- the former is shorter
- the base/index register may be the same as the dest register, in which case the latter requires a scratch register
- IMUL does not have an rmr variant
- HW supports immediate folding
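The preferred split can be illustrated with a toy rewrite over AT&T-syntax strings. This is only a sketch of the transformation, not a proposed LLVM implementation: `NddMemImmInst` and `split_load_then_ri` are hypothetical names, and the model covers only the ND form `op $imm, mem, dst` shown in the example, where the destination register is available to hold the loaded value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NddMemImmInst:
    """Toy model of an ND 'op $imm, mem, dst' instruction (hypothetical type)."""
    mnemonic: str  # e.g. "subq"
    imm: int       # immediate operand
    mem: str       # AT&T memory operand, including any segment override
    dst: str       # destination register

def split_load_then_ri(inst):
    """Rewrite 'op $imm, mem, dst' as 'mov mem, dst' followed by 'op $imm, dst'."""
    suffix = inst.mnemonic[-1]  # reuse the operand-size suffix for the load
    # The load reads the address registers before writing dst, so no scratch
    # register is needed even if dst is also the base or index register.
    load = f"mov{suffix} {inst.mem}, {inst.dst}"
    op = f"{inst.mnemonic} ${inst.imm}, {inst.dst}"
    return [load, op]

long_inst = NddMemImmInst("subq", 184, "%fs:257(%rbx, %rcx)", "%rax")
for line in split_load_then_ri(long_inst):
    print(line)
```

The segment override stays on the load, so the ri instruction carries no memory operand and stays well below the size limit.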
Relevant topic
I plan to implement TargetInstrInfo::getInstSizeInBytes() for X86 for the split.
As on AArch64, it would return the maximum number of bytes.
Question
Where should I do this split?
a. Rename “X86 Byte/Word Instruction Fixup” and do it there
b. X86 pseudo instruction expansion pass
c. Create a new pass