RFC: code size reduction in X86 by replacing EVEX with VEX encoding

Hi All.

This is an RFC for a proposed target-specific X86 optimization that reduces code size by choosing a shorter encoding for AVX-512 instructions when possible.

When the AVX512F instruction set was introduced in X86, it added 32 registers of 512 bits each, ZMM0-ZMM31, as well as 16 additional XMM registers, XMM16-XMM31, and 16 additional YMM registers, YMM16-YMM31.

To encode the new registers 16-31 and the additional instructions, a new encoding prefix called EVEX, which extends the existing VEX encoding, was introduced as shown below:

The EVEX encoding format:

              EVEX  Opcode  ModR/M  [SIB]  [Disp32] / [Disp8*N]  [Immediate]
# of bytes:   4     1       1       1      4        / 1          1

The existing VEX encoding format:

              [VEX]  OPCODE  ModR/M  [SIB]  [DISP]   [IMM]
# of bytes:   0,2,3  1       1       0,1    0,1,2,4  0,1

Note that the EVEX prefix always takes 4 bytes, whereas the VEX prefix takes at most 3 bytes.

Consequently, when targeting SKX, many instructions that use only the lower registers XMM0-XMM15 or YMM0-YMM15 can be encoded in either the EVEX or the VEX format. In such cases, the VEX encoding reduces code size by ~2 bytes per instruction, even when the code is compiled with the AVX512F/AVX512VL features enabled.

For example, “vmovss %xmm0, 32(%rsp,%rax,4)” has the following 2 possible encodings:

EVEX encoding (8 bytes long):

62 f1 7e 08 11 44 84 08 vmovss %xmm0, 32(%rsp,%rax,4)

VEX encoding (6 bytes long):

c5 fa 11 44 84 20 vmovss %xmm0, 32(%rsp,%rax,4)
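To make the 2-byte difference concrete, here is a small Python sketch that inspects the two byte sequences above. It is only an illustration, not LLVM code. The helper names are made up for this example, and it assumes 64-bit mode with no legacy prefixes, so that a leading 0x62 means EVEX and 0xC4/0xC5 mean 3-byte/2-byte VEX. It also assumes the last byte of each sequence is a disp8, which holds for this particular instruction. Note that the EVEX form stores the displacement compressed (disp8*N, with N = 4 for this scalar 32-bit store), which is why its last byte is 08 rather than 20.

```python
# Byte sequences copied from the vmovss example above.
EVEX_ENCODING = bytes.fromhex("62 f1 7e 08 11 44 84 08")
VEX_ENCODING = bytes.fromhex("c5 fa 11 44 84 20")


def prefix_length(encoding: bytes) -> int:
    """Return the size of the leading VEX/EVEX prefix in bytes.

    Assumes 64-bit mode and no legacy prefixes in front of it.
    """
    first = encoding[0]
    if first == 0x62:
        return 4  # EVEX is always 4 bytes
    if first == 0xC4:
        return 3  # 3-byte VEX
    if first == 0xC5:
        return 2  # 2-byte VEX
    return 0      # no VEX/EVEX prefix


def disp8_offset(encoding: bytes, scale: int) -> int:
    """Decode the trailing signed disp8 and apply the scale factor.

    Use scale=1 for VEX and scale=N for the EVEX disp8*N form.
    """
    disp8 = int.from_bytes(encoding[-1:], "little", signed=True)
    return disp8 * scale


if __name__ == "__main__":
    print(prefix_length(EVEX_ENCODING), prefix_length(VEX_ENCODING))  # 4 2
    print(disp8_offset(EVEX_ENCODING, 4), disp8_offset(VEX_ENCODING, 1))  # 32 32
    print(len(EVEX_ENCODING) - len(VEX_ENCODING))  # 2
```

Both encodings address the same offset of 32 bytes; the whole size difference comes from the prefix.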

See reported Bugzilla bugs about this proposed optimization:

https://llvm.org/bugs/show_bug.cgi?id=23376

https://llvm.org/bugs/show_bug.cgi?id=29162

The proposed implementation adds a table of all EVEX opcodes that can also be encoded via VEX, in a new header file placed under lib/Target/X86.

A new pass is to be added at the pre-emit stage.
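The table-driven rewrite described above could look roughly like the following Python sketch. It is not the proposed LLVM implementation: the `MachineInstr` class, the `compress_to_vex` function and the two table entries are hypothetical stand-ins chosen for illustration, and the real table would cover every EVEX opcode that has a VEX equivalent.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Illustrative EVEX -> VEX opcode table (hypothetical subset).
EVEX_TO_VEX = {
    "VADDPDZ256rm": "VADDPDYrm",
    "VMOVSSZmr": "VMOVSSmr",
}


@dataclass(frozen=True)
class MachineInstr:
    """Hypothetical stand-in for a machine instruction."""
    opcode: str
    vector_regs: Tuple[int, ...]  # indices of the XMM/YMM registers used


def compress_to_vex(instr: MachineInstr) -> Optional[MachineInstr]:
    """Return the VEX form of instr, or None if it must stay EVEX."""
    vex_opcode = EVEX_TO_VEX.get(instr.opcode)
    if vex_opcode is None:
        return None  # no VEX equivalent in the table
    if any(reg >= 16 for reg in instr.vector_regs):
        return None  # registers 16-31 are only encodable with EVEX
    return replace(instr, opcode=vex_opcode)


if __name__ == "__main__":
    low = MachineInstr("VMOVSSZmr", (0,))
    high = MachineInstr("VMOVSSZmr", (17,))
    print(compress_to_vex(low))   # rewritten to VMOVSSmr
    print(compress_to_vex(high))  # None: xmm17 needs EVEX
```

Running the rewrite late, at the pre-emit stage, means register allocation has already decided which registers are used, so the register check is final.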

There is no need for special Opt flags, as the shorter VEX encoding is always preferable when it is available.

Thank you for any comments or questions that you may have.

Sincerely,

Gadi.

I would like a command line option to disable this optimization. That way tests can still verify that EVEX instructions came out of isel by using -show-mc-encoding.

Thanks for the tip.

Indeed, the EVEX opcodes in X86 follow a convenient naming convention that helps with this.
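As a rough illustration of how the naming could be exploited, the Python sketch below derives a candidate VEX opcode name from an EVEX one. It is only a sketch under assumptions: it assumes the 128-bit and 256-bit EVEX opcodes carry a `Z128` or `Z256` marker before the operand-form suffix, that the 256-bit VEX counterpart uses `Y` and the 128-bit one uses no marker, and it ignores scalar opcodes that use a plain `Z`. The function name is hypothetical, and a real generator would still have to check that the derived opcode actually exists.

```python
import re
from typing import Optional

# Assumed shape: <BASE>Z128<form> or <BASE>Z256<form>, e.g. VADDPDZ256rm.
_EVEX_NAME = re.compile(r"^(?P<base>[A-Z0-9]+?)Z(?P<width>128|256)(?P<form>[a-z].*)$")


def vex_name_from_evex(name: str) -> Optional[str]:
    """Derive a candidate VEX opcode name, or None if the name does not fit."""
    match = _EVEX_NAME.match(name)
    if match is None:
        return None  # 512-bit, scalar, or already a VEX opcode
    marker = "Y" if match["width"] == "256" else ""
    return match["base"] + marker + match["form"]


if __name__ == "__main__":
    print(vex_name_from_evex("VADDPDZ256rm"))  # VADDPDYrm
    print(vex_name_from_evex("VADDPDZrm"))     # None
```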

Sincerely,

Gadi.

Thanx. This makes sense.

Note that many tests, mostly under test/CodeGen/X86, are affected by this optimization, and I had to modify them because they check the generated encoding.

If we add such a disabling opt flag, should we now keep two sets of tests, one with the optimization enabled and one with it disabled?

Thanx!

Gadi.

I think that keeping test compatibility is not a reason for an additional “llc” flag. We check encoding in the test/MC/X86 dir.

Is there any option to report out from llc in non-debug mode? It should be an option to control internals of the llc process…

test/MC/X86 goes through the AsmParser. That’s a different path than isel. I’m worried about not being able to see cases where isel is missing a pattern and causes us to still select a VEX instruction. I’ve fixed many such cases recently and I’m sure there are still more. Since simple tests don’t use the larger register set, the encoding is the only way we can tell what isel is doing.

I’m looking at DiagnosticHandler of llc.

Can we extend it for remarks? It would allow us to print remarks about moving from EVEX to VEX.

What do you think?

  • Elena

For ISel, we can write .ll → .mir tests that check the EVEX flavor is correctly selected.

For example, ‘VADDPDZ256rm’ and ‘VADDPDYrm’ are two instructions that can be differentiated in machine IR, but are both emitted as ‘VADDPD’ in machine assembly.

I have not put this suggestion to the test, but I believe it should work.

Hal, that’s a good point. There are more manually-maintained tables in the X86 backend that should probably be tablegened: the memory-folding tables and ReplaceableInstrs, to name a couple.

If you have ideas on how to get these auto-generated, please let us know.

Thanks for the elaborate recipe. Created PR31205 to track opportunities for tablegening.