RFC: code size reduction in X86 by replacing EVEX with VEX encoding

Haber_Gadi · November 23, 2016, 11:50am

Hi All.

This is an RFC for a proposed target-specific X86 optimization that reduces code size by choosing a shorter encoding for AVX-512 instructions when possible.

When the AVX512F instruction set was introduced in X86, it added 32 registers of 512 bits each, ZMM0-ZMM31, and extended the XMM and YMM register files with 16 additional registers each: XMM16-XMM31 and YMM16-YMM31.

To encode the new registers 16-31 and the additional instructions, a new encoding prefix called EVEX, which extends the existing VEX encoding, was introduced, as shown below:

The EVEX encoding format:

Field:       EVEX  Opcode  ModR/M  [SIB]  [Disp32] / [Disp8*N]  [Immediate]
# of bytes:  4     1       1       1      4 / 1                 1

The existing VEX encoding format:

Field:       [VEX]  OPCODE  ModR/M  [SIB]  [DISP]   [IMM]
# of bytes:  0,2,3  1       1       0,1    0,1,2,4  0,1

Note that the EVEX prefix always takes 4 bytes, whereas the VEX prefix takes at most 3 bytes.

Consequently, for the SKX architecture, many instructions that use only the lower registers XMM0-XMM15 or YMM0-YMM15 can be encoded in either the EVEX or the VEX format. In such cases, using the VEX encoding reduces code size by about 2 bytes per instruction, even when the code is compiled with the AVX512F/AVX512VL features enabled.

For example, “vmovss %xmm0, 32(%rsp,%rax,4)” has the following two possible encodings:

EVEX encoding (8 bytes long):

62 f1 7e 08 11 44 84 08 vmovss %xmm0, 32(%rsp,%rax,4)

VEX encoding (6 bytes long):

c5 fa 11 44 84 20 vmovss %xmm0, 32(%rsp,%rax,4)
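The byte counts in this example can be checked with a small sketch. It assumes the one-byte displacement rules as I understand them: VEX stores the displacement as-is, while EVEX stores it divided by N (disp8*N), with N = 4 for a scalar single-precision access such as vmovss. The function names are hypothetical and are not part of LLVM.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Returns the value stored in a one-byte displacement field, or nullopt if
// the displacement cannot be represented that way.
// Scale is 1 for VEX (plain disp8) and N for EVEX (disp8*N).
std::optional<int8_t> encodeDisp8(int32_t Disp, int32_t Scale) {
  if (Disp % Scale != 0)
    return std::nullopt; // not a multiple of N: compressed form unavailable
  int32_t Scaled = Disp / Scale;
  if (Scaled < -128 || Scaled > 127)
    return std::nullopt; // does not fit in a signed byte
  return static_cast<int8_t>(Scaled);
}

// Total length = prefix + opcode + ModR/M + SIB + displacement.
// Assumes a memory operand that needs a SIB byte and a displacement, and no
// immediate, as in the vmovss example.
unsigned encodedLength(unsigned PrefixBytes, int32_t Disp, int32_t Scale) {
  unsigned DispBytes = encodeDisp8(Disp, Scale).has_value() ? 1 : 4;
  return PrefixBytes + 1 /*opcode*/ + 1 /*ModR/M*/ + 1 /*SIB*/ + DispBytes;
}

// Checks the numbers against the two encodings listed in the post.
void checkVmovssExample() {
  assert(encodeDisp8(32, 1).value() == 0x20); // VEX: last byte is 20
  assert(encodeDisp8(32, 4).value() == 0x08); // EVEX: last byte is 08
  assert(encodedLength(2, 32, 1) == 6);       // 2-byte VEX prefix
  assert(encodedLength(4, 32, 4) == 8);       // 4-byte EVEX prefix
}
```

This also shows why the saving is only approximately 2 bytes: a displacement that fits disp8*N but not a plain disp8 (or the reverse) changes the difference.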

See reported Bugzilla bugs about this proposed optimization:

https://llvm.org/bugs/show_bug.cgi?id=23376

https://llvm.org/bugs/show_bug.cgi?id=29162

The proposed implementation adds a table of all EVEX opcodes that can be encoded via VEX, in a new header file placed under lib/Target/X86.

A new pass is to be added at the pre-emit stage.

No special optimization flags are needed, as it is always better to use the shorter VEX encoding when possible.
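A minimal sketch of the table lookup, under stated assumptions: the opcode names follow LLVM's naming style but are declared here purely for illustration, the Instr struct is a hypothetical stand-in for a machine instruction, and the table is assumed to list only unmasked, non-broadcast EVEX forms. It is not the actual pass.

```cpp
#include <cassert>
#include <vector>

// Illustrative opcode identifiers; a real backend has thousands of these.
enum Opcode : unsigned {
  VADDPDYrm,    // VEX, 256-bit
  VMOVSSmr,     // VEX, scalar store
  VADDPDZ256rm, // EVEX, 256-bit
  VMOVSSZmr,    // EVEX, scalar store
  VADDPDZrm,    // EVEX, 512-bit: no VEX equivalent exists
};

struct TableEntry {
  Opcode EvexOpcode;
  Opcode VexOpcode;
};

// Hypothetical excerpt of the EVEX-to-VEX table.
constexpr TableEntry EvexToVexTable[] = {
    {VADDPDZ256rm, VADDPDYrm},
    {VMOVSSZmr, VMOVSSmr},
};

// Stand-in for a machine instruction: an opcode plus the indices (0..31)
// of the XMM/YMM registers it uses.
struct Instr {
  Opcode Op;
  std::vector<unsigned> VecRegs;
};

// Rewrites MI to its VEX opcode if the table has an entry for it and every
// vector register is in the range VEX can encode. Returns true on change.
bool tryCompressToVEX(Instr &MI) {
  for (const TableEntry &E : EvexToVexTable) {
    if (E.EvexOpcode != MI.Op)
      continue;
    for (unsigned Reg : MI.VecRegs)
      if (Reg >= 16)
        return false; // XMM16-31 / YMM16-31 require EVEX
    MI.Op = E.VexOpcode;
    return true;
  }
  return false; // no VEX equivalent in the table
}

// Example: a store from xmm0 is compressed, one using ymm17 is not.
void checkCompression() {
  Instr Store{VMOVSSZmr, {0}};
  assert(tryCompressToVEX(Store) && Store.Op == VMOVSSmr);
  Instr HighReg{VADDPDZ256rm, {17, 1}};
  assert(!tryCompressToVEX(HighReg) && HighReg.Op == VADDPDZ256rm);
}
```

A linear scan keeps the sketch short; a table covering every eligible opcode would more plausibly be sorted and searched by binary search.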

Thank you for any comments or questions that you may have.

Sincerely,

Gadi.

Finkel_Hal_J · November 23, 2016, 1:01pm

topperc · November 23, 2016, 4:12pm

I would like a command line option to disable this optimization. That way tests can still verify that EVEX instructions came out of isel by using -show-mc-encoding.

Haber_Gadi · November 24, 2016, 7:22am

Thanks for the tip.

Indeed, the EVEX opcodes in X86 have a convenient naming scheme that helps with this.

Sincerely,

Gadi.

Haber_Gadi · November 24, 2016, 7:27am

Thanx. This makes sense.

Note that there are many tests, mostly under test/CodeGen/X86, that are affected by this optimization and I had to modify them as they include a check of the generated encoding.

If we add such a flag to disable the optimization, should we now keep two sets of tests, one with the optimization enabled and one with it disabled?

Thanx!

Gadi.

Demikhovsky_Elena · November 24, 2016, 8:20am

I would like a command line option to disable this optimization. That way tests can still verify that EVEX instructions came out of isel by using -show-mc-encoding.

I think that keeping test compatibility is not a reason for an additional “llc” flag. We check encoding in the test/MC/X86 dir.

Is there any way to get llc to report this in non-debug mode? There should be an option to control the internals of the llc process…

topperc · November 24, 2016, 2:30pm

test/MC/X86 goes through the AsmParser. That’s a different path than isel. I’m worried about not being able to see cases where isel is missing a pattern and causes us to still select a VEX instruction. I’ve fixed many such cases recently and I’m sure there are still more. Since simple tests don’t use the larger register set, the encoding is the only way we can tell what isel is doing.

Demikhovsky_Elena · November 27, 2016, 1:18pm

I’m looking at DiagnosticHandler of llc.

Can we extend it for remarks? It would allow us to print remarks about moving from EVEX to VEX.

What do you think?

Elena

Rackover_Zvi · November 28, 2016, 2:38pm

For ISel, we can write .ll → .mir tests that check the EVEX flavor is correctly selected.

For example, ‘VADDPDZ256rm’ and ‘VADDPDYrm’ are two instructions that can be differentiated in machine IR, but are both emitted as ‘VADDPD’ in machine assembly.

I did not put this suggestion to test, but I believe it should work.
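The distinction can be illustrated with a toy sketch. The enum and the mnemonic function below are hypothetical stand-ins, not LLVM's real opcode tables or printer; they only show that two distinct opcodes can print identically, so a check at the opcode level sees what an assembly-text check cannot.

```cpp
#include <cassert>
#include <string>

// The two opcodes named above, declared here only for illustration.
enum Opcode { VADDPDYrm, VADDPDZ256rm };

// Stand-in for the assembly printer: both opcodes share one mnemonic.
std::string mnemonic(Opcode Op) {
  switch (Op) {
  case VADDPDYrm:    // VEX-encoded form
  case VADDPDZ256rm: // EVEX-encoded form
    return "vaddpd";
  }
  return "<unknown>";
}

void checkSameMnemonic() {
  // Different at the machine-IR level, identical in the printed assembly.
  assert(VADDPDYrm != VADDPDZ256rm);
  assert(mnemonic(VADDPDYrm) == mnemonic(VADDPDZ256rm));
}
```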

Rackover_Zvi · November 28, 2016, 2:50pm

Hal, that’s a good point. There are more manually-maintained tables in the X86 backend that should probably be tablegened: the memory-folding tables and ReplaceableInstrs, to name a couple.

If you have ideas on how to get these auto-generated, please let us know.

Finkel_Hal_J · November 28, 2016, 4:55pm

Rackover_Zvi · November 29, 2016, 3:09pm

Thanks for the elaborate recipe. Created PR31205 to track opportunities for tablegening.
