RFC: Insertion of nops for performance stability

Hi all,

These days I am working on a feature that inserts nops during IA (Intel Architecture) code generation in order to provide performance improvements and performance stability. This feature will not affect other architectures. It will, however, set up an infrastructure for other architectures to do the same, if ever needed.

Here are some examples for cases in which nops can improve performance:

  1. DSB (Decoded Stream Buffer) Thrashing.

DSB is a cache for uops that have been decoded. It is limited to 3 ways per 32B window; 4 or more ways will cause a performance degradation. This can be avoided by aligning with nops. See Zia Ansari’s presentation for more information: http://schd.ws/hosted_files/llvmdevelopersmeetingbay2016/d9/LLVM-Conference-US-2016-Code-Alignment.pdf

  2. Two or more branches with the same target in a single instruction window may cause poor branch prediction.

Our recommendation to programmers is to keep two branches out of the same 16B chunk if they both go to the same target. From the manual:
“Avoid putting two conditional branch instructions in a loop so that both have the same branch target address and, at the same time, belong to (i.e. have their last bytes’ addresses within) the same 16-byte aligned code block.”

Let’s see an example for the second case. Consider the following code:

int y;
int x;

int foo(int i, int m, int p, int q, int *p1, int *p2)
{
  if (i + *p1 == p || i - *p2 > m || i == q) {
    y++;
    x++;
  }
  return 0;
}

The last two || clauses of the if will be translated into two conditional jumps to the same target. The generated code will look like this:

foo:
0: 48 8b 44 24 28 movq 40(%rsp), %rax
5: 8b 00 movl (%rax), %eax
7: 01 c8 addl %ecx, %eax
9: 44 39 c0 cmpl %r8d, %eax
c: 75 0f jne 15 <foo+0x1D>
e: ff 05 00 00 00 00 incl (%rip)
14: ff 05 00 00 00 00 incl (%rip)
1a: 31 c0 xorl %eax, %eax
1c: c3 retq
1d: 44 39 c9 cmpl %r9d, %ecx
20: 74 ec je -20 <foo+0xE>
22: 48 8b 44 24 30 movq 48(%rsp), %rax
27: 2b 08 subl (%rax), %ecx
29: 39 d1 cmpl %edx, %ecx
2b: 7f e1 jg -31 <foo+0xE>
2d: 31 c0 xorl %eax, %eax
2f: c3 retq

Note: the first if clause's jump is the jne instruction at 0x0C, the second's is the jg instruction at 0x2B and the third's is the je instruction at 0x20. Also note that the jg and the je share a 16-byte window, which is exactly the situation we wish to avoid (consider the case in which foo is called from inside a loop; this will cause a performance penalty).

The generated code after my changes will look like so:

foo:
0: 48 8b 44 24 28 movq 40(%rsp), %rax
5: 8b 00 movl (%rax), %eax
7: 01 c8 addl %ecx, %eax
9: 44 39 c0 cmpl %r8d, %eax
c: 75 0f jne 15 <foo+0x1D>
e: ff 05 00 00 00 00 incl (%rip)
14: ff 05 00 00 00 00 incl (%rip)
1a: 31 c0 xorl %eax, %eax
1c: c3 retq
1d: 44 39 c9 cmpl %r9d, %ecx
20: 74 ec je -20 <foo+0xE>
22: 48 8b 44 24 30 movq 48(%rsp), %rax
27: 2b 08 subl (%rax), %ecx
29: 39 d1 cmpl %edx, %ecx
2b: 0f 1f 40 00 nopl (%rax)
2f: 7f dd jg -35 <foo+0xE>
31: 31 c0 xorl %eax, %eax
33: c3 retq

The only change is the nopl at 0x2B, before the jg, which pushes the jg instruction into the next 16-byte window and thus avoids the undesired situation. Note that, in this case, in order to push an instruction into the next window it is sufficient to push its last byte into the next window.

Because these performance nops often rely on the layout of the code (e.g. the number of instructions in a 16/32B window), which is only known at assembly time, the insertion is divided into two phases:

  1. Marking “suspicious” places during encoding, i.e. hotspots that may cause a performance penalty.

  2. Computing the actual number (zero or more) of nops required in order to avoid the performance penalty.

To achieve this, I am introducing a new kind of MCFragment named MCPerfNopFragment, which will be the connecting link between the two phases:

  1. During encoding, the object streamer (MCObjectStreamer) will query the backend about whether an MCPerfNopFragment is needed before every encoded instruction. If required, an MCPerfNopFragment will be created and inserted into the MCSection.

  2. During the assembly phase:

a. When computing the fragment size in MCAssembler::computeFragmentSize, the assembler will consult the backend, providing it the layout, in order to determine the required size of the fragment. Note that the computed size may change from call to call, similarly to the process of computing the size of an MCAlignFragment.

b. When writing the fragment, the assembler will again use the backend in order to write the required number of nops.

Comments and questions are welcome.

Thanks,

Omer

Hi Stephen,

I am actually not sure myself whether it is better to insert extra instructions rather than increase the length of existing ones. In terms of the current design, however, it would complicate things.
In my solution the insertion of the nops integrates with the layout process in the assembler, in which every MCFragment is processed in turn and its layout is determined. When it is time for an MCFragment to be laid out, we can rely on the layout of the MCFragments prior to it, but we cannot rely on the layout of the MCFragments that follow it, since they have not been laid out yet and thus their layout is not yet known.
For an MCFragment of the new kind, MCPerfNopFragment, "laying out" means computing the number of nops it will contain. Luckily, all the cases that need nops require only earlier instruction data, which is already set and laid out, and not later instruction data (keep in mind that an MCPerfNopFragment will always appear right before a "potentially hazardous" instruction and will be aware of that instruction).
In order to change instructions into larger versions of themselves, we would have to do one of two things:
1. Rely on later instruction data when laying out the fragments that contain those instructions. As I mentioned, we cannot do this accurately, since the layout has not yet been set for the fragments that follow.
2. Retroactively change the layout of the fragments that contain those instructions once we encounter an MCPerfNopFragment. This can cause complications, as it would require changing the layout while performing the layout. Some fragments between the currently computed MCPerfNopFragment and the MCFragment whose layout we wish to change may rely on the already determined layout; MCAlignFragments, for example, strongly rely on the layout of their predecessors. The entire process of fragment layout would then become very complex.

Please let me know if you have any thoughts on how to get around these issues.

Thanks,
Omer

Hi Hal,

A pre-emit pass would indeed be preferable. I originally thought of that, too; however, I could not figure out how such a pass could have access to information about instruction sizes and block alignments.

I know that, for X86 at least, branch relaxation happens during the layout phase in the assembler, which is where I plan to integrate the nop insertion, such that the new MCPerfNopFragment will just be yet another kind of fragment to handle when laying out all the fragments: laying out a relaxable branch means relaxing it if necessary, and laying out an MCPerfNopFragment means computing the number of nops it will contain.

Can you please refer me to a pre-emit pass that does something similar?

Thanks,

Omer

Hi Hal,

Thanks for the reference. I’ve looked at PPCBranchSelector and the PowerPC backend. It is very different from the X86 architecture, and unfortunately the way branch relaxation and alignment-related issues are handled in PPC cannot be copied to X86. This is because:

  1. PPC instructions are of fixed length, while X86 instructions are of variable length, and their length can change depending on the features of the target machine. Not only that, but a relaxed X86 instruction will differ in size from its non-relaxed version. The function TargetInstrInfo::getInstSizeInBytes is not, and cannot be, implemented for X86. It follows that:

  2. Target addresses of branches cannot be computed before code emission in X86. That’s why X86 relies on the layout process in MCAssembler, which occurs after the encoding is done and (among other things) relaxes the instructions that need relaxation.

For the same reasons, the nop insertion process also cannot occur until after the encoding phase.

Regarding splitting blocks of code, that is essentially what I’m proposing, but at the MC level (since this is the level I’m operating in). When needed, an MCPerfNopFragment will be a separator between two MCFragments that contain code (e.g. MCDataFragments). However, an MCPerfNopFragment is not an MCAlignFragment, and the block that follows it will not (necessarily) be aligned, since that could be wasteful in terms of code size. Consider the example I provided: aligning the fragment after the nops to 16B would cost one additional byte of nops in code size, and aligning it to 32B would cost 17 additional bytes of nops in code size. There is no benefit in these extra bytes.

As for filling the slots with other instructions, I don’t think that would be right, because:

  1. It would require some sort of branch prediction unit that works at compile time. Branch prediction is done at run time by the machine and involves many parameters. In my opinion, trying to imitate such a mechanism would produce either very complex code or inaccurate predictions.

  2. Executing those other instructions can be very expensive compared to executing the nops. This means that a wrong (compile-time) branch prediction would result in a performance penalty that might even overshadow the benefit of the spacing those instructions provide.

  3. As I mentioned, this process has to be done in the assembler, after all the passes and the code emission. In my opinion, the assembler phase is not a good time to change the locations of instructions.

Thanks,

Omer

Hi Paparo,

I know it has been almost a year. I would like to know whether you have plans to contribute this back? We have some performance instability at Facebook that would benefit from a pass like this. Thanks.
-Xin

Hi Xin,

The first phase of this work was uploaded for review in two parts:

https://reviews.llvm.org/D34393

https://reviews.llvm.org/D34396

The reviews are currently pending.

Thanks,

Omer