Help me modify llvm optimizer to eliminate valgrind false positive?

I spent a few days debugging a valgrind memcheck false positive that was introduced by a new optimization in clang version 10.

Valgrind is mistakenly reporting an error due to the way that clang 10 is emitting asm. But it’s a trivial matter to fix the code. I just needed to change this:

401697: 66 0f 6e d2 movd %edx,%xmm2
40169b: 66 0f 60 d2 punpcklbw %xmm2,%xmm2
40169f: 66 0f 61 da punpcklwd %xmm2,%xmm3

to this:

4016a3: 66 0f 6e da movd %edx,%xmm3
4016a7: 66 0f 60 db punpcklbw %xmm3,%xmm3
4016ab: 66 0f 61 db punpcklwd %xmm3,%xmm3

So instead of using %xmm2 and then %xmm3, just use %xmm3 throughout. This was a tiny change and I was able to modify the executable by hand to do it.

Is it possible to modify clang/llvm to do this where possible? Valgrind is a popular tool for finding bugs in code, and if it's not too much trouble for LLVM and there's no performance penalty, maybe LLVM could accommodate it?

I'm willing to do the work but I'm not sure where to start.

There are more details about the issue in the Valgrind bug report: https://bugs.kde.org/show_bug.cgi?id=432801#c12

Thanks for your help!

Eyal

Hi Eyal,

LLVM knows that those bits are garbage and did this consciously. But it does create a false execution dependency on xmm3, potentially delaying execution depending on which instruction last produced xmm3. This is definitely not the only case where this can happen.

This is probably the easiest fix; it avoids ever creating the false dependency by recognizing that there is only one real input and assigning it to both sources:

--- a/llvm/lib/Target/X86/X86InstrSSE.td
+++ b/llvm/lib/Target/X86/X86InstrSSE.td
@@ -3917,6 +3917,26 @@ let Constraints = "$src1 = $dst" in {
 }
 } // ExeDomain = SSEPackedInt

+let Predicates = [UseSSE2] in {
+def : Pat<(v16i8 (X86Unpckl undef, VR128:$src)),
+          (PUNPCKLBWrr VR128:$src, VR128:$src)>;
+def : Pat<(v8i16 (X86Unpckl undef, VR128:$src)),
+          (PUNPCKLWDrr VR128:$src, VR128:$src)>;
+def : Pat<(v4i32 (X86Unpckl undef, VR128:$src)),
+          (PUNPCKLDQrr VR128:$src, VR128:$src)>;
+def : Pat<(v2i64 (X86Unpckl undef, VR128:$src)),
+          (PUNPCKLQDQrr VR128:$src, VR128:$src)>;
+}

It worked! I have spent the last hour reading up on TableGen and I still don't understand any of that patch, but it definitely solved the problem in my test case. I also ran the clang tests with this change and they all passed, so I'm not sure where you're seeing the change in behavior around the movdqa. Maybe I ran them incorrectly?

I think that I understand the bit about the delay on pre-Ivybridge CPUs: the delay is because those CPUs aren't able to run at full speed when the mov is immediately followed by the use of its result. It's a strange mov, though. Why did it even do that? Couldn't it just interleave the xmm register with itself?

As for collectTiedOperands, does undef mean that the register has yet to be allocated to a specific register in the machine? What is the situation that you think that Valgrind will still have trouble with?

Thanks for your help!

Eyal

> It worked! I have spent the last hour reading up on TableGen and I still don't understand any of that patch, but it definitely solved the problem in my test case. I also ran the clang tests with this change and they all passed, so I'm not sure where you're seeing the change in behavior around the movdqa. Maybe I ran them incorrectly?

The patch detects cases where LLVM wants to create a punpck instruction whose first operand is undefined, meaning LLVM knows that the values of those bits don't matter. If we find such a case, we just use the other source register for both inputs instead. The first line of each Pat is the source pattern and the second line is the result pattern: in the source pattern you can see the undef, and in the result pattern you can see that VR128:$src is listed twice.

Did you run make check-all or make check-llvm? It definitely should affect a bunch of tests in llvm/test/CodeGen/X86/

> I think that I understand the bit about the delay on pre-Ivybridge CPUs: the delay is because those CPUs aren't able to run at full speed when the mov is immediately followed by the use of its result. It's a strange mov, though. Why did it even do that? Couldn't it just interleave the xmm register with itself?

At least for the test case I copied from, the movdqa was needed because the register being copied from was used after the punpck. So it would be a bug for the punpck to overwrite it.

> As for collectTiedOperands, does undef mean that the register has yet to be allocated to a specific register in the machine? What is the situation that you think that Valgrind will still have trouble with?

That pass runs before register allocation. Undef here means that when we get to register allocation, the allocator has permission to use any register it finds convenient. Unlike Valgrind, LLVM knows that the bits that will become X don't matter, so it gave the register allocator permission to use any register. My proposal here wouldn't handle cases where the xmm2 from the punpcklbw is used by another instruction after the punpcklwd that reads xmm3.
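For example (a hypothetical sequence, not taken from the actual test case), something like this would defeat the idea, because xmm2 is still live after the punpcklwd and so the two registers can't be collapsed into one:

```asm
movd      %edx,%xmm2
punpcklbw %xmm2,%xmm2
punpcklwd %xmm2,%xmm3    # xmm3's stale bits are still read here
paddb     %xmm2,%xmm4    # xmm2 is read again, so it can't be replaced by xmm3
```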

> The patch detects cases where LLVM wants to create a punpck instruction whose first operand is undefined, meaning LLVM knows that the values of those bits don't matter. If we find such a case, we just use the other source register for both inputs instead. The first line of each Pat is the source pattern and the second line is the result pattern: in the source pattern you can see the undef, and in the result pattern you can see that VR128:$src is listed twice.

That makes sense. So if the source and the destination are forced to be the same register and the destination of the previous instruction anyway needs to be the same as the source of the current one then the whole sequence would just be the same register.

> Did you run make check-all or make check-llvm? It definitely should affect a bunch of tests in llvm/test/CodeGen/X86/

Ah, I had run just the clang tests. Now I'm seeing the errors. I suppose they work like "snapshot" tests, where there's an expected output and that expected output has changed now that LLVM behaves differently. The change could be innocuous or, like the one you found, it could add instructions.

> At least for the test case I copied from, the movdqa was needed because the register being copied from was used after the punpck. So it would be a bug for the punpck to overwrite it.

I see. That would fix false positives in Valgrind around that second register too, but clearly it's a bad idea to add extra instructions to the code just to satisfy Valgrind's imperfect Memcheck, right? So this seems like a bad solution.

> That pass runs before register allocation. Undef here means that when we get to register allocation, the allocator has permission to use any register it finds convenient. Unlike Valgrind, LLVM knows that the bits that will become X don't matter, so it gave the register allocator permission to use any register. My proposal here wouldn't handle cases where the xmm2 from the punpcklbw is used by another instruction after the punpcklwd that reads xmm3.

That seems great to me! Valgrind is a best-effort memchecker; it doesn't aim to eliminate false positives completely anyway. I like the idea of clang accommodating Valgrind where possible, but not at the expense of performance. Would there be any performance downside to this change? It adds a constraint on registers, which in general could reduce performance, but if it's done as you suggest in collectTiedOperands, then the constraint shouldn't have effects beyond the current instruction, so maybe it won't?

Is it difficult to do? Can we try it? I can test it against the Valgrind bug and also on another test case that I have with libboost.

Eyal

Never mind, I figured out how to do it in Valgrind. The solution makes all pcmpgtd instructions two to three times slower, on top of the slowdown Valgrind already imposes, but hopefully those instructions are rare enough that it won't matter.

Whereas previously Valgrind would just assume that any uncertainty in the inputs to pcmpgtd causes uncertainty in the output, it now calculates the maximum and minimum possible values of each operand to pcmpgtd. If the max of one is less than the min of the other, or vice versa, we can be certain of the result. There's a little extra work because the maximum isn't just a matter of setting all unknown bits to 1: the numbers are in two's complement. So all values have their MSB flipped before and after computing the min and max, so that the comparison works out as a signed one. SSE2 has no unsigned vector comparison, and Valgrind should not generate instructions beyond the targeted architecture, so this is how it's done.

The LLVM bug, https://bugs.llvm.org/show_bug.cgi?id=49372, was closed as won't fix.

This is my first time writing x86 asm, so I'm not sure whether what I've written is very efficient. If you have pointers on how I could do it better, I'd appreciate them! https://github.com/eyal0/valgrind/commit/899ea491e358013579f87e443beff0a30c69e348

Eyal