The patch is detecting cases where llvm wants to create a punpck instruction where the first operand is undefined. Meaning llvm knows that the value for the bits don’t matter. If we find such a case, we just apply the other src register to both inputs instead. The first line of each of the Pats is the source pattern, the second line is the result pattern. In the source pattern you can see the “undef”, in the result pattern you can see that VR128:$src is listed twice.
That makes sense. So if the source and the destination are forced to be the same register and the destination of the previous instruction anyway needs to be the same as the source of the current one then the whole sequence would just be the same register.
Did you run make check-all or make check-llvm? It definitely should affect a bunch of tests in llvm/test/CodeGen/X86/
Ah. I had run just the clang tests. Now I’m seeing the errors. I suppose that they work like “snapshot” tests where you have an expected output and the expected output has changed now that LLVM functions differently. The change could be innocuous or, like the one you found, it could add instructions.
At least for the test case I copied from, the movdqa was needed because the register being copied from was used after the punpck. So it would be a bug for the punpck to overwrite it.
I see. That would fix false positives in Valgrind around that second register, too, but clearly it’s a bad idea to add extra instructions into the code just to satisfy Valgrind’s imperfect memchecker, right? So this seems like a bad solution.
That pass runs before register allocation. Undef here means that when we get to register allocation the allocator has permission to use any register it finds convenient. Unlike valgrind, LLVM knows that the bits that will become X don’t matter so it gave the register allocator permission to use any register. My proposal here wouldn’t handle cases where xmm2 from the punpcklbw is used by another instruction after the punpcklwd that reads xmm3.
That seems great to me! Valgrind is a best-effort memchecker. Valgrind doesn’t have a goal of complete elimination of false positives anyway. I like the idea of clang accommodating Valgrind where possible but not at the expense of performance. Would there be any downside to performance with this change? It adds a constraint on registers, which in general could reduce performance, but if it’s done as you suggest in collectTiedOperands then the constraint shouldn’t have effects beyond the current instruction so maybe it won’t?
Is it difficult to do? Can we try it? I can test it against the Valgrind bug and also on another test case that I have with libboost.