[PATCH] D50328: [X86][SSE] Combine (some) target shuffles with multiple uses

[NOTE: Removed Phab and reviewers]

================
Comment at: test/CodeGen/X86/2012-01-12-extract-sv.ll:12
+; CHECK-NEXT: vblendps {{.*#+}} xmm1 = xmm1[0],xmm2[1,2,3]
+; CHECK-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
; CHECK-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
----------------
greened wrote:

Can we make this test less brittle by using FileCheck variables?
This goes for pretty much every test in this patch.
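For example (a sketch, not from the patch), the first check above could
capture whichever registers are chosen and reuse them:

  ; CHECK-NEXT: vblendps {{.*#+}} [[R1:xmm[0-9]+]] = [[R1]][0],[[R2:xmm[0-9]+]][1,2,3]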

I'm sorry but no - it's been repeatedly proven that using
update_llc_test_checks.py on the majority of x86 tests is the way
forward - it speeds up creation of tests (x86 by far has the highest
test coverage), makes regeneration of checks trivial, and it prevents
dodgy code from being 'hidden' (either on purpose or by
accident). Additionally, many x86 subtargets have different instruction
behaviours depending on the registers used, so hiding the registers
behind regexps makes it that much more difficult to track.

Ok, humor me a bit, as I am not at all familiar with
update_llc_test_checks.py.

How is it "proven" that the script is better than writing robust and
focused tests? IME, updating tests to reflect changed behavior is
problematic. How do we know the updates reflect "better" compiler
behavior? Suppose I make a change and a bunch of llc tests fail. What
do I do? I have to examine each one to see if the new behavior is
"correct" or not. Assuming I think it is, I guess I run
update_llc_test_checks.py. But I didn't write the test so how do I know
the new output is "correct?"

How does the script speed up creation of new tests?

What kinds of dodgy code being hidden are you concerned about? Can you
provide an example?

If a test is looking for different behavior based on register use, then
of course the test should be written to detect specific register uses.
If the register use changes, the test fails because a bug was
introduced.

I would bet 95% of X86 tests don't care what registers are used
whatsoever. From my perspective, it would be much better to have those
95% of tests never change over time, leaving a much smaller set of
tests that may (ideally, will not ever) need to be changed, because
they are written to expect specific sequences, and violations of those
expectations are bugs.

Tests should be focused. If we're using, for example, a bunch of tests
developed for shuffle selection as proxies for register allocation,
that's not good. We should have specific register allocation tests.

Is this unreasonable? Can we not incrementally improve the tests?

                              -David

[NOTE: Removed Phab and reviewers]

================
Comment at: test/CodeGen/X86/2012-01-12-extract-sv.ll:12
+; CHECK-NEXT: vblendps {{.*#+}} xmm1 = xmm1[0],xmm2[1,2,3]
+; CHECK-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
  ; CHECK-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
----------------
greened wrote:

Can we make this test less brittle by using FileCheck variables?
This goes for pretty much every test in this patch.

I'm sorry but no - it's been repeatedly proven that using
update_llc_test_checks.py on the majority of x86 tests is the way
forward - it speeds up creation of tests (x86 by far has the highest
test coverage), makes regeneration of checks trivial, and it prevents
dodgy code from being 'hidden' (either on purpose or by
accident). Additionally, many x86 subtargets have different instruction
behaviours depending on the registers used, so hiding the registers
behind regexps makes it that much more difficult to track.

Ok, humor me a bit, as I am not at all familiar with
update_llc_test_checks.py.

How is it "proven" that the script is better than writing robust and
focused tests? IME, updating tests to reflect changed behavior is
problematic. How do we know the updates reflect "better" compiler
behavior? Suppose I make a change and a bunch of llc tests fail. What
do I do? I have to examine each one to see if the new behavior is
"correct" or not. Assuming I think it is, I guess I run
update_llc_test_checks.py. But I didn't write the test so how do I know
the new output is "correct?"

Changing a test's IR to avoid an issue in a patch is very problematic, but if a test's codegen changes because of a patch then the change just needs to be reviewed, preferably by someone who has touched that test in the past.

How does the script speed up creation of new tests?

By automating the creation of the CHECKs, allowing the developer to focus on writing/reducing the IR. There is no way we could have added the exhaustive tests for much of the custom lowering (vector and scalar codegen) across so many x86 subtargets without the scripts to do the checks - it would have taken far too long, and people would simply not have spent the time to add as much coverage, so we would have ended up missing things.
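As a rough illustration, all the developer has to write by hand is the
RUN line and the IR (a made-up test, not one from the patch):

  ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
  define <4 x i32> @splat0(<4 x i32> %a) {
    %s = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> zeroinitializer
    ret <4 x i32> %s
  }

Running update_llc_test_checks.py on that file then fills in (and later
refreshes) the full set of CHECK lines for every RUN configuration.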

What kinds of dodgy code being hidden are you concerned about? Can you
provide an example?

It happens in many ways. For example: the dodgy code often hides in the intro/exit code that isn't considered directly relevant; or a test checks that exactly one instance of an instruction occurs but doesn't consider that a similar instruction might be there as well (common with shuffles); or a patch fails to completely show how 'pre-committed' tests are changed. The update script helps ensure we have a whole view of the codegen to avoid these, and then in review more eyes can check it.
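To make up a concrete instance of the shuffle case: a hand-written check
such as

  ; CHECK: pshufd

keeps passing even if codegen regresses and a second, redundant shuffle
appears, whereas autogenerated CHECK-NEXT lines over the whole function
would flag it immediately.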

If a test is looking for different behavior based on register use, then
of course the test should be written to detect specific register uses.
If the register use changes, the test fails because a bug was
introduced.

I would bet 95% of X86 tests don't care what registers are used
whatsoever. From my perspective, it would be much better to have those
95% of tests never change over time, leaving a much smaller set of
tests that may (ideally, will not ever) need to be changed, because
they are written to expect specific sequences, and violations of those
expectations are bugs.

True - many test cases don't care about the exact register used. But a register/codegen change indicates a change in behaviour that can be investigated, which helps ensure the patch is doing what you want.

Tests should be focused. If we're using, for example, a bunch of tests
developed for shuffle selection as proxies for register allocation,
that's not good. We should have specific register allocation tests.

I don't know of any examples where we are doing anything so brittle (although many bug tests could be testing for that "perfect storm" - you can only reduce so far) - we have instructions that can only use certain registers (e.g. DIV/MUL, PBLENDV's implicit use of xmm0, or EVEX's upper 16 zmm registers), and ensuring that we handle those efficiently can be more than a regalloc-only issue.
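For instance, the non-VEX form of PBLENDVB takes its blend mask
implicitly in xmm0, and exact-register checks make that visible - it
shows up in the output roughly as

  ; SSE41: pblendvb %xmm0, %xmm1, %xmm2

where a register-agnostic regex would hide the fact that the mask has
to end up in %xmm0.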

Is this unreasonable? Can we not incrementally improve the tests?

Ensuring tests are focused and continue to test what they are supposed to is never unreasonable.

Simon Pilgrim <llvm-dev@redking.me.uk> writes:

Changing a test's IR to avoid an issue in a patch is very problematic,
but if a test's codegen changes because of a patch then the change
just needs to be reviewed, preferably by someone who has touched that
test in the past.

But wouldn't it be even better if that output didn't need to be changed
at all and therefore didn't need to be reviewed? Right now a lot of X86
changes have a lot of noise in the diff due to test output (asm)
changes. A fair majority of those changes are incidental to a given
patch.

How does the script speed up creation of new tests?

By automating the creation of the CHECKs, allowing the developer to
focus on writing/reducing the IR. There is no way we could have added
the exhaustive tests for much of the custom lowering (vector and
scalar codegen) across so many x86 subtargets without the scripts to
do the checks - it would have taken far too long, and people would
simply not have spent the time to add as much coverage, so we would
have ended up missing things.

Ok, that sounds interesting. The comment at the top of the script
implies that it takes existing tests with CHECK lines and updates them.
Is it also somehow able to *create* CHECK lines from raw asm output?
I'm just trying to understand the tool.
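From a quick skim of it, the usage appears to be something along the
lines of:

  $ utils/update_llc_test_checks.py --llc-binary build/bin/llc \
        test/CodeGen/X86/2012-01-12-extract-sv.ll

Is that right?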

Does the script simply pull in the latest asm and re-emit CHECK lines
with the raw asm? If so, I can see why editing tests to remove
hard-coded registers might be scary. It would mean someone would have
to go back in and change all the hard-coded names back to variables
every time the script is run. Is there some way we can improve the
script to do this automatically?

Auto-generating stuff is nice. But I don't think we should say that it
replaces work to make small, focused and robust tests. Testing is hard.

What kinds of dodgy code being hidden are you concerned about? Can you
provide an example?

It happens in many ways. For example: the dodgy code often hides in
the intro/exit code that isn't considered directly relevant; or a test
checks that exactly one instance of an instruction occurs but doesn't
consider that a similar instruction might be there as well (common
with shuffles); or a patch fails to completely show how
'pre-committed' tests are changed. The update script helps ensure we
have a whole view of the codegen to avoid these, and then in review
more eyes can check it.

I guess I don't see how hard-coding register names helps with those
kinds of things. Ideally "pre-committed" code would not change many
tests at all.

If someone is worried about prolog/epilog code then there should be
tests specifically targeted to that. Not every test needs to include
CHECK lines for that stuff.

I am not suggesting tests need to be perfect, just that we should try to
make them less brittle as we go along, incrementally.

If a test is looking for different behavior based on register use, then
of course the test should be written to detect specific register uses.
If the register use changes, the test fails because a bug was
introduced.

I would bet 95% of X86 tests don't care what registers are used
whatsoever. From my perspective, it would be much better to have those
95% of tests never change over time, leaving a much smaller set of
tests that may (ideally, will not ever) need to be changed, because
they are written to expect specific sequences, and violations of those
expectations are bugs.

True - many test cases don't care about the exact register used. But
a register/codegen change indicates a change in behaviour that can be
investigated, which helps ensure the patch is doing what you want.

But why should everyone review lots of lines that just don't matter? I
find that when I look at X86 changes I tend to gloss over test changes
because there are just so many of them and I am not at all familiar with
what the tests are actually testing. We all have limited time. I could
argue that the excessive noise in diffs actually makes it more likely
that problems creep in.

Tests should be focused. If we're using, for example, a bunch of tests
developed for shuffle selection as proxies for register allocation,
that's not good. We should have specific register allocation tests.

I don't know of any examples where we are doing anything so brittle
(although many bug tests could be testing for that "perfect storm" -
you can only reduce so far) - we have instructions that can only use
certain registers (e.g. DIV/MUL, PBLENDV's implicit use of xmm0, or
EVEX's upper 16 zmm registers), and ensuring that we handle those
efficiently can be more than a regalloc-only issue.

Isn't the very change under discussion an example of this? Lots of
tests have changed hunks that needn't be there. That's brittle.

ISA register assignment requirements should, again, have tests
specifically to ensure we don't break it. I assume we already have many
of them.

Is this unreasonable? Can we not incrementally improve the tests?

Ensuring tests are focused and continue to test what they are
supposed to is never unreasonable.

Good! :-)

Perhaps we can improve the tool to make maintenance easier.

                                  -David

I'm working on the RISC-V port, and I find the tests extremely brittle,
with tests that should be independent getting "broken" (not really
broken - the codegen just changed a little) all the time.

Some examples I've come across recently:

- A 128x128-bit multiply on rv32 generates several dozen inline
instructions. Any change in which temporary register the compiler picks
for any single instruction breaks the test. Any instruction scheduling
change breaks it. An operational test checking the results for some
random values and some extreme values on qemu would, I think, be much
better. All the major LLVM target ISAs are supported in qemu, and I
suppose the rest have some other emulator.

- Several months ago the code generator started emitting the alias "ret"
instead of "jalr x0,x1,0" (jump to the address in register x1, with a +0
offset, and store the old PC in register x0, i.e. discard it). That
broke literally every test (see the sketch just after this list). Other
recent changes, such as emitting "mv a,b" instead of "addi a,b,0", broke
tests all over the place, though of course fewer of them.

- Another patch (not yet merged to master) outputs stack use metadata
just after the addi that adjusts the stack pointer at the start of a
function. It's not even actual code, but it breaks every test that makes
a stack frame (thankfully most tests are too simple to need one).
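To sketch the "ret" change in CHECK-line terms, every test that used to
end with roughly

  ; CHECK: jalr x0, x1, 0

had to be regenerated to end with

  ; CHECK: ret

even though nothing about the tested behaviour had changed.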

Yes, there are scripts to automatically regenerate tests - hopefully
after making sure that the changes to the output are not in fact bugs
:-). It seems pretty cumbersome to use the output from llvm-lit to track
down what it's actually complaining about, so I'm gravitating to just
running the update script for all tests and then using git diff to see
what it changed. And, as mentioned, the diffs in the tests are often
orders of magnitude bigger than the diffs for the actual code.
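That is, something like:

  $ utils/update_llc_test_checks.py --llc-binary build/bin/llc \
        test/CodeGen/RISCV/*.ll
  $ git diff --stat test/CodeGen/RISCV

and then eyeballing the result.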

Is this the same for all ISAs? Is it really the best way? Maybe it calms
down once a back-end is more mature.

Bruce Hoult <brucehoult@sifive.com> writes:

I'm working on the RISC-V port, and I find the tests extremely
brittle, with tests that should be independent getting "broken" (not
really broken - the codegen just changed a little) all the time.

You wrote up some great examples. That's the kind of trouble I hope we
can avoid in the future.

Yes, there are scripts to automatically regenerate tests - hopefully
after making sure that the changes to the output are not in fact bugs
:-). It seems pretty cumbersome to use the output from llvm-lit to
track down what it's actually complaining about, so I'm gravitating to
just running the update script for all tests and then using git diff
to see what it changed. And, as mentioned, the diffs in the tests are
often orders of magnitude bigger than the diffs for the actual code.

Right. The fact that we have to check all of those changes before
committing slows down progress. Maybe "it's no big deal" for an
individual change, but over time lots of cycles are lost examining
changes that are "fine" but need to be verified anyway.

Is this the same for all ISAs? Is it really the best way? Maybe it
calms down once a back-end is more mature.

I don't think it does. As more tests get added, it's more tests to
potentially break and therefore more manual effort to regenerate them
and verify correctness. Code generation will keep changing because
compilers are never "done." And thank goodness for that! :-)

                          -David