X86 LowerVECTOR_SHUFFLE Question

In ToT, LowerVECTOR_SHUFFLE for x86 has this code:

  if (X86::isUNPCKLMask(SVOp))
    getTargetShuffleNode(getUNPCKLOpcode(VT) dl, VT, V1, V2, DAG);

why would this not be:

  if (X86::isUNPCKLMask(SVOp))
    return SVOp;

I'm trying to add support for VUNPCKL and am getting into trouble
because the existing code ends up creating:

VUNPCKLPS
  load
  load

which is badness come selection time. Legalize doesn't get a chance to
look below the target shuffle node to see that there are two memory
operands.

Back in the 2.7 days, we used to just return the shuffle as is if it was
already legal. Why the change to create a target node? Will there be
problems if I change the code to return the shuffle as-is?

Thanks!

                       -Dave

David Greene <dag@cray.com> writes:

In ToT, LowerVECTOR_SHUFFLE for x86 has this code:

  if (X86::isUNPCKLMask(SVOp))
    getTargetShuffleNode(getUNPCKLOpcode(VT) dl, VT, V1, V2, DAG);

why would this not be:

  if (X86::isUNPCKLMask(SVOp))
    return SVOp;

Ok, I discovered that Bruno did this in revisions 112934, 112942 and
113020 but the logs don't really make clear why. I did figure out that
I needed new SDNode defs for VUNPCKLPSY and VUNPCKLPDY and corresponding
patterns. Once I added them everything started working.

I found this all very confusing because it appears there are now two
ways to match certain shuffle instructions in .td files: one through the
traditional shuffle operators like unpckl and shufp and another through
these special X86* operators.

This is reflected in X86InstrSSE.td:

"Traditional":

    defm VUNPCKLPS: sse12_unpack_interleave<0x14, unpckl, v4f32, memopv4f32,
          VR128, f128mem, "unpcklps\t{$src2, $src1, $dst|$dst, $src1, $src2}",
                         >, VEX_4V;
"New-style":

    def : Pat<(v4f32 (X86Unpcklps VR128:$src1, (memopv4f32 addr:$src2))),
              (VUNPCKLPSrm VR128:$src1, addr:$src2)>, Requires<[HasAVX]>;

I think these are basically the same pattern.

What's the purpose of these special operators and patterns?

                              -Dave

It is inefficient and error-prone to recognize legal shuffles and then have isel repeat the process. For example, if the DAG combiner changes a shuffle in between legalization and isel, it may stop being legal and break isel. By legalizing to target-specific DAG nodes, we avoid that possibility and also make it much easier to match the shuffles during isel.

Bob Wilson <bob.wilson@apple.com> writes:

It is inefficient and error-prone to recognize legal shuffles and then
have isel repeat the process. For example, if the DAG combiner
changes a shuffle in between legalization and isel, it may stop being
legal and break isel. By legalizing to target-specific DAG nodes, we
avoid that possibility and also make it much easier to match the
shuffles during isel.

Why is DAG combiner creating illegal nodes after legalization? It runs
in various modes. If it creates such a node, it's a bug I think.

Doing the matching this way means that we have to duplicate patterns
and/or we lose opportunities for further pattern matching (multiple
shuffles, etc.). Is it really worth the maintenance cost? The isel
matcher has to look over the tree either way. What exactly is the
inefficiency cost?

In the experience I just had, it is quite error-prone to have multiple
tblgen patterns to match these things. The way things were before,
there was a clean separation between checking/enforcing node legality
and doing the final code selection, with isel being automatic through
tblgen. That was nice. The current setup mixes the two and seems to
result in more code in the form of additional tblgen patterns. We also
lose the ability to do shuffle peeps or any other such things unless we
teach the code about every type of special target node.

It really doesn't seem worth it to me.

                                -Dave

In the experience I just had, it is quite error-prone to have multiple
tblgen patterns to match these things. The way things were before,
there was a clean separation between checking/enforcing node legality
and doing the final code selection, with isel being automatic through
tblgen. That was nice. The current setup mixes the two and seems to
result in more code in the form of additional tblgen patterns. We also
lose the ability to do shuffle peeps or any other such things unless we
teach the code about every type of special target node.

It really doesn't seem worth it to me.

In the way it was done before, every shuffle that we tried to match
had to be checked twice (masks used to be checked during legalization
and during isel by the tblgen patterns), this is done only once now
(during legalization). Although we still match the node itself through
tblgen patterns, the tablegen patterns are a lot more clear now, and
we were able to remove lots of confusing code. The code used to be
*very* fragile, during legalization there was no explicit rule or
comments of what and when stuff needed to be matched (and changing a
single rule would generate a code that would fail to match tblgen
patterns later), all the logic was getting really hard to understand.
The long term plan is: generate all target specific nodes during
legalization, and once all logic is clear, we can go for more fancy
stuff like having per-processor family shuffle tables
describing the more profitable shuffles for a specific mask, and so
on. The work stopped because of this bug:
http://llvm.org/bugs/show_bug.cgi?id=8156, but IMHO the implementation
of x86 shuffle matching is a lot more clear now then they used to be
in the past.

Bruno Cardoso Lopes <bruno.cardoso@gmail.com> writes:

It really doesn't seem worth it to me.

In the way it was done before, every shuffle that we tried to match
had to be checked twice (masks used to be checked during legalization
and during isel by the tblgen patterns),

Right.

this is done only once now (during legalization).

Maybe. We still have the old operators like unpck and shup, so couldn't
those still trigger? Shouldn't we remove them if we're using this
TargetNode method?

Is it very expensive to check masks, in the grand scheme of things?

Although we still match the node itself through tblgen patterns, the
tablegen patterns are a lot more clear now, and we were able to remove
lots of confusing code.

That's more to do with the way the patterns grew organically than where
the matching happens.

The code used to be *very* fragile, during legalization there was no
explicit rule or comments of what and when stuff needed to be matched
(and changing a single rule would generate a code that would fail to
match tblgen patterns later),

Can you give an example? When I run into something like that, it's
almost always because a useful pattern is missing.

all the logic was getting really hard to understand.

Oh yes, it's messy stuff all right. But as I'm moving patches from 2.7
to trunk, I'm not noticing much improvement in the legalization code.
Probably it's still a work in progress, yes?

The long term plan is: generate all target specific nodes during
legalization, and once all logic is clear, we can go for more fancy
stuff like having per-processor family shuffle tables describing the
more profitable shuffles for a specific mask, and so on. The work
stopped because of this bug:
http://llvm.org/bugs/show_bug.cgi?id=8156,

Hmm...Again, isn't this due to the matching code being moved from isel
to legalize? It seems like legalize is trying to do too much. It's not
just "legalize" anymore, it's "legalize and optimize."

When I added support for AVX shuffles (which I am upstreaming right
now), I tried to do it in a systematic way. Maybe it will help to
reorganize the other checks in a similar fashion. I haven't really
thought about the bigger picture enough yet.

but IMHO the implementation of x86 shuffle matching is a lot more
clear now then they used to be in the past.

There's certainly been improvement on the TableGen side of things. I
really liked the unpck*, shufp, etc. nodes and the ShuffleVectorSDNode.
That's a huge help. It's too bad we're getting rid of them. But
legalization still looks about the same to me.

Thanks for the explanation.

                              -Dave

Maybe. We still have the old operators like unpck and shup, so couldn't
those still trigger? Shouldn't we remove them if we're using this
TargetNode method?

Is it very expensive to check masks, in the grand scheme of things?

Probably not, in the old scheme the masks could be checked more than
once during legalization and also more than once in the tablegen file
(since the same mask is used in several patterns). But even if it's
not a big improvement in efficiency IMO the "more readable" approach
is still a big win.

Although we still match the node itself through tblgen patterns, the
tablegen patterns are a lot more clear now, and we were able to remove
lots of confusing code.

That's more to do with the way the patterns grew organically than where
the matching happens.

Yes, but we have to start somewhere...

The code used to be *very* fragile, during legalization there was no
explicit rule or comments of what and when stuff needed to be matched
(and changing a single rule would generate a code that would fail to
match tblgen patterns later),

Can you give an example? When I run into something like that, it's
almost always because a useful pattern is missing.

I don't remember off hand, but I keep the memories of often running
into different issues because of how fragile the code used to be.

all the logic was getting really hard to understand.

Oh yes, it's messy stuff all right. But as I'm moving patches from 2.7
to trunk, I'm not noticing much improvement in the legalization code.
Probably it's still a work in progress, yes?

Yes.

The long term plan is: generate all target specific nodes during
legalization, and once all logic is clear, we can go for more fancy
stuff like having per-processor family shuffle tables describing the
more profitable shuffles for a specific mask, and so on. The work
stopped because of this bug:
http://llvm.org/bugs/show_bug.cgi?id=8156,

Hmm...Again, isn't this due to the matching code being moved from isel
to legalize? It seems like legalize is trying to do too much. It's not
just "legalize" anymore, it's "legalize and optimize."

When I added support for AVX shuffles (which I am upstreaming right
now), I tried to do it in a systematic way. Maybe it will help to
reorganize the other checks in a similar fashion. I haven't really
thought about the bigger picture enough yet.

but IMHO the implementation of x86 shuffle matching is a lot more
clear now then they used to be in the past.

There's certainly been improvement on the TableGen side of things. I
really liked the unpck*, shufp, etc. nodes and the ShuffleVectorSDNode.
That's a huge help. It's too bad we're getting rid of them. But
legalization still looks about the same to me.

The idea is to use tablegen again once we have a clean implementation.
It would be good to have all tables and per-processor logic in a
tablegen file, but I think a clean implementation needs to be done
first. Anyway, I don't have time to work in any of this at the moment,
and I'm not also the one with final word on this. I'm happy to help
with
the bugs that may appear and with some design ideas though.

Thanks for the explanation.

You're welcome.

Bruno Cardoso Lopes <bruno.cardoso@gmail.com> writes:

There's certainly been improvement on the TableGen side of things. I
really liked the unpck*, shufp, etc. nodes and the ShuffleVectorSDNode.
That's a huge help. It's too bad we're getting rid of them. But
legalization still looks about the same to me.

The idea is to use tablegen again once we have a clean implementation.

Ah, good. Having a clearer idea of the roadmap helps.

It would be good to have all tables and per-processor logic in a
tablegen file, but I think a clean implementation needs to be done
first.

Do you mean a clean implementation of lowering?

Anyway, I don't have time to work in any of this at the moment, and
I'm not also the one with final word on this. I'm happy to help with
the bugs that may appear and with some design ideas though.

Understandable. Since I'm working in this area now I'll do what I can
to clean things up. Do let me know if I start heading off in the wrong
direction. :slight_smile:

                                 -Dave

Do you mean a clean implementation of lowering?

Yes.

Understandable. Since I'm working in this area now I'll do what I can
to clean things up. Do let me know if I start heading off in the wrong
direction. :slight_smile:

Nice! Thanks David :slight_smile: