Is there a pass to break down <4 x float> into scalars

Hi, LLVM community,

I wrote some code by hand in LLVM IR. For simplicity, I wrote it using <4 x float>. Now I find that some of the element stores are useless.

For example, if I store {0.0, 1.0, 2.0, 3.0} to a <4 x float> %a, maybe only %a.xy is live in my program. Our target doesn't have SIMD instructions, which means we have to lower vectors into many scalar instructions. I found that LLVM doesn't have DSE in codegen, right?
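
To make it concrete, here is a rough sketch of the kind of IR I mean (the names and constants are made up, and the syntax is only schematic):

   ; all four lanes are stored, but later code only reads x and y
   store <4 x float> <float 0.0, float 1.0, float 2.0, float 3.0>, <4 x float>* %a
   %v = load <4 x float>, <4 x float>* %a
   %x = extractelement <4 x float> %v, i32 0
   %y = extractelement <4 x float> %v, i32 1
   ; once this is lowered to four scalar stores, the stores of lanes 2 and 3 are dead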

Is there a pass which can break vector operations down into scalars?

thanks,
–lx

Liu Xin <navy.xliu@gmail.com> writes:

Hi, LLVM community,

I wrote some code by hand in LLVM IR. For simplicity, I wrote it using
<4 x float>. Now I find that some of the element stores are useless.

For example, if I store {0.0, 1.0, 2.0, 3.0} to a <4 x float> %a, maybe
only %a.xy is live in my program. Our target doesn't have SIMD
instructions, which means we have to lower vectors into many scalar
instructions. I found that LLVM doesn't have DSE in codegen, right?

Is there a pass which can break vector operations down into scalars?

I wanted the same thing for SystemZ, which doesn't have vectors,
in order to improve the llvmpipe code. FWIW, here's what I have locally.

It is able to decompose loads and stores, but I found in the llvmpipe case
that this made things worse with TBAA, because DAGCombiner::GatherAllAliases
has some fairly strict limits. So I disabled that by default; use
-decompose-vector-load-store to re-enable it.

The main motivation for z was instead to get InstCombine to rewrite
things like scalarised selects.
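
Schematically (this is just a sketch of the idea, not the literal pass output), a vector select is decomposed per lane, which gives InstCombine ordinary scalar selects to work on:

   ; before: one vector select
   %r = select <4 x i1> %c, <4 x float> %a, <4 x float> %b

   ; after decomposition (lane 0 shown; lanes 1-3 are analogous and the
   ; results are reassembled with insertelement)
   %c0 = extractelement <4 x i1> %c, i32 0
   %a0 = extractelement <4 x float> %a, i32 0
   %b0 = extractelement <4 x float> %b, i32 0
   %r0 = select i1 %c0, float %a0, float %b0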

I haven't submitted it yet because it's less of a win than the TBAA
DAGCombiner patch I posted, so I didn't want to distract from that.
It would also need some TargetTransformInfo hooks to decide which
vectors should be decomposed.

Thanks,
Richard

decompose-vectors.diff (32.1 KB)

I wanted the same thing for SystemZ, which doesn't have vectors,
in order to improve the llvmpipe code.

Hi Richard,

This is a nice patch. I was wondering how hard it'd be to do that, and it
seems that you're catching lots of corner cases.

My interest is also in converting odd vectors into scalars, but then
converting them back into CPU vectors, say when going from OpenCL to NEON code.

It would also need some TargetTransformInfo hooks to decide which

vectors should be decomposed.

If I got it right, this may not be necessary, or it may even be harmful.

Say you decide that <4 x i32> vectors should be left alone, so that your
pass only scalarises the others. But when the vectorizer passes over the
code again (to try to use CPU vector instructions), it might not match the
scalarised version with the vector one, and you end up with data movement
between the scalar and vector pipelines, which normally slows down CPUs (at
least in ARM's case). Also, problematic cases like <5 x i32> might be
better split into 3+2 pairs rather than 4+1.

If you scalarise everything, then the vectorizers will have a better chance
of spotting patterns and vectorising the whole lot, based on target
transform info.

Is that what you had in mind?

cheers,
--renato

Renato Golin <renato.golin@linaro.org> writes:

On 25 October 2013 11:06, Richard Sandiford <rsandifo@linux.vnet.ibm.com> wrote:

It would also need some TargetTransformInfo hooks to decide which
vectors should be decomposed.

If I got it right, this may not be necessary, or it may even be harmful.

Say you decide that <4 x i32> vectors should be left alone, so that your
pass only scalarises the others. But when the vectorizer passes over the
code again (to try to use CPU vector instructions), it might not match the
scalarised version with the vector one, and you end up with data movement
between the scalar and vector pipelines, which normally slows down CPUs (at
least in ARM's case). Also, problematic cases like <5 x i32> might be
better split into 3+2 pairs rather than 4+1.

If you scalarise everything, then the vectorizers will have a better chance
of spotting patterns and vectorising the whole lot, based on target
transform info.

Is that what you had in mind?

To be honest I hadn't really thought about targets with vector units
at all. :-) I was just assuming that we'd want to keep vector operations
together if there's native support. E.g. ISTR comments about not wanting
to rewrite vec_selects because it can be hard to synthesise optimal
sequences from a single canonical form. But I might have got that wrong.
Also, llvmpipe uses intrinsics for some things, so it might be strange
if we decompose IR operations but leave the intrinsics alone.

I'd half wondered whether, as an extension, the pass should split wide
vectors into supported widths. I hadn't thought about the possibility of
decomposing everything and then reassembling it, though. I can see how
that would cope with more cases, like you say.

Thanks,
Richard

Yes, this is a problem with OpenCL as well. I wonder if it'd be useful to
have compiler-implemented, inlined library code for each supported
intrinsic, so that you can lower them to code as well.

For targets that don't support the intrinsics, well, the IR wouldn't
compile, so being slow is better than not working at all, but I wonder how
much worse it would be to do that.

cheers,
--renato

Hi, Richard,

I think we are solving the same problem. I am working on a shader language too. I am not satisfied with the current binaries because the vector operations are kept intact through llvm opt.

The GLSL shader language has an operation called "swizzle", which selects sub-components of a vector. If a shader only uses the "xy" components of a vec4, it's certainly wasteful to generate four operations on a scalar processor.
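
For example, a ".xy" swizzle roughly corresponds to IR like this (a made-up sketch, not real shader compiler output):

   %v  = fadd <4 x float> %p, %q
   %xy = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 0, i32 1>
   ; only lanes 0 and 1 of %v are used; on a scalar processor the work
   ; done for lanes 2 and 3 is wasted unless something eliminates it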

I think a good solution for LLVM would be in codegen. Many compilers have a codegen optimizer; a DSE there would be good enough.

Which patch about TBAA did you post? Do you have yet another solution besides decompose-vectors?

thanks,

–lx

Hi,

Great to see someone working on this. This will benefit the performance
portability goal of pocl's OpenCL kernel compiler. It has been one of
the low-hanging fruits for improving the applicability of its implicit
work-group (WG) vectorization.

The use case there is that sometimes it makes sense to devectorize the
explicitly vectorized code of OpenCL kernels (code that uses the vector
data types) in order to create better opportunities for the "horizontal"
vectorization across the work-items inside a work-group.

E.g., the last time I checked, the inner loop vectorizer (which pocl
exploits) simply refused to vectorize loops containing vector
instructions. It might not be as drastic with the SLP or BB vectorizer,
but in general it might make sense to let the vectorizer make the
decisions on how best to map the parallel (scalar) operations to the
vector hardware, and just help it with the parallelism knowledge
propagated from the parallel program. One can then fall back to the
original (hand-vectorized) code in case the autovectorization fails, to
still get some vector hardware utilization.

Liu Xin <navy.xliu@gmail.com> writes:

I think we are solving the same problem. I am working on a shader
language too. I am not satisfied with the current binaries because the
vector operations are kept intact through llvm opt.

The GLSL shader language has an operation called "swizzle", which
selects sub-components of a vector. If a shader only uses the "xy"
components of a vec4, it's certainly wasteful to generate four
operations on a scalar processor.

I think a good solution for LLVM would be in codegen. Many compilers
have a codegen optimizer; a DSE there would be good enough.

Which patch about TBAA did you post? Do you have yet another solution
besides decompose-vectors?

Ah, no, the TBAA thing is separate really. llvmpipe generally operates
on 4 rows at a time, so some functions end up with patterns like:

   load <16 x i8> row0 ...
   load <16 x i8> row1 ...
   load <16 x i8> row2 ...
   load <16 x i8> row3 ...
   ... do stuff ...
   store <16 x i8> row0 ...
   store <16 x i8> row1 ...
   store <16 x i8> row2 ...
   store <16 x i8> row3 ...

Since the row stride is variable, llvm doesn't have enough information
to tell that these rows don't alias, so it has to keep the loads and
stores in order. And z only has 16 general registers, so a naively-
scalarised <16 x i8> operation rapidly runs out of them. With unmodified
llvmpipe IR we get lots of spills.

Since z also has x86-like register-memory operations, a few spills are
usually OK. But in this case we have to load i8s and immediately
spill them.

So the idea was to add TBAA information to the llvmpipe IR to say that
the rows don't alias. (At the moment I'm only doing that by hand on
saved IR, I've not done it in llvmpipe itself yet.) Combined with
-combiner-alias-analysis -combiner-global-alias-analysis, this allows
the loads and stores to be reordered, which gives much better code.
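
As a rough sketch (the node names here are made up and the exact metadata syntax depends on the LLVM version), the idea is to give each row its own TBAA leaf node under a common root:

   %r0 = load <16 x i8>, <16 x i8>* %row0.ptr, align 1, !tbaa !1
   %r1 = load <16 x i8>, <16 x i8>* %row1.ptr, align 1, !tbaa !2
   ; ... do stuff ...
   store <16 x i8> %v0, <16 x i8>* %row0.ptr, align 1, !tbaa !1
   store <16 x i8> %v1, <16 x i8>* %row1.ptr, align 1, !tbaa !2

   ; module-level metadata: one root, one distinct leaf per row
   !0 = !{!"llvmpipe rows"}
   !1 = !{!"row0", !0}
   !2 = !{!"row1", !0}

Because "row0" and "row1" are siblings rather than ancestors of each other, AA can report no-alias for accesses tagged with different rows.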

However, the problem at the moment is that there are other scalar loads
that get rewritten by DAGCombiner and the legalisation code, and in the
process lose their TBAA info. This then interferes with the optimisation
above. So I wanted to make sure that the TBAA information is kept around:

  http://llvm-reviews.chandlerc.com/D1894

It was just that if I had a choice of only getting one of the two patches in,
it'd definitely be the D1894 one. It sounds like there's more interest in
the DecomposeVectors patch than I'd expected though, so I'll get back to it.

Maybe as a first cut we can have a TargetTransformInfo hook to enable or
disable the pass wholesale, with a command-line option to override it.

Thanks to you and Renato for the feedback.

Richard

Pekka Jääskeläinen <pekka.jaaskelainen@tut.fi> writes:

E.g., the last time I checked, the inner loop vectorizer (which pocl
exploits) simply refused to vectorize loops containing vector
instructions. It might not be as drastic with the SLP or BB vectorizer,
but in general it might make sense to let the vectorizer make the
decisions on how best to map the parallel (scalar) operations to the
vector hardware, and just help it with the parallelism knowledge
propagated from the parallel program. One can then fall back to the
original (hand-vectorized) code in case the autovectorization fails, to
still get some vector hardware utilization.

Sounds like a nice compromise if it could be made to work. Would it be
LLVM that reverts to the autovectorised version, or pocl?

To be honest I hadn't really thought about targets with vector units
at all. :-) I was just assuming that we'd want to keep vector operations
together if there's native support. E.g. ISTR comments about not wanting
to rewrite vec_selects because it can be hard to synthesise optimal
sequences from a single canonical form. But I might have got that wrong.
Also, llvmpipe uses intrinsics for some things, so it might be strange
if we decompose IR operations but leave the intrinsics alone.

The issue of intrinsics and vectorization was discussed some time ago.
There it might be better to devectorize to a scalar version of the
intrinsics (if available), as at least the loop vectorizer can also
vectorize a set of selected intrinsics, and the target might have direct
machine instructions for those (which could not be exploited easily from
"inlined" versions).

Yeah, I vaguely remember some objections to handling target-specific
intrinsics at the IR level, which I heard is what put others off doing
the pass. In my case life is much simpler: there are no intrinsics
and there's no native vector support. So in some ways I've only done
the easy bit. I'm just hoping it's also the less controversial bit.

Do the OpenCL loops that don't get vectorised (because they already
have some vector ops) also contain vector intrinsics, or is it usually
generic vector IR? Would a pass that just scalarises the generic
operations but keeps intrinsics as-is be any use to you, or would the
intrinsics really need to be handled too?

Thanks for the feedback.

Richard

Sounds like a nice compromise if it could be made to work. Would it be
LLVM that reverts to the autovectorised version, or pocl?

In my opinion LLVM, because this benefits not only the OpenCL WG
autovectorization of pocl, but any code that uses explicit vector
instructions and might be more efficiently autovectorized if those were
devectorized first, e.g. C code that uses the vector data types via
Clang's attributes.

Yeah, I vaguely remember some objections to handling target-specific
intrinsics at the IR level, which I heard is what put others off doing
the pass. In my case life is much simpler: there are no intrinsics
and there's no native vector support. So in some ways I've only done
the easy bit. I'm just hoping it's also the less controversial bit.

One solution is to try to scalarize the intrinsic calls too (if one
knows of the matching scalar ones), and if that fails, keep them intact
(which potentially leads to additional unpack/pack etc. overheads if the
autovectorization of them fails).
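
Schematically, for an intrinsic that has both vector and scalar versions (llvm.sqrt is just a convenient example here), the scalarization would look something like:

   ; before
   %r = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %v)

   ; after (lane 0 shown; the other lanes are handled likewise and the
   ; results reassembled with insertelement)
   %v0 = extractelement <4 x float> %v, i32 0
   %r0 = call float @llvm.sqrt.f32(float %v0)

If the vectorizer later finds a profitable vector form, it can turn the scalar llvm.sqrt calls back into the vector intrinsic on targets that have a matching instruction.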

Do the OpenCL loops that don't get vectorised (because they already
have some vector ops) also contain vector intrinsics, or is it usually
generic vector IR? Would a pass that just scalarises the generic
operations but keeps intrinsics as-is be any use to you, or would the
intrinsics really need to be handled too?

Yes to both. It is useful without the intrinsics support, but the
above handling might improve the results for some kernels.

OpenCL builtins (math functions) have vector versions, so they are
called if one uses the vector data types. Thus one sometimes ends up
having vector intrinsics in the bitcode.

In case I wasn't clear, there are two dimensions along which one can
autovectorize OpenCL C kernels: inside the SPMD kernel description
(a single work-item) itself, or across the "implicit parallel loops"
over all work-items in the work-group. I was referring to the latter,
as that is where the massive data parallelism, and thus the more
scalable vectorization opportunities, usually are.

This sounds like a use case for parallelism metadata for
unrolled parallel code (which a "parallel program AA"
would exploit). I proposed a format for it some time ago here,
but haven't had time to take it further.

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-March/060270.html

Hi, Richard,

Your decompose-vectors patch works perfectly on my side. Unfortunately, I still get stupid code because LLVM's '-dse' fails when run after 'decompose-vectors'.
I read the DSE code and it is definitely capable of eliminating unused memory stores if its AA works. I don't think basic AA works for me; my program has complex memory accesses, such as two-dimensional arrays.

Sorry, I am not good at AA. My understanding was that TBAA is just for C++. Do you mean that you can make use of TBAA to help DSE?
Why is TBAA a total null for my program? basicaa does even better than -tbaa.

liuxin@rd58:~/testbed$ opt -tbaa -aa-eval -decompose-vectors -mem2reg -dse test.bc -debug-pass=Structure -o test.opt.bc -stats
Pass Arguments: -targetlibinfo -no-aa -tbaa -aa-eval -decompose-vectors -domtree -mem2reg -memdep -dse -preverify -verify
Target Library Information
No Alias Analysis (always returns 'may' alias)
Type-Based Alias Analysis
  ModulePass Manager
    FunctionPass Manager
      Exhaustive Alias Analysis Precision Evaluator
      Decompose vector operations into smaller pieces
      Dominator Tree Construction
      Promote Memory to Register
      Memory Dependence Analysis
      Dead Store Elimination
      Preliminary module verification
      Module Verifier
    Bitcode Writer
===== Alias Analysis Evaluator Report =====
  1176 Total Alias Queries Performed
  0 no alias responses (0.0%)
  1176 may alias responses (100.0%)
  0 partial alias responses (0.0%)
  0 must alias responses (0.0%)
  Alias Analysis Evaluator Pointer Alias Summary: 0%/100%/0%/0%
  49 Total ModRef Queries Performed
  0 no mod/ref responses (0.0%)
  0 mod responses (0.0%)
  0 ref responses (0.0%)
  49 mod & ref responses (100.0%)
  Alias Analysis Evaluator Mod/Ref Summary: 0%/0%/0%/100%

Liu Xin <navy.xliu@gmail.com> writes:

Your decompose-vectors patch works perfectly on my side. Unfortunately, I
still get stupid code because LLVM's '-dse' fails when run after
'decompose-vectors'.
I read the DSE code and it is definitely capable of eliminating unused
memory stores if its AA works. I don't think basic AA works for me; my
program has complex memory accesses, such as two-dimensional arrays.

Sorry, I am not good at AA. My understanding was that TBAA is just for C++.

Well, as I understand it, what's called TBAA in llvm is mostly an alias
set hierarchy. You can use the same infrastructure for any situation in
which you know that two accesses can't overlap, even if that doesn't
really map to "type"s in the language sense. So it's more than just C++
(or other languages, like LangRef.rst says).

In my case, I'm using TBAA for IR generated by llvmpipe. The information
I'm adding isn't really related to the C types in llvmpipe (or gallium/mesa
generally). It just says that two accesses can't overlap because they
refer to different arrays, or different rows/slices of the same array.

Do you mean that you can make use of TBAA to help DSE?

The main reason I wanted TBAA is to help scheduling. None of the
accesses are dead, but I want to be able to interleave them to reduce
register pressure.

Why is TBAA a total null for my program? basicaa does even better than -tbaa.

liuxin@rd58:~/testbed$ opt -tbaa -aa-eval -decompose-vectors -mem2reg -dse
test.bc -debug-pass=Structure -o test.opt.bc -stats
Pass Arguments: -targetlibinfo -no-aa -tbaa -aa-eval -decompose-vectors
-domtree -mem2reg -memdep -dse -preverify -verify
Target Library Information
No Alias Analysis (always returns 'may' alias)
Type-Based Alias Analysis
  ModulePass Manager
    FunctionPass Manager
      Exhaustive Alias Analysis Precision Evaluator
      Decompose vector operations into smaller pieces
      Dominator Tree Construction
      Promote Memory to Register
      Memory Dependence Analysis
      Dead Store Elimination
      Preliminary module verification
      Module Verifier
    Bitcode Writer
===== Alias Analysis Evaluator Report =====
  1176 Total Alias Queries Performed
  0 no alias responses (0.0%)
  1176 may alias responses (100.0%)
  0 partial alias responses (0.0%)
  0 must alias responses (0.0%)
  Alias Analysis Evaluator Pointer Alias Summary: 0%/100%/0%/0%
  49 Total ModRef Queries Performed
  0 no mod/ref responses (0.0%)
  0 mod responses (0.0%)
  0 ref responses (0.0%)
  49 mod & ref responses (100.0%)
  Alias Analysis Evaluator Mod/Ref Summary: 0%/0%/0%/100%

Our C/C++ compiler uses Steensgaard's points-to algorithm, so I went looking
for -steens-aa. It seems that LLVM's poolalloc project implements steens-aa,
right? Is it still maintained?
I found I cannot build rDSA with the latest LLVM headers.

Sorry, I don't really know this part of llvm, so I'm not sure what to suggest.
Hopefully someone else will comment.

Thanks,
Richard

Richard,

Thank you. Building up a points-to analysis is non-trivial; I will investigate further along these lines. Thanks for the suggestion!

–lx