[PATCH] Add a Scalarize pass

Hi Richard,

Thanks for working on this. We should probably move this discussion to llvm-dev because it is not strictly related to the patch review anymore.

The code below is not representative of general C/C++ code. Usually only domain-specific languages (such as OpenCL) contain vector instructions. The LLVM pass manager configuration (pass manager builder) is designed for C/C++ compilers, not for DSLs. People who use LLVM for other compilation flows (such as GPU compilers or other languages) create their own optimization pipeline. I am in favor of adding the scalarizer pass so that people who build LLVM-based JITs and compilers could use it. However, I am against adding this pass by default to the pass manager builder.

I understand that there are cases where scalarizing early in the pipeline is better, but I don't think that it's worth the added complexity. Every target has a different set of quirks, and we try very hard to avoid adding target-specific passes at the IR level. SelectionDAG is not going away soon, and the SD replacement will also have a scalarizing pass, so the overall architecture is not going to change. There are always optimization phase-ordering problems in the compiler, and at the end of the day we need to come up with an optimization pipeline that works for most programs that we care about. I still think that scalarizing in SD is a reasonable solution for C/C++.

Thanks,
Nadav

Nadav Rotem <nrotem@apple.com> writes:

Hi Richard,

Thanks for working on this. We should probably move this discussion to
llvm-dev because it is not strictly related to the patch review
anymore.

OK, I removed phabricator and llvm-commits.

The code below is not representative of general C/C++
code. Usually only domain-specific languages (such as OpenCL) contain
vector instructions. The LLVM pass manager configuration (pass manager
builder) is designed for C/C++ compilers, not for DSLs. People who use
LLVM for other compilation flows (such as GPU compilers or other
languages) create their own optimization pipeline. I am in favor of adding
the scalarizer pass so that people who build LLVM-based JITs and
compilers could use it. However, I am against adding this pass by
default to the pass manager builder.

I understand that there are cases
where scalarizing early in the pipeline is better, but I don't think
that it's worth the added complexity. Every target has a different set of
quirks, and we try very hard to avoid adding target-specific passes at
the IR level. SelectionDAG is not going away soon, and the SD replacement
will also have a scalarizing pass, so the overall architecture is not
going to change. There are always optimization phase-ordering problems
in the compiler, and at the end of the day we need to come up with an
optimization pipeline that works for most programs that we care about. I
still think that scalarizing in SD is a reasonable solution for C/C++.

I don't understand the basis for the last statement though. Do you mean
that you think most cases produce better code if scalarised at the SD stage
rather than at the IR level? Could you give an example?

If the idea is to have a clean separation of concerns between the front end
and LLVM, then it seems like there are two obvious approaches:

(a) make it the front end's responsibility to only generate vector widths
    that the target can handle. There should then be no need for vector
    type legalisation (as opposed to operation legalisation).

(b) make LLVM handle vectors of all widths, which is the current situation.

If we stick with (b) then I think LLVM should try to handle those vectors
as efficiently as possible. The argument instead seems to be for:

(c) have code of last resort to handle vectors of all widths, but do not
    try to optimise the resulting scalar operations as much as code that
    was scalar to begin with. If the front end is generating vector
    widths for which the target has no native support, and if the front end
    cares about the performance of that vector code, it should explicitly
    run the Scalarizer pass itself.

    AIUI, it would also be the front end's responsibility to identify
    which targets have support for which vector widths and which would
    instead use scalarisation.

That seems to be a less clean interface. E.g. as things stand today,
llvmpipe is able to do everything it needs to do with generic IR.
Porting it to a new target is a trivial change of a few lines[*].
This seems like a good endorsement of the current interface. But the
interface would be less clean if llvmpipe (and any other front end
that cares) has to duplicate target knowledge that LLVM already has.

[*] There are optimisations to use intrinsics for certain targets,
     but they aren't needed for correctness. Some of them might not
     be needed at all with recent versions of LLVM.

The C example I gave was deliberately small and artificial to show the point.
But you can go quite a long way with the generic vector extensions to C and
C++, just like llvmpipe can use generic IR to do everything it needs to do.

I think your point is that we should never run the Scalarizer pass
for clang, so it shouldn't be added by the pass manager. But regardless
of whether the example code is typical, it seems reasonable to expect
"foo * 4" to be implemented as a shift. To have it implemented as a
multiplication even at -O3 seems like a deficiency.
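
To make that concrete, here is a small illustration of the kind of generic
vector-extension code I have in mind (a made-up example, not the one from
the review):

  typedef int v4si __attribute__((vector_size(16)));

  v4si scale(v4si foo)
  {
    /* A toy stand-in for the "foo * 4" case above: scalarising this
       early gives InstCombine a chance to turn each lane's multiply
       into a shift before codegen sees it. */
    return foo * 4;
  }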

Even if you think it's unusual for C and C++ to have pre-vectorised code,
I think it's even more unusual for all vector input code to be cold.
So if we have vector code as input, I think we should try to optimise
it as best we can, whether it comes from C, C++, or a domain-specific
front end.

As I said in the phabricator comments, the vectorisation passes convert
scalar code to vector code based on target-specific knowledge. I don't
see why it's a bad thing to also convert vector code to scalar code
based on target-specific knowledge.

Thanks,
Richard

Hi Richard,

Thanks for working on this. Comments below.

I don’t understand the basis for the last statement though. Do you mean
that you think most cases produce better code if scalarised at the SD stage
rather than at the IR level? Could you give an example?

You presented an example that shows that scalarizing vectors allows further optimizations. But I don't think that this example represents the kind of problems that we run into in general C++ code. We currently consider vector legalization a codegen problem. LLVM is designed this way to handle certain kinds of programs.

Other users of LLVM (such as OpenCL JITs) do scalarize early in the optimization pipeline because the problem domain presents lots of vectors that need to be legalized. I am very supportive of adding the new scalarization pass, but I don't want you to add it to the PassManagerBuilder, because the PMB is designed for static C compilers, which don't have this problem. Are you interested in improving code generation for C++ programs or for programs from another domain?

Thanks,
Nadav

Nadav Rotem <nrotem@apple.com> writes:

I don't understand the basis for the last statement though. Do you mean
that you think most cases produce better code if scalarised at the SD stage
rather than at the IR level? Could you give an example?

You presented an example that shows that scalarizing vectors allows
further optimizations. But I don't think that this example represents
the kind of problems that we run into in general C++ code. We currently
consider vector legalization a codegen problem. LLVM is designed this
way to handle certain kinds of programs.

Right. But the reason I wrote the pass in the first place was because
treating it as a codegen problem wasn't producing good results. Very
little of LLVM gets to see the scalar version in a form that it still
understands at the operational level.

Other users of LLVM (such as OpenCL JITs) do scalarize early in the
optimization pipeline because the problem domain presents lots of
vectors that need to be legalized. I am very supportive of adding
the new scalarization pass, but I don't want you to add it to the
PassManagerBuilder, because the PMB is designed for static C compilers,
which don't have this problem.

But why do you think static C compilers don't have this problem?
The vector extensions to C were added for a reason :-). And you can
go quite a long way with generic vector operations.

Are you interested in improving code generation for C++ programs or
for programs from another domain?

Both. Or more precisely: I want vector input to be optimised wherever
it comes from.

Are you worried that adding it to PMB will increase compile time?
The pass exits very early for any target that doesn't opt-in to doing
scalarisation at the IR level, without even looking at the function.

Thanks,
Richard

Richard Sandiford <rsandifo@linux.vnet.ibm.com> writes:

Are you worried that adding it to PMB will increase compile time?
The pass exits very early for any target that doesn't opt-in to doing
scalarisation at the IR level, without even looking at the function.

As an alternative, adding Scalarizer and InstCombine passes to
SystemZPassConfig::addIRPasses() would probably give me most of the
benefit without affecting the PMB. Scalarizer itself would then not
test TargetTransformInfo at all, at least in the initial version,
and the scalarisation would still logically be done by codegen.
Would that be OK?
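
For concreteness, I'm imagining something roughly like the following in
SystemZTargetMachine.cpp (an untested sketch; the exact hook and the
required #includes may differ slightly):

  // Sketch only: run Scalarizer followed by InstCombine before the
  // IR passes that codegen already schedules.
  void SystemZPassConfig::addIRPasses() {
    if (getOptLevel() != CodeGenOpt::None) {
      addPass(createScalarizerPass());
      addPass(createInstructionCombiningPass());
    }
    TargetPassConfig::addIRPasses();
  }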

Thanks,
Richard

I actually prefer that the Scalarizer not touch TTI at all, because I view scalarization as a canonicalization phase for DSLs, much like SROA breaks up structs. I am okay with SystemZ adding it to its own compilation flow.

Thanks,
Nadav

Nadav Rotem <nrotem@apple.com> writes:

Richard Sandiford <rsandifo@linux.vnet.ibm.com> writes:

Are you worried that adding it to PMB will increase compile time?
The pass exits very early for any target that doesn't opt-in to doing
scalarisation at the IR level, without even looking at the function.

As an alternative, adding Scalarizer and InstCombine passes to
SystemZPassConfig::addIRPasses() would probably give me most of the
benefit without affecting the PMB. Scalarizer itself would then not
test TargetTransformInfo at all, at least in the initial version,
and the scalarisation would still logically be done by codegen.
Would that be OK?

I actually prefer that the Scalarizer not touch TTI at all, because
I view scalarization as a canonicalization phase for DSLs, much like
SROA breaks up structs.

That's what Pekka is thinking of using it for, but it wasn't the reason
I wrote it. The original motivation was llvmpipe, which is a rasteriser
rather than a DSL compiler. The motivation wasn't to canonicalise;
it was to do the same thing that codegen currently does, but in a better
place from an optimisation perspective.

You said in an earlier message:

  Other users of LLVM (such as OpenCL JITs) do scalarize early in the
  optimization pipeline because the problem-domain presents lots of
  vectors that needs to be legalized.

But:

(a) Scalarising and revectorising only makes sense if the vectorisation
    is done with the target in mind. If going from scalar code to vector
    code can depend on the target, why shouldn't the same be true in the
    other direction, for targets without vector support?

(b) The situation you describe isn't the one that applies to llvmpipe.
    In llvmpipe the vectors are nice, known widths that are under the
    driver's own control. We certainly don't want to scalarise and
    revectorise llvmpipe IR on x86_64, or on powerpc with Altivec/VSX.
    The original code is already well vectorised for those targets.
    (And also for ARM NEON I expect.)

    In the llvmpipe case, codegen's type legaliser already makes a good
    decision about what to scalarise and what not to scalarise, without
    any help from llvmpipe. The problem I'm trying to solve is that
    codegen is too late to get the benefit of other IR optimisations.

    So in my case I do not want to _change_ the decision about which
    vectors get scalarised and how. I just want to do it earlier.
    It would be a shame if that meant that llvmpipe had to duplicate
    exactly the decisions that codegen makes wrt scalarisation,
    since codegen can easily make those decisions available through
    TargetTransformInfo.

That's why I thought using TTI in the Scalarizer was a good thing
in principle, at least as an option.

SystemZ is a simple case because there is no vector support. But take MIPS
(which is often a good example when it comes to complicated possibilities :-)).
It has at least four separate vector extensions:

  - <2 x float> support from the MIPS V floating-point extensions,
    carried over to MIPS 32/64.

  - <8 x i8> and <4 x i16> support from the optional MDMX extension,
    now deprecated but used on older chips like the SB-1 and (in a
    modified form) the VR5400.

  - Processor-specific vector extensions for the Loongson range.

  - The new MSA ASE.

That's a lot of possibilities. Maybe the LLVM port will never support
Loongson and MDMX (almost certain for the latter), but the point is that
even if it did support them, the current codegen interface would make the
right decisions about which of the llvmpipe vectors should be scalarised
and how.

If Scalarizer is an all-or-nothing pass then it cannot make as good a
decision for llvmpipe IR, where we don't expect to revectorise the result.
Obviously the current pass is all-or-nothing anyway, but I tried to
structure it so that it would be easy to make per-type decisions in
the future, based on the TargetTransformInfo.
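
Purely as an illustration of what such a per-type decision might look like
(the helper and the exact TTI query are hypothetical, not part of the
current patch):

  // Hypothetical: scalarise only vectors wider than the target's native
  // vector registers; a target with no vector support reports width 0.
  static bool shouldScalarize(const TargetTransformInfo &TTI,
                              VectorType *VecTy) {
    unsigned NativeWidth = TTI.getRegisterBitWidth(/*Vector=*/true);
    if (NativeWidth == 0)
      return true;
    return VecTy->getPrimitiveSizeInBits() > NativeWidth;
  }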

I realise I'm not going to convince you, so I'm going to make the change
anyway. I still think it's the wrong direction though.

Thanks,
Richard

Hi Richard,

The discussion on llvmpipe is irrelevant. llvmpipe has its own pass manager and optimization pipeline; it is not a C compiler.

Nadav

Nadav Rotem <nrotem@apple.com> writes:

The discussion on llvmpipe is irrelevant. llvmpipe has its own pass
manager and optimization pipeline; it is not a C compiler.

Note that this reply was about whether TargetTransformInfo should be
used in Scalarizer, not whether Scalarizer should be in PMB. I was
trying to explain why I thought that not testing TargetTransformInfo in
Scalarizer would make the pass less useful for llvmpipe's optimisation pipeline.

Thanks,
Richard