[RFC][RISCV] Selection of complex codegen patterns into RISCV bit manipulation instructions

Hi all,

I'm currently working on the LLVM implementation of the RISCV Bit
Manipulation ISA extension described by Clifford Wolf in the following
presentation:

https://content.riscv.org/wp-content/uploads/2019/06/17.10-b_wolf.pdf

and the following document:

https://github.com/riscv/riscv-bitmanip/blob/master/bitmanip-0.90.pdf

The aim is to provide intrinsic functions that let the user write
code that is better optimized for bit manipulation on RISCV
targets, but also to provide automatic optimization by lowering simple
code patterns into optimized bit manipulation assembly,

    %neg = xor i32 %1, -1                    ----->      andn t0, t0, t1
    %and = and i32 %0, %neg

just in case the user wants such optimization but is not aware of all
the bits that can be optimized.
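For reference, the IR pattern above typically originates from plain C such as the following sketch (the function name is made up for illustration); a BitManip-aware backend could select a single `andn` for the whole expression:

```c
#include <stdint.h>

/* Hypothetical example: `x & ~mask` is exactly the
   and(x, xor(mask, -1)) pattern shown above. */
uint32_t clear_masked_bits(uint32_t x, uint32_t mask) {
    return x & ~mask;
}
```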

I'm dealing with the fact that it is pretty hard to select some patterns
of DAG nodes in order to replace them with an equivalent, optimal
machine instruction.

Take for instance the count leading zeros operation:

    uint32_t clz (uint32_t x)
    {
        for (int count = 0; count < 32; count++) {
            if ((x << count) < 0)
                return count;
        }
        return 32;
    }

It requires a loop, which makes it difficult to lower because the
pattern spans several basic blocks, and different optimizations can
easily break the pattern recognition.

What I'm wondering is: is there already any place in LLVM where complex
patterns like this are lowered into single instructions (e.g. clz, which
is used quite often)? Maybe at a higher level?

Another point of view that has been suggested to me and that I'd like to
discuss is: does it make sense to implement such lowering for operations
that a user normally wouldn't implement from scratch when an intrinsic
function is already available for them?

Many thanks.

Paolo Savini

Hi Paolo,

> Take for instance the count leading zeros operation:

The example implementation has a couple of problems (no uint32_t will
be negative, and any shift you'd think might turn a positive number
into a negative one is undefined behaviour).
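One way to keep the loop's intent while avoiding both issues is to test the top bit with an explicit mask instead of a signed comparison. This is just a sketch (function name made up), not necessarily a form the idiom recognizer matches:

```c
#include <stdint.h>

uint32_t clz_fixed(uint32_t x) {
    for (int count = 0; count < 32; count++) {
        /* bit (31 - count) of x set => exactly `count` leading zeros */
        if ((x << count) & UINT32_C(0x80000000))
            return count;
    }
    return 32;
}
```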

But there is some code in lib/Transforms/Scalar/LoopIdiomRecognize.cpp
designed to spot loops that really are calculating ctlz etc. and
replace them with the proper intrinsic call. The tests give
some examples of the kind of thing it can see:
https://github.com/llvm/llvm-project/blob/master/llvm/test/Transforms/LoopIdiom/X86/ctlz.ll
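For illustration, a shift-until-zero loop of roughly this shape is the kind of thing the pass can rewrite as a ctlz call (the exact forms it matches vary by LLVM version; see the test file above). This sketch assumes a 32-bit `unsigned`:

```c
/* Count leading zeros by shifting right until the value is exhausted.
   Loops of this shape can be recognized and replaced with @llvm.ctlz. */
unsigned ctlz_loop(unsigned x) {
    unsigned zeros = 32;
    while (x) {
        x >>= 1;
        zeros--;
    }
    return zeros;
}
```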

> Another point of view that has been suggested to me and that I'd like to
> discuss is: does it make sense to implement such lowering for operations
> that a user normally wouldn't implement from scratch when an intrinsic
> function is already available for them?

Probably not during CodeGen, since that mostly works at the basic block
level, but in general yes (hence LoopIdiomRecognize).

Cheers.

Tim.

> Hi all,

Hi.

> I'm currently working on the LLVM implementation of the RISCV Bit
> Manipulation ISA extension described by Clifford Wolf in the following
> presentation:
>
> https://content.riscv.org/wp-content/uploads/2019/06/17.10-b_wolf.pdf
>
> and the following document:
>
> https://github.com/riscv/riscv-bitmanip/blob/master/bitmanip-0.90.pdf

Nice!

> The aim is to provide intrinsic functions that let the user write
> code that is better optimized for bit manipulation on RISCV
> targets, but also to provide automatic optimization by lowering simple
> code patterns into optimized bit manipulation assembly,
>
>     %neg = xor i32 %1, -1 -----> andn t0, t0, t1
>     %and = and i32 %0, %neg
>
> just in case the user wants such optimization but is not aware of all
> the bits that can be optimized.
>
> I'm dealing with the fact that it is pretty hard to select some patterns
> of DAG nodes in order to replace them with an equivalent, optimal
> machine instruction.
>
> Take for instance the count leading zeros operation:
>
>     uint32_t clz (uint32_t x)
>     {
>         for (int count = 0; count < 32; count++) {
>             if ((x << count) < 0)
>                 return count;
>         }
>         return 32;
>     }
>
> It requires a loop, which makes it difficult to lower because the
> pattern spans several basic blocks, and different optimizations can
> easily break the pattern recognition.

You only want to lower LLVM IR @llvm.cttz intrinsic in *that* case.

> What I'm wondering is: is there already any place in LLVM where complex
> patterns like this are lowered into single instructions (e.g. clz, which
> is used quite often)? Maybe at a higher level?

That depends.
If there's an LLVM intrinsic for it, then any normal optimization pass could do it.
In cttz's case it's mainly done in the LoopIdiom pass.
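When the intrinsic does exist, the frontend builtin is the direct route to it; for example, in Clang/GCC C, `__builtin_ctz` maps straight onto the cttz intrinsic in IR, sidestepping idiom recognition entirely. A small sketch (the wrapper name is made up; note `__builtin_ctz(0)` is undefined, so it is guarded explicitly):

```c
/* Lowers to the cttz intrinsic via the compiler builtin. */
unsigned trailing_zeros(unsigned x) {
    return x ? (unsigned)__builtin_ctz(x) : 32u;
}
```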

> Another point of view that has been suggested to me and that I'd like to
> discuss is: does it make sense to implement such lowering for operations
> that a user normally wouldn't implement from scratch when an intrinsic
> function is already available for them?

Again, I'd say this is too broad a question.
If there is an LLVM IR intrinsic, then you only need to lower it,
and optionally ensure that middle-end passes form it from the appropriate IR.

If there isn't one, then yes, you'd want to match all the beautiful wilderness
of the possible patterns that combine into that instruction.

While it's really tempting to just add an IR intrinsic for everything,
please do note that a new intrinsic is completely opaque to the rest of LLVM.
It does not magically get peephole folds, so those would need to be added,
especially if you intend to form said intrinsic within the middle end from IR.

This may change some day when these peephole folds are auto-inferred,
but that is not so nowadays. Really looking forward to that.

> Many thanks.
>
> Paolo Savini

Roman.

Hi Roman,

> That depends.
> If there's an LLVM intrinsic for it, then any normal optimization pass could do it.
> In cttz's case it's mainly done in the LoopIdiom pass.

Oh yes. Thank you!

Unfortunately several of the instructions of the bit manipulation
extension don't seem to have an intrinsic already in LLVM.

That will require adding some passes to the middle end.

> Again, I'd say this is too broad a question.
> If there is an LLVM IR intrinsic, then you only need to lower it,
> and optionally ensure that middle-end passes form it from the appropriate IR.
>
> If there isn't one, then yes, you'd want to match all the beautiful wilderness
> of the possible patterns that combine into that instruction.
>
> While it's really tempting to just add an IR intrinsic for everything,
> please do note that a new intrinsic is completely opaque to the rest of LLVM.
> It does not magically get peephole folds, so those would need to be added,
> especially if you intend to form said intrinsic within the middle end from IR.
>
> This may change some day when these peephole folds are auto-inferred,
> but that is not so nowadays. Really looking forward to that.

It would definitely be interesting.

Anyway, adding such complex instructions to the middle end seems material
for another patch. Unless things change in the meantime.

For now we can provide lower-level optimization of smaller bit
manipulation patterns.

But I'll definitely look into adding those passes as they would provide
much more optimization.

Many thanks.

Paolo

> Hi Roman,
> > That depends.
> > If there's an LLVM intrinsic for it, then any normal optimization pass could do it.
> > In cttz's case it's mainly done in the LoopIdiom pass.
> Oh yes. Thank you!
>
> Unfortunately several of the instructions of the bit manipulation
> extension don't seem to have an intrinsic already in LLVM.
>
> That will require adding some passes to the middle end.

> > Again, I'd say this is too broad a question.
> > If there is an LLVM IR intrinsic, then you only need to lower it,
> > and optionally ensure that middle-end passes form it from the appropriate IR.
> >
> > If there isn't one, then yes, you'd want to match all the beautiful wilderness
> > of the possible patterns that combine into that instruction.
> >
> > While it's really tempting to just add an IR intrinsic for everything,
> > please do note that a new intrinsic is completely opaque to the rest of LLVM.
> > It does not magically get peephole folds, so those would need to be added,
> > especially if you intend to form said intrinsic within the middle end from IR.
> >
> > This may change some day when these peephole folds are auto-inferred,
> > but that is not so nowadays. Really looking forward to that.
>
> It would definitely be interesting.
>
> Anyway, adding such complex instructions to the middle end seems material
> for another patch. Unless things change in the meantime.
>
> For now we can provide lower-level optimization of smaller bit
> manipulation patterns.
>
> But I'll definitely look into adding those passes as they would provide
> much more optimization.

I'm not sure what you mean by "more passes" in the reply.
If there is no matching instruction/intrinsic, then I'm not sure how a
pass would help.

*Please* do note my comment about adding new instructions/intrinsics.
While it's not an immovable obstacle, it by no means should be treated lightly.
If you want to add a new LLVM IR instruction/intrinsic, with the intention of
actually producing it from other instructions in the middle end (as opposed to
just lowering it from the compiler front end, or not producing it in the
middle end), you must also consider how said new IR instruction/intrinsic will
affect all other optimization passes, and *that* cost *is* high.

E.g. if you add 'andn', you then need to find every fold that would look for
and(not(y), x) or and(x, not(y)) and teach it about 'andn'.
Things will be more fun with more complex patterns :-)
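To make that concrete: both commuted spellings below must end up being recognized as the same `andn` candidate, whether a middle-end fold canonicalizes one into the other or the backend matches both forms. A small sketch with made-up function names:

```c
#include <stdint.h>

/* and(x, not(y)) */
uint32_t andn_a(uint32_t x, uint32_t y) { return x & ~y; }

/* and(not(y), x) -- same operation, operands commuted */
uint32_t andn_b(uint32_t x, uint32_t y) { return ~y & x; }
```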

> Many thanks.
>
> Paolo

Roman

Hi Roman,

following from a similar discussion that started on Phabricator:

https://reviews.llvm.org/D66479

I'd like to elaborate on my answer and change the scope of the
question a bit.

Regardless of the user interface that the bit manipulation patch
provides to the user (e.g. frontend intrinsics, C code...), the C or LLVM
IR implementation of some of the instructions from the bit manipulation
proposal cannot (as far as I know) be pattern-matched directly with
RISCVISD nodes because they span multiple basic blocks.

For this reason we need to implement idiom recognition in the middle end
and, if the pattern matches, emit an LLVM intrinsic, like LLVM does for
ctlz and cttz. Then we would be able to select such instructions.

> > Hi Roman,
> > > That depends.
> > > If there's an LLVM intrinsic for it, then any normal optimization pass could do it.
> > > In cttz's case it's mainly done in the LoopIdiom pass.
> > Oh yes. Thank you!
> >
> > Unfortunately several of the instructions of the bit manipulation
> > extension don't seem to have an intrinsic already in LLVM.
> >
> > That will require adding some passes to the middle end.
> >
> > > Again, I'd say this is too broad a question.
> > > If there is an LLVM IR intrinsic, then you only need to lower it,
> > > and optionally ensure that middle-end passes form it from the appropriate IR.
> > >
> > > If there isn't one, then yes, you'd want to match all the beautiful wilderness
> > > of the possible patterns that combine into that instruction.
> > >
> > > While it's really tempting to just add an IR intrinsic for everything,
> > > please do note that a new intrinsic is completely opaque to the rest of LLVM.
> > > It does not magically get peephole folds, so those would need to be added,
> > > especially if you intend to form said intrinsic within the middle end from IR.
> > >
> > > This may change some day when these peephole folds are auto-inferred,
> > > but that is not so nowadays. Really looking forward to that.
> >
> > It would definitely be interesting.
> >
> > Anyway, adding such complex instructions to the middle end seems material
> > for another patch. Unless things change in the meantime.
> >
> > For now we can provide lower-level optimization of smaller bit
> > manipulation patterns.
> >
> > But I'll definitely look into adding those passes as they would provide
> > much more optimization.
>
> I'm not sure what you mean by "more passes" in the reply.
> If there is no matching instruction/intrinsic, then I'm not sure how a
> pass would help.
>
> *Please* do note my comment about adding new instructions/intrinsics.
> While it's not an immovable obstacle, it by no means should be treated lightly.
> If you want to add a new LLVM IR instruction/intrinsic, with the intention of
> actually producing it from other instructions in the middle end (as opposed to
> just lowering it from the compiler front end, or not producing it in the
> middle end), you must also consider how said new IR instruction/intrinsic will
> affect all other optimization passes, and *that* cost *is* high.

I see. But we might need to do it anyway. With caution. As you already
pointed out earlier, if we add an LLVM intrinsic that is lowered directly
from a front-end intrinsic, it would be impenetrable to middle-end
optimizations. An advantage would be that it would be lowered "safely"
(with no mutual interference from optimization passes) into the
corresponding asm, but that also means that any optimization from LLVM
that could provide even better code (even faster or smaller than the
expected bit manipulation asm) wouldn't be possible.

All that being said, in case a user wants to be sure that some
specific bit manipulation asm instructions are selected without
interference (think for instance of critical programs like C
implementations of block ciphers, for which both performance and security
are crucial), what would you think of providing inline asm behind the
interface functions (e.g. _rv32_andn):

    uint32_t _rv32_andn(uint32_t a, uint32_t b) {
        uint32_t res;
        __asm__ ("andn %0, %1, %2" : "=r"(res) : "r"(a), "r"(b));
        return res;
    }

as opposed to providing a chain of front-end intrinsics that are lowered to
LLVM intrinsics and then asm?

    uint32_t _rv32_andn(uint32_t a, uint32_t b) {
        return __builtin_andn(a, b);
    }

Nothing would change from the user's perspective, but I guess that would
imply a difference in LLVM for ... maintainability?

> E.g. if you add 'andn', you then need to find every fold that would look for
> and(not(y), x) or and(x, not(y)) and teach it about 'andn'.
> Things will be more fun with more complex patterns :-)

> > Many thanks.
> >
> > Paolo
>
> Roman

Paolo