[RFC] bitfield access shrinking

In http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
consecutive bitfields are wrapped as a group and represented as a
large integer, and clang emits the loads, stores, and bit operations
appropriate for extracting bits from within it. This fixes the problem
of violating the C++11 memory model that the original widened bitfield
loads/stores were facing. It also brings more coalescing opportunities
across bitfields.

If some bitfields are naturally aligned and their number of bits forms
a legal integer type that the target supports, it is more efficient to
access them directly than to do a large integer load/store plus a
series of bit operations. We call this reverse transformation
legal-type bitfield shrinking. Currently, llvm depends on the
DAGCombiner to do such shrinking.
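
To make the transformation concrete, here is a rough sketch (value and
pointer names are made up for illustration) of what shrinking means for a
naturally aligned 16-bit bitfield occupying bits 0..15 of a 64-bit storage
unit, on a target where i16 is legal. The wide read-modify-write sequence

%bf.load = load i64, i64* %p, align 8
%bf.clear = and i64 %bf.load, -65536       ; clear bits 0..15
%bf.set = or i64 %bf.clear, %bf.value      ; insert the new 16-bit value
store i64 %bf.set, i64* %p, align 8

could be shrunk to a direct narrow store:

%p16 = bitcast i64* %p to i16*
%v16 = trunc i64 %bf.value to i16
store i16 %v16, i16* %p16, align 8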

However, the DAGCombiner only works one basic block at a time, so we
started to implement a new shrinking optimization in
https://reviews.llvm.org/D30416. We initially put it in InstCombine and
then moved it to CodeGenPrepare (CGP) because we want to use some
TargetLowering information.

The initial implementation in D30416 only supports matching the
load-and-or-store pattern, and it uses an instruction-by-instruction
scan as a safety check to make sure there is no other memory write
instruction between the load and the store (no alias query is done).
After getting the initial implementation working, we found more
problems: EarlyCSE, LoadPRE, and even InstCombine itself can do
coalescing before the shrinking (LoadPRE does it the most thoroughly).
The coalescing can move the load many basic blocks earlier, which makes
the simple instruction-by-instruction scan unscalable and often makes
it fail. It also breaks the load-and-or-store pattern. An example is
below:

Before the coalescing done by EarlyCSE/LoadPRE:
%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
store i64 %bf.set, i64* %0, align 8
.....
%bf.load1 = load i64, i64* %0, align 8
%bf.clear1 = and i64 %bf.load1, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
store i64 %bf.set1, i64* %0, align 8
.....
%bf.load2 = load i64, i64* %0, align 8
%bf.clear2 = and i64 %bf.load2, -4294901761
%bf.set2 = or i64 %bf.value2, %bf.clear2
store i64 %bf.set2, i64* %0, align 8

After coalescing, it will become:
%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
.....
%bf.clear1 = and i64 %bf.set, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
.....
%bf.clear2 = and i64 %bf.set1, -4294901761
%bf.set2 = or i64 %bf.value2, %bf.clear2
store i64 %bf.set2, i64* %9, align 8

After load-store coalescing, %bf.load is now far away from the store,
and the previous load-and-or-store pattern disappears.

A simple idea to fix this is to move the shrinking into a very early
pass, before the first run of EarlyCSE. However, if we move the
shrinking earlier, it can block the coalescing of other illegal-type
bitfields, which cannot be shrunk. So whichever of coalescing and
shrinking runs first will block the other.
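
As a rough illustration of the conflict (field layout and names invented
here): suppose a 64-bit unit holds a shrinkable 16-bit field in bits 0..15
and a 5-bit field in bits 16..20, and both are read close together:

%bf.load = load i64, i64* %p, align 8      ; read the 16-bit field
%f1 = trunc i64 %bf.load to i16
.....
%bf.load1 = load i64, i64* %p, align 8     ; read the 5-bit field
%bf.lshr = lshr i64 %bf.load1, 16
%f2 = and i64 %bf.lshr, 31

Normally EarlyCSE/LoadPRE would coalesce the two i64 loads into one. If the
first access has already been shrunk into a load i16, the remaining i64
load can no longer be merged with it, so the illegal-type field ends up
paying for its own memory access.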

After some discussions with Eli and Michael, I came up with another
idea that lets the shrinking stay late in the pipeline. It needs two
changes to the current patch:

1. Extend the pattern matching to handle the store(or(and(or(and(...)))))
pattern above. It needs to analyze the bit mask of every and-or pair.
If the last and-or pair touches a different section than the other
and-or pairs, we can split the original store into two and do the
shrinking for the second store, as shown below:

%bf.load = load i64, i64* %0, align 8
%bf.clear = and i64 %bf.load, -65536
%bf.set = or i64 %bf.value, %bf.clear
.....

%bf.clear1 = and i64 %bf.set, -4294901761
%bf.set1 = or i64 %bf.value1, %bf.clear1
store i64 %bf.set1, i64* %0, align 8 // the first store.
.....

%bf.value2.shrunk = shrink_op %bf.value2 // pseudo-op: extract the bits to store
store i16 %bf.value2.shrunk, i64* %0, align 8 // shrunken store (address suitably adjusted)

2. Use MemorySSA plus alias queries to do the safety check (a small
illustration follows below). Because the dominator tree is not properly
updated in CGP, I have to create a new pass and put it before CGP in
order to use MemorySSA.
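
(To illustrate why the safety check is needed at all - the intervening
instruction names here are invented - shrinking the final store is only
legal if no write in between can touch the same memory:

%bf.load = load i64, i64* %0, align 8
.....
store i32 %other, i32* %q, align 4         ; possible aliasing write: must be ruled out
.....
store i64 %bf.set2, i64* %0, align 8       ; the store we would like to shrink

MemorySSA plus an alias query makes it cheap to check whether any such
clobbering write sits between the load and the store, instead of scanning
instruction by instruction across many blocks.)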

Eli suggested that I ask for more opinions before starting to write
code. I think that is a good idea, hence this post. Comments are
appreciated.

Thanks,
Wei.

We could add intrinsics to extract/insert a bitfield, which would simplify a lot of that bitwise logic.
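
(Purely to illustrate the idea - no such intrinsics exist in LLVM today,
and the names and signatures below are invented - the bitwise sequences
could collapse into something like:

declare i64 @llvm.bitfield.extract.i64(i64, i32, i32)        ; (value, offset, width) - hypothetical
declare i64 @llvm.bitfield.insert.i64(i64, i64, i32, i32)    ; (value, field, offset, width) - hypothetical

%bf.set1 = call i64 @llvm.bitfield.insert.i64(i64 %bf.load, i64 %bf.value1, i32 16, i32 16)

instead of the and/or sequence for a 16-bit field at bit offset 16.)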

-Krzysztof

We could add intrinsics to extract/insert a bitfield, which would simplify a lot of that bitwise logic.

But then you need to teach a bunch of places how to simplify them, how to fold bitwise logic and other things that reduce demanded bits into them, etc. This seems like a difficult tradeoff.

  -Hal

[The original RFC quoted in full; the part relevant to the reply:]

2. Use MemorySSA plus alias queries to do the safety check. Because the
dominator tree is not properly updated in CGP, I have to create a new
pass and put it before CGP in order to use MemorySSA.

This makes sense to me. Should we just fix CGP to update the DT instead of working around it?

  -Hal

Bitfield extraction/insertion generally does not simplify well with surrounding arithmetic. The main generator of the bitwise manipulation is SROA, and if it were the only creator of these intrinsics, that alone would already make things simpler. We'd also need to teach the combiner about them, but that's still fairly localized.
On the other hand, they would make the intent of the code explicit. Targets could handle them any way they want, whether it's loading an i128, or an i16 from an adjusted offset.

-Krzysztof

Yeah.

This seems similar to trying to optimize extract/insert values with real binary operations.

We do it, but historically we only do it by teaching things that such patterns look like things they already know ;)
(i.e., we teach GVN that when it looks like this, it's really an add of these two things)

We could add intrinsics to extract/insert a bitfield, which would
simplify a lot of that bitwise logic.

But then you need to teach a bunch of places how to simplify them, how to
fold bitwise logic and other things that reduce demanded bits into
them, etc. This seems like a difficult tradeoff.

Bitfield extraction/insertion generally does not simplify well with surrounding arithmetic.

Maybe. I've seen plenty of places where we've simplified bitfield extractions/insertions with other using/generating logic (masks, shifts, and the like).

The main generator of the bitwise manipulation is SROA and if it was the only creator of these intrinsics, it would already make things simpler. We'd also need to teach the combiner about them, but that's still fairly localized.
On the other hand, they would make the intent of the code explicit. Targets could handle them any way they want, whether it's loading an i128, or an i16 from an adjusted offset.

I wouldn't want just SROA to produce them. If we're going to have them, then we should canonicalize toward them. That implies having matching logic, so perhaps we're back where we started.

  -Hal

[The original RFC quoted in full.]

This makes sense to me. Should we just fix CGP to update the DT instead of
working around it?

-Hal

I am not familiar enough with CGP to tell. I just noticed that
ModifiedDT is set in several places in CGP, which indicates there are a
few transformations that would each need their own dominator tree
maintenance work. And I remember SimplifyCFG also doesn't use the
dominator tree, to save the effort of maintaining it on the fly. So
maybe it is easier to separate CGP into two parts: the transformations
that need the dominator tree and those that don't. The transformations
that don't need the dominator tree can stay together.

Currently, I know constant hoisting is another potential CGP
transformation that was left out because of its need for the dominator
tree. But constant hoisting has already evolved to contain a
substantial amount of code, so it may be better for it to stay as a
separate pass.

Thanks,
Wei.

I think it’d be nice to keep CGP as “a grab-bag of relatively simple transformations that don’t require/preserve any complex analyses”.

Note that that's not quite where we're at right now - while CGP doesn't seem to use AA at all, it does have one transformation that uses LI and DT - eliminateMostlyEmptyBlocks. And it constructs those on demand.
We may want to split that out as well.

(The "ModifiedDT" flag is a bit of a red herring - IIUC, what it really means is ModifiedCFG. I think the only place that uses it is the optimizeBlock() iteration - and that probably ought to be using a worklist instead...)

:( I certainly don't mind breaking things into separate passes so long as there aren't phase-ordering problems. I'll point out, however, that:

- If they're relatively simple transformations, then updating the DT is probably not that hard (obviously I could have missed something, but I skimmed the code and this seems to be the case).
- There are certainly things that are simple to do, but only if you have some proper analysis results. People tend to do hacky things to work around not having analysis results, and that's not something we want to encourage.

Thus, I'd still vote for fixing CGP to preserve the DT.

  -Hal

[The original RFC quoted in full.]

This makes sense to me. Should we just fix CGP to update the DT instead of
working around it?

  -Hal

I am not familiar enough with CGP to tell. I just noticed that
ModifiedDT is set in several places in CGP, which indicates there are a
few transformations that would each need their own dominator tree
maintenance work. And I remember SimplifyCFG also doesn't use the
dominator tree, to save the effort of maintaining it on the fly. So
maybe it is easier to separate CGP into two parts: the transformations
that need the dominator tree and those that don't. The transformations
that don't need the dominator tree can stay together.

Currently, I know constant hoisting is another potential CGP
transformation that was left out because of its need for the dominator
tree. But constant hoisting has already evolved to contain a
substantial amount of code, so it may be better for it to stay as a
separate pass.

That might very well be the case here too (the matching logic might grow in complexity). CGP has now grown to over 5K lines, and there are definitely pieces that can get split out. There is a big chunk that runs iteratively, however, so that might not buy us as much as we might hope.

  -Hal

We could add intrinsics to extract/insert a bitfield, which would
simplify a lot of that bitwise logic.

But then you need to teach a bunch of places how to simplify them, how to
fold bitwise logic and other things that reduce demanded bits into
them, etc. This seems like a difficult tradeoff.

It’s a bit worse than that from what I’ve seen.

We have a fair amount of code that looks like bitfield accesses but is done manually. So we would have a serious canonicalization issue as well needing to canonicalize toward these intrinsics. =[
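
(For example - value names invented here - a hand-written
"(x & ~0xFF00) | ((v & 0xFF) << 8)" lowers to essentially the same IR
pattern clang emits for storing an 8-bit bitfield at bit offset 8:

%m = and i64 %x, -65281           ; x & ~0xFF00
%v8 = and i64 %v, 255
%sh = shl i64 %v8, 8
%r = or i64 %m, %sh

so canonicalizing toward intrinsics would have to recognize all of this
hand-written code as well.)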

Bitfield extraction/insertion generally does not simplify well with
surrounding arithmetic.

Maybe. I’ve seen plenty of places where we’ve simplified bitfield
extractions/insertions with other using/generating logic (masks, shifts,
and the like).

We have had numerous test cases from benchmarks where combining basic bitfield math is important. So while in some cases it may not be important, in a number of cases it is and so I feel like we should design things to support reasonably comprehensive combining of the bitfield accesses.

The main generator of the bitwise manipulation is SROA

?? Clang directly generates the bit insert and extract patterns as well?

[The original RFC and the exchange above between Hal and Wei quoted in full.]

That might very well be the case here too (the matching logic might grow
in complexity). CGP has now grown to over 5K lines, and there are
definitely pieces that can get split out. There is a big chunk that runs
iteratively, however, so that might not buy us as much as we might hope.

I think a lot of the reason we used CGP as a grab-bag was that for a while we didn't really have a compelling model for why we were moving out of canonical form. But we've really clarified the pass pipeline structure now, so I'd love to try to build reasonable "optimization" (as opposed to "canonicalization" or "simplification") passes and organize them nicely.

One perhaps especially notable thing: this will often end up narrowing. I think that it would be very valuable to try and place this or other passes that effectively narrow data access types before the vectorizers. This seems important to enable efficient packing of element types.
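
(A rough sketch of what I mean, with invented names: if each loop iteration
ends up as a plain i16 store after shrinking,

store i16 %v16, i16* %elt, align 2

the loop vectorizer can pack consecutive iterations into something like

store <8 x i16> %vec, <8 x i16>* %eltvec, align 2

whereas the unshrunk form does a full i64 load/and/or/store per element and
is much harder to vectorize profitably.)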

Past that, I generally like the plan of a dedicated pass to narrow integer stuff across basic blocks, etc., using MemorySSA and other efficient tools. =]

-Chandler

We have had *numerous* test cases from benchmarks where combining basic
bitfield math is important. So while in some cases it may not be
important, in a number of cases it is and so I feel like we should
design things to support reasonably comprehensive combining of the
bitfield accesses.

Interesting. What usually happens in our cases is that the bitwise extraction logic gets combined into code that is hard to generate optimal code for. It's usually the shifts and ands from the extraction code itself that get rewritten between themselves, but I haven't seen many cases where the surrounding arithmetic would be a part of it (I meant addition, subtraction, etc, not further bitwise ops). When the surrounding code has more bitwise operations on the extracted bitfield, the extraction may be completely eliminated (which is something that the combiner could do).

I agree that if we tried to aggressively canonicalize towards such intrinsics, it would likely create more problems than it would solve.

That said, it was just a loose idea, not something I intend to insist on...

    > The main generator of the bitwise manipulation is SROA

?? Clang directly generates the bit insert and extract patterns as well?

Sorry, I didn't mean to pick on SROA. ;)

-Krzysztof

Sounds good (although we still need to be careful not to lose the iterative nature of the current optimizations in cases where that's important).

I think it would be useful to work through some examples in this context. It might help vectorization to narrow beforehand, or it might end up requiring more scatter/gather load/stores (whereas, in the context of vectorization, the shifts and masks might have been more efficient).

  -Hal