RFC: speedups with instruction side-data (ADCE, perhaps others?)

I’ve been playing around with optimizing performance various passes and noticed something about ADCE: it keeps an Alive set that requires a lot of upkeep time-wise, but if each instruction had a /single bit/ of side data (to represent liveness, solely within the pass), the set wouldn’t be needed. I tested this out and observed a ~1/3 reduction in time in ADCE: 1454ms to 982ms according to a profile over our test suite (for a total of 0.6% compile time savings in our pipeline).

Are there any other passes that could benefit from having a single bit (or similarly small amount) of per-Instruction metadata local to the pass, i.e. to avoid having to keep a side-Set of instructions that’s only used to test membership of the set (i.e. not iterated over)? Is this sort of thing something reasonable (maybe it could be stuffed in the SubclassData short and exposed via a public API?)

—escha

Here’s a simple prototype of the kind of thing I mean:

diff --git a/include/llvm/IR/Value.h b/include/llvm/IR/Value.h
index 9d83cef..d48f8f3 100644
--- a/include/llvm/IR/Value.h
+++ b/include/llvm/IR/Value.h
@@ -104,13 +104,17 @@ protected:
   ///
   /// Note, this should *NOT* be used directly by any class other than User.
   /// User uses this value to find the Use list.
- enum : unsigned { NumUserOperandsBits = 29 };
+ enum : unsigned { NumUserOperandsBits = 28 };
   unsigned NumUserOperands : NumUserOperandsBits;

   bool IsUsedByMD : 1;
   bool HasName : 1;
   bool HasHungOffUses : 1;

+ /// \brief A 1-bit marker that can be used freely within passes, e.g. for
+ /// mark-sweep algorithms like ADCE, to avoid keeping a side data set.
+ bool Marker : 1;

FWIW, I have a target specific pass that needs exactly this sort of
tracking. Once you offer 1 bit, it's a slippery slope to giving a
whole pointer. I like your idea regardless.

I did something similar for dominators, for GVN, etc.
All see significant speedups.

However, the answer i got back when i mentioned this was "things like
ptrset and densemap should only have a small performance difference
from side data when used and sized right", and i've found this to
mostly be true after looking harder.

In the case you are looking at, i see:

- SmallPtrSet<Instruction*, 128> Alive;

This seems ... wrong.
In fact, it seems optimally bad.

This says the small ptr set size is 128.
That is, the smallsize is 128.

SmallPtrSet will linear search the array when it's size <= smallsize,
and otherwise fall back to building a non-small array and using better
algorithms.

Linear searching a 128 member array == slow

I bet if you change the 128 to 8, you will see significant speedups.

I tried changing it to 16 and only saw a very small speedup. I’m pretty sure that no matter how you slice it, adding every single instruction in a function to a set is going to be expensive.

—escha

Given the use case here, I somewhat agree. In the common case, you're going to need an extra O(I) memory to represent the alive set. However, there are definitely ways to do better without needing transient data stored in the instruction itself.

Have you considered a bloom filter or something equivalent for testing aliveness? We don't actually need a precise result here - right?. If we're willing to tolerate a slightly imprecise result (in theory), using a bloom filter could give noticeable memory reduction.

p.s. I disagree with Daniel on sizing. While linear search over a 128 element array is somewhat slow, memory allocation/churn can be far far worse. I haven't looked at this particular use case, but in general, larger small sizes for frequently called functions can make a lot of sense.

Philip

I should note, when I was writing my simplifyInstructionsInBlock replacement, I got a similarly large speedup (~1000ms -> 700ms) with the trick where I only added instructions to the worklist if they needed to be /revisited/, as opposed to initting the entire worklist with all the instructions. This is likely a similar gain as to what I got here because it’s the difference between “put everything in the function in a set” and “putting only a few things from the function in a set”, so the gain from Not Putting Tons Of Stuff In A Set seems to be pretty consistent across widely varying use-cases.

—escha

What is the contract between passes for the marker bit? I.e. do passes need to “clean up” the marker bit to a predetermined value after they run? (or they need to assume that it might be dirty? (like I think you do below))

– Sean Silva

What is the contract between passes for the marker bit? I.e. do passes need to “clean up” the marker bit to a predetermined value after they run? (or they need to assume that it might be dirty? (like I think you do below))

Good questions, and to add to this, what happens if we run passes in parallel?

I can imagine a world where we run 2 function passes in parallel so tagging instructions is ok, but what if we then started to run 2 BB passes in parallel, or used the tag to track something like GlobalValue. Going to break horribly in such scenarios.

Pete

I tried changing it to 16 and only saw a very small speedup. I’m pretty sure that no matter how you slice it, adding every single
instruction in a function to a set is going to be expensive.

Yes, it will be.

For reference, we had two ways of doing this in GCC:

One, things had id numbers, so we could create bitmaps that told us
what was in a set without storing pointers :slight_smile:
There were also flags, but in practice, flags were as fast as the bitmaps.

What is the contract between passes for the marker bit? I.e. do passes
need to "clean up" the marker bit to a predetermined value after they run?
(or they need to assume that it might be dirty? (like I think you do below))

Good questions, and to add to this, what happens if we run passes in
parallel?

I can imagine a world where we run 2 function passes in parallel so
tagging instructions is ok, but what if we then started to run 2 BB passes
in parallel, or used the tag to track something like GlobalValue. Going to
break horribly in such scenarios.

Do we ever plan to have two passes mutating the IR in parallel? I assume
there would be much larger complications.

-- Sean Silva

Oh yeah, it would be a nightmare to find all the places to fix. So no idea if we’ll ever do it, but if we were going to, then this would make it even harder. Just want to make sure we’re ok with that.

Personally I don’t like the idea of adding a bit to User to track this. If we don’t have the correct data structure to track this, then I feel like we should try write one… if thats even possible.

Pete

I would assume that it’s just considered to be garbage. I feel like any sort of per-pass side data like this should come with absolute minimal contracts, to avoid introducing any more inter-pass complexity.

—escha

I would think this would need to be a verifier error if it were ever non-0

I would assume that it’s just considered to be garbage. I feel like any sort of per-pass side data like this should come with absolute minimal contracts, to avoid introducing any more inter-pass complexity.

I would think this would need to be a verifier error if it were ever non-0

+1

Otherwise every pass which ever needs this bit would have to first zero it out just to be safe, adding an extra walk over the whole functions.

Of course otherwise the pass modifying it will have to zero it, which could also be a walk over the whole function. So either way you have lots iterate over, which is why i’m weary of this approach tbh.

My ADCE patch doesn’t need an extra walk over the function; it just marks them as Alive or Not Alive during the first walk.

—escha

Every pass which ever uses this internally would have to set it to zero when it is done, adding an extra walk over the whole functions as you noticed. This goes against “you don’t pay for what you don’t use”, so definitively -1 for this. Better to cleanup before use.
I agree that the approach does not scale/generalize well, and we should try to find an alternative if possible. Now *if* it is the only way to improve performance significantly, we might have to weight the tradeoff.

Does anyone have any concrete alternative suggestions to achieve the speedup demonstrated here?

—Owen

From: "Owen Anderson via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Mehdi Amini" <mehdi.amini@apple.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Tuesday, September 15, 2015 4:16:58 PM
Subject: Re: [llvm-dev] RFC: speedups with instruction side-data
(ADCE, perhaps others?)

I would assume that it’s just considered to be garbage. I feel like
any sort of per-pass side data like this should come with absolute
minimal contracts, to avoid introducing any more inter-pass
complexity.
I would think this would need to be a verifier error if it were ever
non-0
+1

Otherwise every pass which ever needs this bit would have to first
zero it out just to be safe, adding an extra walk over the whole
functions.

Of course otherwise the pass modifying it will have to zero it, which
could also be a walk over the whole function. So either way you have
lots iterate over, which is why i’m weary of this approach tbh.

Every pass which ever uses this internally would have to set it to
zero when it is done, adding an extra walk over the whole functions
as you noticed. This goes against “you don’t pay for what you don’t
use”, so definitively -1 for this. Better to cleanup before use.
I agree that the approach does not scale/generalize well, and we
should try to find an alternative if possible. Now *if* it is the
only way to improve performance significantly, we might have to
weight the tradeoff.

Does anyone have any concrete alternative suggestions to achieve the
speedup demonstrated here?

Are these actually free bits, or does this increase the size of Value? If they're free, how many free bits do we have in Value and does this differ on common 32-bit and 64-bit systems?

Also, this:

I should note, when I was writing my simplifyInstructionsInBlock
replacement, I got a similarly large speedup (~1000ms -> 700ms) with
the trick where I only added instructions to the worklist if they
needed to be /revisited/, as opposed to initting the entire worklist
with all the instructions. This is likely a similar gain as to what
I got here because it’s the difference between “put everything in
the function in a set” and “putting only a few things from the
function in a set”, so the gain from Not Putting Tons Of Stuff In A
Set seems to be pretty consistent across widely varying use-cases.

Can this be generalized to other relevant use cases?

-Hal

—Owen
_______________________________________________
LLVM Developers mailing list
llvm-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

I would assume that it’s just considered to be garbage. I feel like any sort of per-pass side data like this should come with absolute minimal contracts, to avoid introducing any more inter-pass complexity.

I would think this would need to be a verifier error if it were ever non-0

+1

Otherwise every pass which ever needs this bit would have to first zero it out just to be safe, adding an extra walk over the whole functions.

Of course otherwise the pass modifying it will have to zero it, which could also be a walk over the whole function. So either way you have lots iterate over, which is why i’m weary of this approach tbh.

Every pass which ever uses this internally would have to set it to zero when it is done, adding an extra walk over the whole functions as you noticed. This goes against “you don’t pay for what you don’t use”, so definitively -1 for this. Better to cleanup before use.
I agree that the approach does not scale/generalize well, and we should try to find an alternative if possible. Now if it is the only way to improve performance significantly, we might have to weight the tradeoff.

Does anyone have any concrete alternative suggestions to achieve the speedup demonstrated here?

Honestly, not an alternative to speedup ADCE. This appears to be the fastest way to improve that particular pass.

My worry here comes from prior experience. I have done this exactly change years ago on a non-LLVM compiler. Adding one bit to each instruction was great, we used it in a few passes, then we found we needed a second bit, then a third. That quickly became unmanageable, and if we add this bit then we put ourselves on a repeat of that.

I think we should weigh up what this change gets us in overall compile time. A 30% speedup in ADCE is great, and if a few other passes would benefit from this and get similar speedups, then even better. However, if thats only worth say 1% of overall compile time, then i’d argue its not worth it.

Pete