[RFC] Stripping unusable intrinsics

llvm-dev,

In my ongoing saga to improve LLVM for embedded use, we would like to support stripping out unused intrinsics based on the LLVM targets actually being built.

I’ve attached two patches.

The first is a new flag for tablegen to take a list of targets. If passed, tablegen will only emit intrinsics that either have an empty target prefix or a target prefix matching one of the targets in the list. If the flag is not passed, the behavior is unchanged. This patch can land today (subject to review).
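A rough C++ sketch of the filtering rule described above. `IntrinsicRecord` and `filterIntrinsics` are hypothetical names, not the code in tablegen.diff. Matching prefixes against target names case-insensitively is also an assumption, and it would not be enough on its own, since some prefixes differ from the target name (NVPTX intrinsics, for example, use the `nvvm` prefix).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for an intrinsic definition; not TableGen's real Record class.
struct IntrinsicRecord {
  std::string Name;         // e.g. "llvm.x86.sse.sqrt.ss"
  std::string TargetPrefix; // e.g. "x86"; empty for target-independent intrinsics
};

// Assumption: a prefix matches a target name if they are equal ignoring case.
static bool equalsIgnoreCase(const std::string &A, const std::string &B) {
  return A.size() == B.size() &&
         std::equal(A.begin(), A.end(), B.begin(),
                    [](unsigned char X, unsigned char Y) {
                      return std::tolower(X) == std::tolower(Y);
                    });
}

// Targets == std::nullopt models "flag not passed": the list is returned unchanged.
// Otherwise keep intrinsics with an empty prefix or a prefix naming a listed target.
std::vector<IntrinsicRecord>
filterIntrinsics(const std::vector<IntrinsicRecord> &All,
                 const std::optional<std::vector<std::string>> &Targets) {
  if (!Targets)
    return All;
  std::vector<IntrinsicRecord> Kept;
  for (const IntrinsicRecord &I : All) {
    bool PrefixBuilt = std::any_of(
        Targets->begin(), Targets->end(),
        [&](const std::string &T) { return equalsIgnoreCase(T, I.TargetPrefix); });
    if (I.TargetPrefix.empty() || PrefixBuilt)
      Kept.push_back(I);
  }
  return Kept;
}
```

Passing `std::nullopt` models the flag being absent, while passing an empty list keeps only the target-independent intrinsics.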

The second patch is a WIP. It adds support to the CMake build system for using the new tablegen flag, and for generating a new llvm/Config/llvm-targets.h header that contains a define for each target specified in LLVM_TARGETS_TO_BUILD.
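To make the header's shape concrete, here is a hedged C++ model of the mapping from `LLVM_TARGETS_TO_BUILD` to defines. In the WIP patch the generation happens in CMake; the `LLVM_TARGET_<NAME>` macro spelling and the include guard below are assumptions, since the thread does not show the actual names.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Models the generated llvm/Config/llvm-targets.h. LLVM_TARGETS_TO_BUILD is a
// CMake list, i.e. a semicolon-separated string such as "ARM;X86".
// The LLVM_TARGET_<NAME> macro spelling is hypothetical.
std::string makeTargetsHeader(const std::string &TargetsToBuild) {
  std::ostringstream Out;
  Out << "#ifndef LLVM_CONFIG_LLVM_TARGETS_H\n"
      << "#define LLVM_CONFIG_LLVM_TARGETS_H\n";
  std::istringstream In(TargetsToBuild);
  std::string Target;
  while (std::getline(In, Target, ';')) {
    if (Target.empty())
      continue; // tolerate stray separators such as "ARM;;X86"
    std::string Macro = "LLVM_TARGET_";
    for (unsigned char C : Target)
      Macro += static_cast<char>(std::toupper(C));
    Out << "#define " << Macro << " 1\n";
  }
  Out << "#endif // LLVM_CONFIG_LLVM_TARGETS_H\n";
  return Out.str();
}
```

For `"ARM;X86;AArch64"` this yields defines for `LLVM_TARGET_ARM`, `LLVM_TARGET_X86`, and `LLVM_TARGET_AARCH64`; targets that are not listed get no define at all, which is what `#ifdef` checks rely on.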

This new header will allow us to #ifdef code using target-specific intrinsics outside the targets, thus allowing us to strip out all the unused intrinsics.
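A minimal sketch of what such #ifdef'd code might look like in a generic pass, assuming a hypothetical `LLVM_TARGET_ARM` define and a simplified stand-in for the generated intrinsic enum. The `#define` at the top plays the role of the generated header in an ARM-only build.

```cpp
#include <cassert>
#include <string>

// Stand-in for the generated llvm/Config/llvm-targets.h in an ARM-only build.
// The macro names are hypothetical; LLVM_TARGET_X86 is intentionally not defined.
#define LLVM_TARGET_ARM 1

// Simplified stand-in for the generated intrinsic enum. With the tablegen
// filter, x86 enumerators simply do not exist in an ARM-only build.
enum class IntrinsicID {
  not_intrinsic,
  memcpy,
#ifdef LLVM_TARGET_ARM
  arm_neon_vld1,
#endif
#ifdef LLVM_TARGET_X86
  x86_sse_sqrt_ss,
#endif
};

// A generic pass that special-cases some intrinsics: every case naming a
// target intrinsic must be guarded, or the build breaks when that target is off.
std::string describe(IntrinsicID ID) {
  switch (ID) {
  case IntrinsicID::memcpy:
    return "generic memory intrinsic";
#ifdef LLVM_TARGET_ARM
  case IntrinsicID::arm_neon_vld1:
    return "ARM NEON load";
#endif
#ifdef LLVM_TARGET_X86
  case IntrinsicID::x86_sse_sqrt_ss:
    return "x86 SSE square root";
#endif
  default:
    return "no special handling";
  }
}
```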

-Chris

cmake-build.diff (1.72 KB)

tablegen.diff (1.49 KB)

How much code is there that looks at target specific intrinsics from generic IR passes? Can we move this code into something like TargetTransformInfo?

Chris,

How much do you save by doing this?

-Hal

How much code is there that looks at target specific intrinsics from
generic IR passes?

I suspect quite a bit transitively. ValueTracking, constant folding,
instsimplify, instcombine, etc.

Can we move this code into something like TargetTransformInfo?

Yikes, no. That's behind an abstract common interface. All of the above are
specialized specific uses of narrow interfaces in relatively hot paths of the
optimizer.

InstCombineCalls does some of this. Ideally we’d move this to something like InstCombineCalls_X86, etc., or at the very least to a method for each target that is #ifdef’ed.
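Pete's alternative can be sketched as one dispatch point per target: the target-specific handling moves into per-target functions (which would live in files like the InstCombineCalls_X86 he mentions), and only the dispatch needs #ifdefs. Everything here, including `IntrinsicTarget`, `CallInfo`, and the `LLVM_TARGET_ARM` macro, is a hypothetical simplification rather than InstCombine's real interface.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for llvm-targets.h in an ARM-only build.
#define LLVM_TARGET_ARM 1

// Which target an intrinsic belongs to (hypothetical simplification).
enum class IntrinsicTarget { None, ARM, X86 };

struct CallInfo {
  std::string Name;       // e.g. "llvm.arm.neon.vld1"
  IntrinsicTarget Target; // owning target, or None if target-independent
};

#ifdef LLVM_TARGET_ARM
// Would live in a per-target file that is only compiled when ARM is built.
static std::optional<std::string> simplifyARMCall(const CallInfo &CI) {
  if (CI.Name == "llvm.arm.neon.vld1")
    return std::string("simplified ARM load");
  return std::nullopt;
}
#endif

#ifdef LLVM_TARGET_X86
// Would live in InstCombineCalls_X86, compiled only when X86 is built.
static std::optional<std::string> simplifyX86Call(const CallInfo &CI) {
  if (CI.Name == "llvm.x86.sse.sqrt.ss")
    return std::string("simplified x86 sqrt");
  return std::nullopt;
}
#endif

// The generic pass keeps a single #ifdef per target, at the dispatch point.
std::optional<std::string> simplifyTargetCall(const CallInfo &CI) {
  switch (CI.Target) {
#ifdef LLVM_TARGET_ARM
  case IntrinsicTarget::ARM:
    return simplifyARMCall(CI);
#endif
#ifdef LLVM_TARGET_X86
  case IntrinsicTarget::X86:
    return simplifyX86Call(CI);
#endif
  default:
    return std::nullopt; // target-independent, or target not built
  }
}
```

The trade-off is that target knowledge is concentrated in a few files, at the cost of splitting logic that currently sits inline in the generic pass.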

There’s also code in SelectionDAGBuilder for lowering some target specific intrinsics.

Thanks,
Pete

Yeah. Quite a bit is a good way to put it. Just looking for ‘Intrinsic::x86’ hits AutoUpgrade, ConstantFolding, InstCombineCalls, LoopStrengthReduce, MemorySanitizer, SelectionDAGBuilder and ValueTracking.

I didn’t look at other targets. ARM is probably supported in a similar number of places to x86. Other targets probably less so.

Pete

I like the general idea and, as you asked on IRC, will happily help with the autoconf changes. Do you have a small (even pseudo) code example of what the changes to the middle end machinery will look like?

-eric

FWIW, I'm actually kinda terrified of the changes this will require
throughout the optimizer.

I think it will be particularly hard when doing canonicalization to avoid
subtle bugs where the lack of a code path to handle a target intrinsic
causes a different ranking of patterns... Maybe my fears are misplaced
here, but it isn't yet apparent how to separate all of the uses of target
intrinsics in the optimizer out of the surrounding code cleanly, without
severely negatively impacting the readability and maintainability of the
code.

However, if the savings of doing this are massive, then perhaps it's worth
all the complexity it brings. But if the savings are small, I'm much more
skeptical. So I feel like we really need an answer to Hal's question.

That’s kinda why I was interested in what the changes to the middle end would look like :)

Numbers are good though.

-eric

The big change required beyond the patches in my first email is that any use of a target intrinsic will need to be #ifdef’d. I haven’t tracked all of those down yet.

-Chris

That’s a little gross. I wonder if there’s a better abstraction for this; I think having to check whether a preprocessor define is set would be painful for maintenance and testing.

-eric

I’ve got some new patches and some numbers.

The patches are all the changes required to strip intrinsics using the preprocessor defines I showed in my earlier patches. It actually isn’t that much of a change to LLVM. Clang will need similar changes too. There were only 9 files that referenced target intrinsics outside the corresponding target.

These patches are still WIP, and there is some nastiness, but they work.

Using these patches, I see about a 500 KB reduction in binary size when building libLLVM.dylib with just the ARM backend enabled.

du -k …/*.dylib
34236 …/libLLVM-ARM-only-after.dylib
34732 …/libLLVM-ARM-only-before.dylib
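For reference, the arithmetic behind those two numbers (`du -k` reports sizes in 1 KiB blocks):

```latex
\Delta = 34732\ \text{KiB} - 34236\ \text{KiB} = 496\ \text{KiB},
\qquad \frac{496}{34732} \approx 1.43\%
```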

These savings are substantial for our use case.

-Chris

strip_intrinsics.diff (18.5 KB)

I'm sort of curious that ~1.5% savings is "substantial". It smells of
diminishing returns/microoptimization. If there is really such a dire need
for size, surely there are whole parts of the API that aren't used by your
application and could be stripped for much greater benefit?

-- Sean Silva

That library isn’t actually representative of what we ship in terms of its overall size. For that test I only stripped unused backends. We also strip unused functionality, but we also have out-of-tree functionality that we add in.

Our shipping library for WebKit is under 15MB, which would put this closer to ~3% savings. As far as low-hanging fruit goes, that’s a pretty big one.
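Spelling out that estimate, assuming the same 496 KiB absolute saving carries over to the WebKit library and taking 15 MB as 15 × 1024 KiB. Since the library is under 15 MB, the actual fraction would be at least this large:

```latex
\frac{496}{15 \times 1024} = \frac{496}{15360} \approx 3.2\%
```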

-Chris

Ah, that makes more sense.

-- Sean Silva

Random shower thought:

I think the markup can be minimized if it only appears once in the header where the enums are defined instead of at every place where the enums are used. Then we could value propagate that certain enum values are never possible where they're checked for. That should generally be able to strip the same set of stuff but use less markup.
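Without compiler support, the idea can be approximated in standard C++: record availability once, next to the enum, and route use sites through a constexpr helper so that checks for impossible values fold to `false` and the guarded code can typically be discarded by the optimizer. The build flags, `isPossible`, and `matches` are hypothetical. Use sites still have to call the helper instead of a plain `==` or `switch`; presumably that remaining markup is what a compiler-level mechanism would remove.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-target build flags for an ARM-only build; in LLVM these
// would be derived from the generated config header.
constexpr bool ARMBuilt = true;
constexpr bool X86Built = false;

enum class IntrinsicID { memcpy, arm_neon_vld1, x86_sse_sqrt_ss };

// The single place availability is recorded, kept right next to the enum.
constexpr bool isPossible(IntrinsicID ID) {
  switch (ID) {
  case IntrinsicID::arm_neon_vld1:
    return ARMBuilt;
  case IntrinsicID::x86_sse_sqrt_ss:
    return X86Built;
  default:
    return true;
  }
}

// Use sites compare through this helper. When Expected is impossible, the
// result is the constant false, so the guarded code becomes dead.
constexpr bool matches(IntrinsicID Actual, IntrinsicID Expected) {
  return isPossible(Expected) && Actual == Expected;
}

std::string describe(IntrinsicID ID) {
  if (matches(ID, IntrinsicID::memcpy))
    return "generic memory intrinsic";
  if (matches(ID, IntrinsicID::arm_neon_vld1))
    return "ARM NEON load";
  if (matches(ID, IntrinsicID::x86_sse_sqrt_ss)) // constant false in this build
    return "x86 SSE square root";
  return "no special handling";
}
```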

Alex

Ok… I bit.

Alex’s proposal here is really compelling to me because it means that the required changes to make this work would be more limited. Specifically a clang attribute could give us all the benefits of #ifdefs throughout the code without the maintenance burden. So, being the silly person I am, I wrote the patches for clang.

I’ve never done any frontend hacking before, so take these with giant cellars of salt, but the concept seems sound to me.

The patches do the following:
(1) Add a new C++11-style [[impossible_enum]] attribute
(2) Any case statement that has [[impossible_enum]] applied to it is not emitted during IRGen; the case bodies are still emitted so as not to interfere with fall-through, but blocks that cannot be entered are optimized away
(3) Equality comparisons against [[impossible_enum]] values are always false; all other comparisons are always true
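A hedged illustration of source written against the proposed attribute, assuming (as the later tablegen changes suggest) that it is placed on enumerators of the generated enum. `[[clang::impossible_enum]]` exists only in these patches, not in upstream clang: a stock compiler warns that the attribute is unknown and ignores it (an attribute on an enumerator requires C++17), so the code below still compiles and behaves normally. The comments note what the patched compiler would do per points (2) and (3).

```cpp
#include <cassert>
#include <string>

// Hypothetical generated enum for an ARM-only build. The x86 enumerator is
// still declared, so code that names it compiles, but it is marked impossible.
enum class IntrinsicID {
  memcpy,
  arm_neon_vld1,
  x86_sse_sqrt_ss [[clang::impossible_enum]],
};

std::string describe(IntrinsicID ID) {
  switch (ID) {
  case IntrinsicID::memcpy:
    return "generic memory intrinsic";
  case IntrinsicID::arm_neon_vld1:
    return "ARM NEON load";
  case IntrinsicID::x86_sse_sqrt_ss:
    // Per (2): with the patched clang this case label is not emitted, so the
    // body becomes unreachable and is optimized away.
    return "x86 SSE square root";
  }
  return "unknown";
}

bool isX86Sqrt(IntrinsicID ID) {
  // Per (3): with the patched clang this == comparison folds to false.
  return ID == IntrinsicID::x86_sse_sqrt_ss;
}
```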

There was some discussion on IRC today whether or not this was the right way to do this, but I thought I’d send these patches out anyways so people can take a look.

The attached diff is for clang; I’ve also attached a C++ test file.

-Chris

impossible_enum.cpp (937 Bytes)

impossible_enum.diff (6.97 KB)

Putting aside the several minor bikesheds we will get to if we go this route, how close does this approach get to the code shrink you were originally trying to achieve?

Are there any structures other than switch statements that need to go on a diet too? e.g. comparisons?

Alex

I haven’t yet adjusted my tablegen changes to take advantage of the attribute, but it is on my list for today.

My patches impact both case statements and comparisons: == comparisons are always false, and any other comparison is always true.

-Chris

More diffs to enjoy.

I’ve updated my tablegen patches to work with the clang::impossible_enum attribute. With these changes I am seeing the same code size reductions that the #ifdefs were showing, without all the ifdefs.
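A minimal sketch of the emitter side of this approach: instead of dropping intrinsics for targets that are not built (as the first tablegen flag does), they are still emitted but tagged with the attribute, so code that names them keeps compiling. `IntrinsicRecord`, `emitIntrinsicEnum`, and the enum layout are hypothetical simplifications, not the contents of the attached tablegen.diff.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-in for an intrinsic definition (hypothetical).
struct IntrinsicRecord {
  std::string EnumName;     // e.g. "x86_sse_sqrt_ss"
  std::string TargetPrefix; // e.g. "x86"; empty if target-independent
};

// Emits the intrinsic enum. Every intrinsic is emitted; those whose target
// prefix is not in BuiltPrefixes are tagged [[clang::impossible_enum]].
std::string emitIntrinsicEnum(const std::vector<IntrinsicRecord> &Records,
                              const std::set<std::string> &BuiltPrefixes) {
  std::ostringstream OS;
  OS << "enum ID : unsigned {\n  not_intrinsic = 0,\n";
  for (const IntrinsicRecord &R : Records) {
    OS << "  " << R.EnumName;
    if (!R.TargetPrefix.empty() && BuiltPrefixes.count(R.TargetPrefix) == 0)
      OS << " [[clang::impossible_enum]]";
    OS << ",\n";
  }
  OS << "};\n";
  return OS.str();
}
```

Compared with the #ifdef approach, the markup lives only in generated code, which is the point of Alex's suggestion.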

That’s about a 2% decrease in our library for WebKit.

See attached patches for llvm & clang.

-Chris

tablegen.diff (13.2 KB)

impossible_enum.diff (6.97 KB)