[RFC] Stripping unusable intrinsics

This seems to indicate that the idea is a workable solution to your use case.

We’ve still got some bikesheds to get through. (And I don’t consider myself a good reviewer for Clang in this, so we need to identify who is.)

As background to this extension idea, let me say that I prefer the notion that developers, particularly perf-aware developers like gamedevs, will want to handle PGO via markup. The idea is for them to turn PGO measurements into consciously chosen markup, since a given test run may not be fully representative. (e.g., they may not play the whole game through each time.)

So the idea is to have the enum be marked with a weight instead of just “impossible.” The impossible value would then be, say, -inf. I’d suggest that weights be floats in either the 0 to 1 or the 0 to 100 range, whichever works best with the PGO infrastructure.
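For illustration, the markup might look something like this (entirely hypothetical attribute name and syntax; nothing like it exists in Clang today):

   #include <cmath> // INFINITY

   // Weights bias optimization; -inf would mean "cannot occur at all".
   enum class Opcode {
     Add,                                        // unannotated: weight unknown
     Mul [[clang::enum_weight(0.95)]],           // hot in profiles
     Trap [[clang::enum_weight(0.0)]],           // never observed, still possible
     X86Pause [[clang::enum_weight(-INFINITY)]]  // impossible: safe to strip
   };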

Alex

The patch looks pretty reasonable. You should send it to Aaron Ballman,
too. He's been doing all the Clang attribute machinery refactoring to make
them easy.

Personally, I don't think we need to conflate impossible and improbable.
They seem different enough to warrant a second attribute. A separate
"enum_probability" attribute is interesting, though.

I will second this, but more strongly. PGO is often sample based. As a result, a 0% observed probability does not imply that the code is always unreachable, and assuming otherwise would be incorrect. Having a mechanism to distinguish between “I’ve seen no samples here” and “this code is guaranteed unreachable” is required for correctness.

As a bikeshed suggestion, why not name the impossible attribute [[unreachable]]? We use that name for llvm_unreachable and related things. The semantics are pretty much the same, just expressed slightly differently, right?

Actually, why can’t your objective be achieved using the existing llvm_unreachable mechanism? Create a macro which expands to llvm_unreachable on all but one target, and use that in the implementation of the case statement. E.g.:

   case something_x86_specific: {
     REACHABLE_ON_X86_ONLY();
     ... actual code here ...
   }

Philip
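For concreteness, the macro Philip describes might be defined along these lines (a sketch; LLVM_SUPPORTS_X86 and REACHABLE_ON_X86_ONLY are invented names, not existing LLVM macros):

   #include "llvm/Support/ErrorHandling.h" // for llvm_unreachable

   // Hypothetical: builds that compile in the x86 target would define
   // LLVM_SUPPORTS_X86.
   #ifdef LLVM_SUPPORTS_X86
   #define REACHABLE_ON_X86_ONLY() ((void)0)
   #else
   // In no-asserts builds llvm_unreachable lowers to
   // __builtin_unreachable, so the optimizer can delete the rest of the
   // case body entirely.
   #define REACHABLE_ON_X86_ONLY()                                        \
     llvm_unreachable("x86-only path reached in a build without x86")
   #endif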

I think it can be achieved this way, but it would require making widespread
manual LLVM changes instead of making minor attribute changes to generated
headers. Seems like a win to me. :)

I think they are radically different, and we shouldn't really discuss them
the same way.

The "impossible" *cannot* be produced through profiles for example. We're
talking about an attribute that will cause the code to *miscompile* if a
particular value ever appears.

FWIW, I still strongly object to this entire approach.

If we want to add support for this magical attribute to clang, fine, but
honestly I don't think it's that relevant for the issue at hand.

The critical question to my mind is why are intrinsics taking up so much
space. I think it would be much more profitable to make intrinsics compact
than to add hacks to LLVM to exclude target intrinsics entirely. Especially
based on a fairly subtle and magical enum attribute. It changes too much of
the meaning of enums.

For example, what happens when an "impossible" enum is used to compute the
value of some other enum? At what value do you resume numbering the
enumerators listed after an impossible one? I see no easy, obvious, and
unsurprising answers here. I don't want to have to think this much about
something that should be *extremely* basic.

If the target-independent optimizer is being bloated by code to handle
optimizing target-specific intrinsics, we should look at factoring such
code into an IR-level transform library in the target that can be inserted
into the optimizer pass pipeline like anything else. But I have my doubts
that this is the cause.

If the target-specific intrinsics are carrying a huge amount of code in the
core IR library in order to support parsing, serializing, printing, etc.,
we should really find a more compact way of doing this. I cannot fathom how
this could require 500kb of data (or text) to handle. We're doing something
deeply wrong with intrinsics, and I think it is a mistake to pursue
anything until we understand that.

The vast majority of the space savings comes from reduced code size (not constant data), which was somewhat surprising to me at first, but in retrospect makes a lot of sense. In particular, the TableGen-generated code that parses intrinsic names to resolve strings->ids is pretty massive.

This could be unaffected: my clang changes only have an effect when impossible enums are used in conditional statements and switch statements.

That aside, in our specific use I’d be pretty terrified if we’re using the value of intrinsic ids to compute other intrinsic ids, because I don’t believe we have any guarantee of the ordering or values of intrinsic ids. From looking at the TableGen backend, it looks to me like the id is really just the index into the records list, which is constructed in order of parsing, and those indexes are used all over the place and assumed to match.

The way I handled numbering in my patch was to make the impossible enums immediately follow the last possible enum, and it all worked fine.
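Concretely, the generated enum would end up shaped something like this (values are illustrative; clang::impossible_enum is the attribute name from the patch under review):

   enum ID : unsigned {
     not_intrinsic = 0,
     memcpy,          // 1
     memset,          // 2: last "possible" intrinsic
     // Impossible enumerators immediately follow, so numbering simply
     // continues and the possible values stay dense:
     x86_sse2_pause [[clang::impossible_enum]], // 3
     arm_neon_vld1 [[clang::impossible_enum]],  // 4
     num_intrinsics                             // 5
   };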

I understand all your points here. Independent of whether or not this is the right approach for handling intrinsics, I’m going to clean up and propose my patches to clang to add the attribute, because it does seem useful to me.

-Chris

More diffs to enjoy.

I’ve updated my tablegen patches to work with the clang::impossible_enum
attribute. With these changes I am seeing the same code size reductions
that the #ifdefs were showing, without all the ifdefs.

That’s about a 2% decrease in our library for WebKit.

See attached patches for llvm & clang.

In particular the TableGen-generated code that parses intrinsic names to
resolve strings->ids is pretty massive.

Without having looked at the code in question, could we maybe just have a
binary search on a table? (possibly an existing table?)

-- Sean Silva

A binary search on a table would be possible, but we’d need to generate a table that was sorted. Today the order of intrinsics is really just the order that TableGen parses them in.
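For illustration, the lookup side could be as small as this (a sketch; the table contents are made up, and the real table would have to be emitted pre-sorted by TableGen):

   #include <algorithm>
   #include <cstring>
   #include <iterator>

   struct IntrinsicEntry {
     const char *Name;
     unsigned ID;
   };

   // Must be emitted sorted by Name for the binary search to be valid.
   static const IntrinsicEntry IntrinsicTable[] = {
       {"llvm.memcpy", 1}, {"llvm.memmove", 2}, {"llvm.memset", 3},
   };

   unsigned lookupIntrinsicID(const char *Name) {
     const IntrinsicEntry *End = std::end(IntrinsicTable);
     const IntrinsicEntry *I =
         std::lower_bound(std::begin(IntrinsicTable), End, Name,
                          [](const IntrinsicEntry &E, const char *N) {
                            return std::strcmp(E.Name, N) < 0;
                          });
     if (I != End && std::strcmp(I->Name, Name) == 0)
       return I->ID;
     return 0; // not_intrinsic
   }

This replaces the giant generated switch with O(log n) string comparisons over data that is mostly shareable and mergeable by the linker.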

-Chris

Circling back to Chandler on file size differences. Here are the highlights of what is different.

For my analysis I built LLVM and Clang using a clang built with my patches. The same clang was used for the baseline and the stripped build. I used the following CMake command:

cmake -G "Sublime Text 2 - Ninja" -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_LLVM_DYLIB=Yes -DLLVM_DISABLE_LLVM_DYLIB_ATEXIT=On -DCMAKE_INSTALL_PREFIX=/Users/cbieneman/dev/llvm-install/ -DLLVM_TARGETS_TO_BUILD="AArch64;ARM;X86" ../llvm

and built using Ninja.

I created a fresh build directory and built once as a baseline from the following revisions:

LLVM - ba05946
lld - 33bd1dc
clang - 1589d90 (With my patches applied)

I then applied my tablegen and CMake patches, made a new build directory, and built a second time. I then compared the file sizes between the two directories by diffing the output of:

find . -type f -exec stat -f '%N %z' '{}' + | sort

The biggest benefit is an 11% reduction in size for libLLVMCore, which is mostly due to Function.cpp.o shrinking by 300KB (almost 39%). The biggest thing in there that contributes to actual code size is the almost 28,000-line switch statement that implements Function::lookupIntrinsicID.

I’ve attached a CSV file with the file size highlights for anyone who is interested.

-Chris

Intrinsic Stripping.csv (701 Bytes)

That makes sense. It sounds like there is a better design here: we should move to a model where intrinsic tables are registered by any targets that are activated. That would allow the intrinsic tables (including these switch/lookup mapping tables) to be in the target that uses them.

It should be straight-forward to have something like LLVMInitializeX86Target/RegisterTargetMachine install the intrinsics into a registry.
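To make the registration half concrete, a registry might look something like this (a rough sketch; every name here is invented, and none of it is existing LLVM API):

   #include <map>
   #include <string>

   // Per-target intrinsic table that TableGen would emit into each target.
   struct TargetIntrinsicTable {
     const char *Prefix; // e.g. "llvm.x86."
     // ... name table, attributes, type signatures ...
   };

   class IntrinsicRegistry {
     std::map<std::string, const TargetIntrinsicTable *> Tables;

   public:
     static IntrinsicRegistry &get() {
       static IntrinsicRegistry R;
       return R;
     }

     void registerTable(const TargetIntrinsicTable *T) {
       Tables[T->Prefix] = T;
     }

     // Name lookups would dispatch on the "llvm.<target>." prefix to the
     // owning table, so only linked-in targets pay for their intrinsics.
   };

   // Called from LLVMInitializeX86Target():
   //   IntrinsicRegistry::get().registerTable(&X86Intrinsics);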

-Chris

I tried doing that a few years ago. It’s not nearly as easy as it sounds because we’ve got hardcoded references to various target intrinsics scattered throughout the code.

I think that’s where we work on that abstraction for the middle end so we don’t need to worry about it.

I was just writing to say exactly this. There are a number of ways we could work toward something like this. I’m completely in favor of a world where intrinsics are properties of the targets and don’t leak out; however, today they do in a lot of places.

Some of these places could be replaced with subtarget hooks with very little issue, and we could certainly have target initialization register intrinsics.

One cost of having intrinsics live in the targets is that we actually get some pretty substantial savings today in our constant data size by having all the intrinsics generated together. Today Intrinsic IDs are used as indices to tables that map to other characteristics, and there is uniquing performed on the tables to reduce their size.

One possible solution would be to only have the intrinsic enum remain part of the IR library, but have everything else get broken up by target. We could make this work by having Tablegen specify enum values using the high-order bits to signify which target, and the low-order bits to signify the index into that target’s intrinsic tables. This would probably get us a big chunk of the overall code size savings.
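The split-ID scheme might be sketched like this (bit widths and target numbering invented for illustration):

   #include <cstdint>

   enum class TargetKind : uint32_t { Independent = 0, X86 = 1, ARM = 2, AArch64 = 3 };

   // High 8 bits name the target, low 24 bits index into that target's
   // own intrinsic tables.
   constexpr unsigned TargetShift = 24;

   constexpr uint32_t makeIntrinsicID(TargetKind T, uint32_t Index) {
     return (static_cast<uint32_t>(T) << TargetShift) | Index;
   }
   constexpr TargetKind intrinsicTarget(uint32_t ID) {
     return static_cast<TargetKind>(ID >> TargetShift);
   }
   constexpr uint32_t intrinsicIndex(uint32_t ID) {
     return ID & ((1u << TargetShift) - 1);
   }

With that encoding the IR library keeps only the enum, and every per-target table can live in, and be indexed by, the target that owns it.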

I also think that simply sorting the intrinsics by name and doing a binary search like Sean suggested should allow us to save a lot of code size, and probably make this faster too.

-Chris

What are the specific problems here? Anything that does an equality comparison with the IntrinsicID can be changed to do strcmp with the name. That would handle the one-off cases like InstCombiner::SimplifyDemandedUseBits in InstCombine.

The other cases in InstCombine could be handled similarly, but may be better served by adding intrinsic behavior query APIs to the intrinsic registry, or (eventually) by adding new attributes to the intrinsics themselves.

-Chris

I don’t think there’s anything fundamentally difficult, but it’s a big change. For example:

$ git grep Intrinsic:: | wc
    3364 12286 281078

The vast majority of those 3,364 lines have hardcoded references to specific intrinsics. Many of them are used in contexts where you can’t easily insert a strcmp (e.g., case values in large switch statements, or worse, the m_Intrinsic PatternMatch templates).

I had patches earlier on this thread that stripped the uses of intrinsics from LLVM using ifdefs.

These files have intrinsics as switch cases. I’ve listed which target the intrinsics belong to as well for reference.

lib/Analysis/BasicAliasAnalysis.cpp - ARM
lib/Analysis/ConstantFolding.cpp - X86
lib/Analysis/ValueTracking.cpp - X86
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp - X86
lib/Transforms/InstCombine/InstCombineCalls.cpp - PowerPC, X86, ARM, AArch64, and R600
lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp - X86
lib/Transforms/Instrumentation/MemorySanitizer.cpp - X86
lib/Transforms/Scalar/LoopStrengthReduce.cpp - X86

In InstCombineCalls some of the case bodies are shared by ARM and AArch64 intrinsics. In those cases there were also some if statements that matched on the intrinsic enums. As an example, see the case at InstCombineCalls.cpp lines 922-1028.

Other than that, lib/IR/AutoUpgrade.cpp has a bunch of string matched X86 Intrinsics being mapped to enums.

Clang also has a bunch of places that use target intrinsics that were not as easy to ifdef, so I didn’t try.

As Bob said, I don’t think this is particularly a difficult undertaking, but it will be a big change. I think a lot of the switch statements could be replaced by per-target hooks.

Personally I’d love to be in a world where intrinsics lived in the targets, but we won’t get there quickly. I think a stop-gap solution for our more immediate needs could be accomplished with Alex’s suggestion of a clang attribute. I’ve sent patches to clang (http://reviews.llvm.org/D6763). One of the attractions of that approach is that it would let us get what we need today without significant modifications to LLVM, and it would not get in the way of working towards the “Right™” solution.

-Chris

I don’t find this convincing. It should be simple to introduce a new m_Intrinsic PatternMatch template that takes a string. The switches are also straight-forward. In BasicAA, for example, instead of:

   switch (II->getIntrinsicID()) {
   default: break;
   case Intrinsic::memset:
   case Intrinsic::memcpy:
   case Intrinsic::memmove: {
   ...
   case Intrinsic::arm_neon_vld1: {
     assert(ArgIdx == 0 && "Invalid argument index");
     assert(Loc.Ptr == II->getArgOperand(ArgIdx) &&
            "Intrinsic location pointer not argument?");
     // LLVM's vld1 and vst1 intrinsics currently only support a single
     // vector register.
     if (DL)
       Loc.Size = DL->getTypeStoreSize(II->getType());
     break;
   }
   }

Just change this to:

   switch (II->getIntrinsicID()) {
   case Intrinsic::memset:
   case Intrinsic::memcpy:
   case Intrinsic::memmove: {
   ...
   default:
     if (II->getName() == "llvm.arm.neon.vld1") {
       // same stuff as above
     }
     break;
   }
Spreading ifdefs through the codebase would be really unfortunate.

-Chris

I agree. I wasn’t at all trying to discourage anyone from doing as you suggest — just pointing out that it’s not a small project. Much of it would be mechanical, though, and it does seem like the right long-term solution.

If we're going to talk about what the right long-term design is, let me put
out a different opinion. I used to be somewhat torn on this issue, but
between this discussion and looking at the particular intrinsics in
question, I'm rapidly being persuaded.

We shouldn't have any target specific intrinsics. At the very least, we
shouldn't use them anywhere in the front- or middle-end, even if we have
them.

Today, frontends need to emit specific target intrinsics *and* have the
optimizer be aware of them. I can see a few reasons why:

1) Missing semantics -- the IR may not have *quite* the semantics desired
and provided by the target's ISA.
2) Historical expectations -- the GCC-compatible builtins are named after
the instructions, and the target independent builtins lower to intrinsics
so the target-specific ones should too.
3) Poor instruction selection -- we could emit the logic as boring IR, but
we fail to instruction select that well, so as a hack we emit the
instruction directly and teach the optimizer to still optimize through it.

If we want to pursue the *right* design, I think we should be fixing these
three issues and then we won't need the optimizer to be aware of any of
this.

Fixing #1) We just need to flush these out and define target independent
intrinsics. The number of these is relatively small, and not likely to be a
burden. I'm looking at you, cvtss2si.

Fixing #2) I think this just requires a giant pile of work to re-target
builtins and other things that people expect the optimizer to operate on
into proper IR. Yes, it's hard, but it is the right design. With x86 we're
already about 90% of the way there. My understanding is that ARM has gone
the other direction. While unpleasant, I think we have to make a choice:
either the builtin doesn't get optimized or it gets lowered to "normal" IR.

Fixing #3) Yes, we need better instruction selection. We have always needed
better instruction selection. If there are truly critical needs that aren't
met, we can even achieve this through much more localized hacks *in the
backend* that form specific intrinsics to "pre select" instructions that we
know are important. While just as hacky and gross, this seems like a much
better tradeoff to make than the current approach.

In the end, we don't have target-specific intrinsics in the middle end at
all. Then we get to decide on whether it is worth our time to provide a
really efficient target-specific *encoding* of these operations, or if we
should just lower to the target specific encoding we already have: inline
assembly. I don't have strong feelings about this either way.

OK, so I think that is the right design, but I also completely agree with
Chris B here -- it is *way more work* than is really called for to address
a very narrow, targeted problem that WebKit has: binary size.

But I also agree with Chris L's point here:

Spreading ifdefs through the codebase would be really unfortunate.

I just don't think this is necessary. And I don't think relying on a
fragile frontend-based "optimization" to achieve the same effects is really
great either.

The huge code in IR/Function.cpp.o is from *terrible* code being emitted by
tablegen. We should just change tablegen to emit something smaller. This
will get most of the size win, and will get it even when the target *is*
compiled in!

In particular, if we switch to a sorted table of intrinsic names, change
the *lookup* of names to use the same sorted table so we share the memory,
and do a binary search, I think the code size will be essentially free.

Once we address that, Chris B can use the somewhat ugly #ifdefs or his
clang attribute patch to look for other hotspots where we burn size on
intrinsics. I bet a few more could be fixed by similar effort.

If the code size is *still* a problem, we could start factoring things
incrementally into target independent intrinsics and removing the target
specific ones, which might help incrementally move us toward the "right
design" (IMO).

Then if someone is really motivated (which would be cool) to do all the
hard work to really move us over to the right design, we would have lost
nothing, there would be essentially no more technical debt than there is
today, and WebKit and others won't have to wait for their binary size
improvements.

-Chandler