calling conventions and inlining

Since I've just seen that there are some things going on w.r.t. the long-needed implementation of calling conventions, may I also ask whether it's possible to address inlining at the same time (i.e. attributes always_inline and noinline, though maybe LLVM wants something finer-grained here)?

They really are different issues. Inlining hints are just hints for the optimizer, whereas calling convention changes are required for correctness in some circumstances (e.g. to get proper tail calls).
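
For concreteness, this is the GCC-style source-level spelling of the hints being asked about (illustrative only; how, or whether, these would be represented in LLVM is exactly the open question in this thread):

    // GCC/Clang C++ extension syntax for the two hints under discussion.
    // always_inline: inline this call even against the usual heuristics.
    __attribute__((always_inline)) inline int add(int a, int b) {
      return a + b;
    }

    // noinline: keep this function out of line no matter what the heuristics say.
    __attribute__((noinline)) int rare_error_path(int code) {
      return -code;
    }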

I'd be willing to put some work into this, but I need the help/pre-work of an expert for the actual bytecode and core class modifications.

I'm not sure we want to go this route. Are there cases where the inliner is doing the wrong thing? LLVM has traditionally avoided annotations for optimization hints like this, and at least until there is some substantial policy change, I still think that's the right way to go.

-Chris

Chris Lattner wrote:

Since I've just seen that there are some things going on w.r.t. the long-needed implementation of calling conventions, may I also ask whether it's possible to address inlining at the same time (i.e. attributes always_inline and noinline, though maybe LLVM wants something finer-grained here)?

They really are different issues. Inlining hints are just hints for the optimizer, whereas calling convention changes are required for correctness in some circumstances (e.g. to get proper tail calls).

Well, strictly academically speaking, inlining is just an optimizer hint, but in practice it also somewhat affects the "semantics" by changing code cache utilization (which is becoming more and more important these days).

I don't think it's academic at all. It *is* an optimization hint.

I'd be willing to put some work into this, but I need the help/pre-work of an expert for the actual bytecode and core class modifications.

I'm not sure we want to go this route. Are there cases where the inliner is doing the wrong thing? LLVM has traditionally avoided annotations for optimization hints like this, and at least until there is some substantial policy change, I still think that's the right way to go.

At least something for disabling inlining (an attribute noinline) is needed, and I had a feeling that such a change would go along nicely with the current calling convention modifications (even though they are completely different things).

I agree that noinline is important in some cases. I think it would be very reasonable to teach the inliner to not inline functions that use the "coldcc" calling convention. If you want to make this change, I would definitely accept it. I don't think that there is any need to add an explicit "do not inline this" attribute though.
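
A minimal sketch of the check Chris suggests, written against the modern LLVM C++ API (the helper name is hypothetical and the class layout of that era differed, so treat this as an illustration rather than the actual inliner code):

    #include "llvm/IR/CallingConv.h"
    #include "llvm/IR/Function.h"

    // Hypothetical helper: refuse to inline callees marked coldcc, treating
    // the cold calling convention as an implicit "do not inline" request.
    static bool mayInlineCallee(const llvm::Function &Callee) {
      if (Callee.getCallingConv() == llvm::CallingConv::Cold)
        return false;
      return true;  // otherwise defer to the usual inline cost heuristics
    }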

-Chris

There is one case where inlining/not-inlining affects correctness. A function which uses alloca() will behave differently in the two cases. You can argue one shouldn't write code like this, but it is legal.
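
A small made-up example of the hazard being described: memory obtained from alloca() lives until the function that called alloca returns, so inlining changes how long the allocation stays on the stack.

    #include <alloca.h>
    #include <cstring>

    static void process_row(const char *row, std::size_t n) {
      // Scratch space is released when process_row returns ...
      char *tmp = static_cast<char *>(alloca(n));
      std::memcpy(tmp, row, n);
      // ... so the loop below uses bounded stack.  If this call were inlined,
      // each iteration's alloca would live until process_all returned, and a
      // large 'rows' could overflow the stack.
    }

    void process_all(const char *data, std::size_t rows, std::size_t n) {
      for (std::size_t r = 0; r < rows; ++r)
        process_row(data + r * n, n);
    }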

Chris Lattner wrote:

There is one case where inlining/not-inlining affects correctness. A function which uses alloca() will behave differently in the two cases. You can argue one shouldn't write code like this, but it is legal.

The inliner doesn't inline functions that call alloca, or other cases that break correctness.

-Chris

Chris Lattner wrote:

I see that you are objecting to explicit inline control.

The main problem is that inlining is absolutely crucial for some "modern" programming styles. E.g. we use a huge collection of small C++ template classes and template metaclasses, most of which have very trivial and limited functionality (think of it as some "bytecode" expressed in classes). Of course, the method calls of these classes _must_ be inlined, but there are also "traditional" calls to other functions which may or may not be meant for inlining. If the inliner guesses just one of these calls wrong (and it usually does), performance will drop by an order of magnitude. That's why all C++ compilers I know of support explicit inline control.
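
A made-up miniature of that style, where a computation is encoded in tiny template classes whose trivial eval() calls only disappear if every one of them is inlined:

    // Hypothetical "bytecode expressed in classes": each class is one operation.
    template <int N> struct Const {
      static int eval() { return N; }
    };
    template <class L, class R> struct Add {
      static int eval() { return L::eval() + R::eval(); }
    };
    template <class L, class R> struct Mul {
      static int eval() { return L::eval() * R::eval(); }
    };

    // "(2 + 3) * 4" expressed as a type.  With every eval() inlined this folds
    // to the constant 20; if the inliner guesses wrong it becomes a chain of
    // tiny calls.
    int twenty() { return Mul<Add<Const<2>, Const<3> >, Const<4> >::eval(); }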

I understand where you are coming from. Here are the reasons that I think this is a bogus argument:

I. If the function is an obvious *must inline* choice, the compiler will
     trivially see this and inline it. LLVM in particular is very good at
     ripping apart layers of abstraction, and there are a couple of known
     ways to improve it further. This leaves the cases where you *don't*
     want to inline stuff and the cases where you *want* to inline something
     but it is not obvious.

II. For cases where you don't want it to get inlined, arrange for it to
     get marked coldcc and you're done.

III. The only remaining case is when you have a function call that is
     sufficiently "non-obvious" to inline but you want to force it to be
     inlined. Usually this is because you are tuning your application
     and notice that you get better performance by inlining.

I assume III is what you're really worried about, so I will explain why I think that a "force_inline" directive on functions is a bad idea.

1. First, this property is something that varies on a *CALL SITE* basis,
   not on a callee basis. If you really want this, you should be
   annotating call sites, not functions.
2. These annotations are non-portable across different compilers and even
   across different versions of the same compiler. For example, GCC does
   no IPO other than inlining, so forcing something to be inlined can make
   a huge difference there. LLVM, OTOH, does a large amount of IPO and IPA,
   such as dead argument elimination, IPSCCP, and other things. Something
   that is good to inline for GCC is not necessarily good for LLVM.
3. Once these annotations are added to a source base, they are almost
   never removed or reevaluated. This exacerbates #2.

In my mind, the right solution to this problem is to use profile-directed inlining. If you actually care this much about the performance of your code, you should be willing to use profile information. Profile information will help inlining, but it can also be used for far more than just that.

My personal opinion (which you may disagree with strongly!) is that the importance that many people place on "force this to be inlined" directives is largely based on experience with compilers that don't do any non-trivial IPO. If the compiler is doing link-time IPO, the function call boundary is much less of a big deal.

Finally, if you have a piece of code that the LLVM optimizer is doing a poor job on, please please please file a bug so we can fix it!!

-Chris



Actually I feel that the current state of the art in inlining is where register allocation was about 10 years ago. It's fine for most things, but back then I remember writing code like "register const char *p __asm__("%esi");", where just adding the explicit __asm__ boosted the performance of some compression routine by about 10% with gcc 2.6.3. Fortunately those times have passed for register allocation, but not yet for inlining.

You've just ignored all of the reasons I gave you above about why this is a bad idea.

Basically, things like __attribute__((__noinline__)) and __declspec(noinline) amount to the same thing: they may be unnecessary in ten years, but definitely not today. Just like in the past, you can easily boost performance by putting them in the right place, even if it may be necessary to surround them with #ifdef REGISTER_STARVED_CPU tests.
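
One common way to keep such annotations manageable, in the spirit of the #ifdef mentioned above, is a single wrapper macro (the name here is made up) that expands to whichever compiler-specific spelling is available, or to nothing:

    // Hypothetical portability macro for the noinline hint.
    #if defined(__GNUC__)
    #  define MY_NOINLINE __attribute__((__noinline__))
    #elif defined(_MSC_VER)
    #  define MY_NOINLINE __declspec(noinline)
    #else
    #  define MY_NOINLINE  /* compiler offers no such hint */
    #endif

    MY_NOINLINE void cold_error_path(const char *msg);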

I understand what you're saying. Again, read my last email for why I think this is a bad idea :)

Furthermore, things like link-time IPO are not that important in C++ template-style programs, where much of the whole "program" is available to the compiler during each compilation pass.

I disagree. You're basically saying that because the compiler has all of the stuff as inline functions, it can just inline it all instead of using IPO.

And profile-guided feedback optimizations are only just evolving (and IMHO currently still mostly a marketing issue).

Like I said: "If you really care about the performance of your code..."

Finally, if you have a piece of code that the LLVM optimizer is doing a poor job on, please please please file a bug so we can fix it!!

I'll always try my best to help improve LLVM :)

Thanks!!

Still, in the tests that are important to me, LLVM currently does not perform all that well overall,

Can you share those tests?

but looking at the disassembly suggests that this might mainly be an issue with the x86 codegen, which is rather young compared to that of other compilers.

If you're testing on X86, I would be strongly suspicious of the X86 backend, which can be improved in many ways. If you use PowerPC, for example, things are much better (in terms of codegen quality). Using the C backend should help a lot with this, though.

Before pointing the finger at the inliner, it would be good to understand what is going on in the testcase. Can you share or reduce the problem to a small testcase?

-Chris

This is where a register allocator that splits live ranges will be most beneficial. It would not be that hard to leverage the LiveInterval analysis to implement second-chance bin-packing (a linear scan variant that performs on-the-fly live range splitting), but unfortunately no one has time or interest to work on something like that at this time.
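
For readers who haven't seen it, here is a toy linear-scan sketch (made up for illustration; it is not the LLVM LiveInterval/allocator code) with a comment marking the spot where a second-chance bin-packing allocator would split the conflicting live range instead of spilling it wholesale:

    #include <algorithm>
    #include <vector>

    struct Interval { int start = 0, end = 0, reg = -1; bool spilled = false; };

    // Toy linear scan over pre-computed live intervals, with K >= 1 registers.
    void linearScan(std::vector<Interval> &ivs, int K) {
      std::sort(ivs.begin(), ivs.end(),
                [](const Interval &a, const Interval &b) { return a.start < b.start; });
      std::vector<Interval *> active;        // intervals currently holding a register
      std::vector<int> freeRegs;
      for (int r = 0; r < K; ++r) freeRegs.push_back(r);

      for (Interval &cur : ivs) {
        // Expire intervals that end before the current one starts.
        for (auto it = active.begin(); it != active.end();) {
          if ((*it)->end < cur.start) { freeRegs.push_back((*it)->reg); it = active.erase(it); }
          else ++it;
        }
        if (!freeRegs.empty()) {
          cur.reg = freeRegs.back(); freeRegs.pop_back();
          active.push_back(&cur);
          continue;
        }
        // No free register: classic linear scan spills whichever interval ends
        // last.  A second-chance bin-packing allocator would instead *split*
        // the losing live range here and let the pieces compete again later.
        auto victim = std::max_element(active.begin(), active.end(),
            [](const Interval *a, const Interval *b) { return a->end < b->end; });
        if ((*victim)->end > cur.end) {
          cur.reg = (*victim)->reg;
          (*victim)->reg = -1; (*victim)->spilled = true;
          *victim = &cur;                    // cur takes the victim's register
        } else {
          cur.spilled = true;
        }
      }
    }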

Chris Lattner wrote:

but looking at the disassembly suggests that this might mainly be an issue with the x86 codegen, which is rather young compared to that of other compilers.

If you're testing on X86, I would be strongly suspicious of the X86 backend,

I'll try to come up with some simple X86 example that shows some problems. I haven't looked at the PPC code, though.

My experience with performance optimization has shown me that you shouldn't make assumptions, even based on extensive experience. The only way to track down performance issues is to be analytical and thorough. I look forward to your example.

You've just ignored all of the reasons I gave you above about why this is a bad idea.

Well, that's probably because I have to care about fast code, while you care about a compiler framework that produces fast code. Unfortunately these are sometimes quite different goals ;)

See above. Speculating that the inliner is the cause of all of the problems does not seem either analytical or thorough, regardless of goals.

Before pointing the finger at the inliner, it would be good to understand what is going on in the testcase. Can you share or reduce the problem to a small testcase?

Put simply, the inliner is too greedy, and nice little leaf functions suddenly run out of CPU registers. Even gcc 3.4 with -funit-at-a-time started inlining too much, but at least I can tell gcc where to stop. This whole noinline issue may be somewhat X86-specific, though.

I can believe that this is an issue, but (as Alkis pointed out) this is a deficiency in the *register allocator*, not the inliner. The problem should be fixed there.

-Chris