RFC: PerfGuide for frontend authors

I'd like to propose that we create a new Performance Guide document. The target audience of this document will be frontend authors, not necessarily LLVM contributors. The content will be a collection of items a frontend author might want to know about how to generate LLVM IR that will optimize well.

Some ideas on topics that might be worthwhile (a couple of these are sketched in IR right after the list):
- Prefer sext over zext when the value is known to be positive in the source language (e.g. a range-checked index on a GEP)
- Avoid loading and storing first class aggregates (they're not well supported in the optimizer)
- Mark invariant locations - i.e. link to !invariant.load and TBAA constant flags
- Use globals, not inttoptr, for runtime structures - this gives you dereferenceability information
- Use function attributes where possible (nonnull, dereferenceable, etc.)
- Be wary of ordered and atomic memory operations (they're not well optimized); depending on the source language, it might be faster to use fences
- Range checks - make sure you test with the IRCE pass
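
To give a flavor of what a couple of these might look like, here's a rough hand-written sketch (untested, made-up names, pre-3.7 textual syntax) illustrating the sext-for-GEP-indices, invariant-location, and attribute items above:

  ; %idx has already been range checked in the source language, so it is
  ; known non-negative; sext lets it fold into the gep addressing.
  define i32 @read_field(i32* nonnull dereferenceable(64) %obj, i32 %idx) {
  entry:
    %idx.ext = sext i32 %idx to i64
    %slot = getelementptr inbounds i32* %obj, i64 %idx.ext
    ; the field never changes once the object is visible to this code
    %val = load i32* %slot, !invariant.load !0
    ret i32 %val
  }

  !0 = !{}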

If folks are happy with the idea of having such a document, I volunteer to create version 0.1 with one or two items. After that, we can add to it as folks encounter ideas. The initial content will be fairly minimal; I just want a link I can send to folks in reviews to record comments made. :)

Philip

This would be great for me. I have had many questions about how optimizable different LLVM constructs are.

SGTM.

I like your idea of starting “perf tips” as a sort of isolated set of guidelines for better IRGen. That should allow some nice incremental growth of the documentation. I expect some of it will need a bit more discussion, so I would like this document to live in a directory, docs/FrontendInfo/ or something like that (bikeshed), so that we can easily split out new docs as needed for breathing room, or to give some structure for readers to follow.

Our optimizer and backends are our lifeblood, but frontends are our reason for existence. This kind of frontend-oriented documentation has been needed for a long time. Thanks for kicking it off!

– Sean Silva

I’m moving forward with this now given no one has raised objections.

Based on Sean’s comments, the naming I’m going to use is: FrontendInfo/PerfTips

I plan on committing an initial version based on what was discussed here without further review. I’m going to keep the first version short, and then open it for others to contribute.

Philip

On further thought, I don’t think the extra directory structure is warranted yet. We have several other frontend-relevant docs at the top level, so splitting off a directory without moving them would just make things more confusing.

Given that, the tentative name is docs/PerfTips.rst, and this will be going out as a code review to give time for others to comment on naming.

Philip

The review is up:

The first version of this document is now live:
http://llvm.org/docs/Frontend/PerformanceTips.html

Please feel free to add to it directly. Alternatively, feel free to reply to this thread with text describing an issue that should be documented. I'll make sure text gets turned into patches.

Philip

From: "Philip Reames" <listmail@philipreames.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Friday, February 27, 2015 5:34:36 PM
Subject: Re: [LLVMdev] RFC: PerfGuide for frontend authors

The first version of this document is now live:
http://llvm.org/docs/Frontend/PerformanceTips.html

Please feel free to add to it directly. Alternatively, feel free to
reply to this thread with text describing an issue that should be
documented. I'll make sure text gets turned into patches.

First, thanks for working on this! Some things (perhaps) worth mentioning:

1. Make sure that a DataLayout is provided (this will likely become required in the near future, but is certainly important for optimization).

2. Add nsw/nuw/fast-math flags as appropriate

3. Add noalias/align/dereferenceable/nonnull to function arguments and return values as appropriate

4. Mark functions as readnone/readonly/nounwind when known (especially for external functions)

5. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing analysis), prefer GEPs

6. Use the lifetime.start/lifetime.end and invariant.start/invariant.end intrinsics where possible

7. Use pointer aliasing metadata, especially tbaa metadata, to communicate otherwise-non-deducible pointer aliasing facts

8. Use the "most-private" possible linkage types for the functions being defined (private, internal or linkonce_odr preferably)
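
Roughly what several of these look like together, just as an untested illustration (the datalayout string, names, and attribute choices are made up for the example; the numbers refer to the items above):

  target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"   ; (1)
  target triple = "x86_64-unknown-linux-gnu"

  ; (3), (4), (8): argument/return attributes, function attributes, and the
  ; most private linkage that still works for the use case.
  define internal nonnull i8* @advance(i8* noalias nonnull align 8 dereferenceable(8) %p,
                                       i64 %n) readnone nounwind {
  entry:
    %n.inc = add nsw i64 %n, 1               ; (2) no-signed-wrap known from the source
    %q = getelementptr inbounds i8* %p, i64 %n.inc
    ret i8* %q
  }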

-Hal

I will second point #1 above. It bit me.

May I suggest that for each of these, something is said (ideally an example) of how to do these things using the API? It’s pretty straightforward when writing IR assembly, but not so obvious when you’re building up the IR using the various calls to methods defined in include/llvm/IR.

It might also be worthwhile to add information on how to actually invoke the optimizer and code generator programmatically. My code was based on the sources for llc in LLVM 3.2, but things have changed since then. If something as innocuous as forgetting to tweak something in a PassManager results in correct but suboptimal code generation, then that’s worth knowing. (Point #1 is a good example. I had provided a DataLayout to the TargetMachine at code generation time, but I did not add it to the Module. Everything ran fine, nothing asserted, but certain obvious optimizations just did not happen.)

Speaking of using TBAA metadata, I notice that LangRef documents the old format, but doesn't explain "struct-path aware TBAA format", making it a challenge to either know how to generate it, or to read metadata that Clang has generated. If someone knowledgeable could update LangRef (in addition to supplying advice for the PerfGuide), it would be appreciated.
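
For what it's worth, here is my best guess at the struct-path form, pieced together from looking at Clang output, so treat it as a sketch rather than a reference (the struct and names are invented):

  ; struct S { int a; float b; };  -- a load of s->b
  %val = load float* %b.ptr, !tbaa !5

  !0 = !{!"Simple C/C++ TBAA"}              ; root of the TBAA tree
  !1 = !{!"omnipotent char", !0, i64 0}     ; scalar type node: name, parent, offset
  !2 = !{!"int", !1, i64 0}
  !3 = !{!"float", !1, i64 0}
  !4 = !{!"S", !2, i64 0, !3, i64 4}        ; struct node: name, then (member type, offset) pairs
  !5 = !{!4, !3, i64 4}                     ; access tag: base type, access type, offset
                                            ; (an optional trailing i64 1 marks the location constant)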

-Peter-

From: "Philip Reames" <listmail@philipreames.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Friday, February 27, 2015 5:34:36 PM
Subject: Re: [LLVMdev] RFC: PerfGuide for frontend authors

The first version of this document is now live:
http://llvm.org/docs/Frontend/PerformanceTips.html

Please feel free to add to it directly. Alternatively, feel free to
reply to this thread with text describing an issue that should be
documented. I'll make sure text gets turned into patches.

First, thanks for working on this! Some things (perhaps) worth mentioning:

I'll add these Monday, but am not going to take the time to write much. Any expansion you (or anyone else) want to do would be welcome

1. Make sure that a DataLayout is provided (this will likely become required in the near future, but is certainly important for optimization).

2. Add nsw/nuw/fast-math flags as appropriate

3. Add noalias/align/dereferenceable/nonnull to function arguments and return values as appropriate

I was thinking of a more general item: use metadata and function attributes.

I don't want to end up duplicating content from the LangRef here. I was thinking that this page should cover the things you can't learn by just reading the LangRef.

4. Mark functions as readnone/readonly/nounwind when known (especially for external functions)

5. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing analysis), prefer GEPs

6. Use the lifetime.start/lifetime.end and invariant.start/invariant.end intrinsics where possible

Do you find these help in practice? The few experiments I ran were neutral at best and harmful in one or two cases. Do you have suggestions on how and when to use them?

I am using invariant.load, TBAA's is-constant flag, and a custom hook for zero-initialized memory from my allocation routines.

From: "Philip Reames" <listmail@philipreames.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Friday, February 27, 2015 11:25:12 PM
Subject: Re: [LLVMdev] RFC: PerfGuide for frontend authors

>
>> From: "Philip Reames" <listmail@philipreames.com>
>> To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
>> Sent: Friday, February 27, 2015 5:34:36 PM
>> Subject: Re: [LLVMdev] RFC: PerfGuide for frontend authors
>>
>> The first version of this document is now live:
>> http://llvm.org/docs/Frontend/PerformanceTips.html
>>
>> Please feel free to add to it directly. Alternatively, feel free
>> to
>> reply to this thread with text describing an issue that should be
>> documented. I'll make sure text gets turned into patches.
>
> First, thanks for working on this! Some things (perhaps) worth
> mentioning:
I'll add these Monday, but am not going to take the time to write
much. Any expansion you (or anyone else) want to do would be welcome

Thanks!

>
> 1. Make sure that a DataLayout is provided (this will likely become
> required in the near future, but is certainly important for
> optimization).
>
> 2. Add nsw/nuw/fast-math flags as appropriate
>
> 3. Add noalias/align/dereferenceable/nonnull to function arguments
> and return values as appropriate
I was thinking of a more general item: use metadata and function
attributes.

I don't want to end up duplicating content from the LangRef here. I
was thinking that this page should cover the things you can't learn
by just reading the LangRef.

I agree, I don't want to duplicate the LangRef, but I think that mentioning some of the more-important attributes and metadata for optimization is useful. Unless you read the LangRef quite carefully, and also understand what the optimizer does, it is easy to miss which things are relevant to optimizations. I'd mention them here, and let people look at the LangRef for the definitions.

>
> 4. Mark functions as readnone/readonly/nounwind when known
> (especially for external functions)
>
> 5. Use ptrtoint/inttoptr sparingly (they interfere with pointer
> aliasing analysis), prefer GEPs
>
> 6. Use the lifetime.start/lifetime.end and
> invariant.start/invariant.end intrinsics where possible
Do you find these help in practice? The few experiments I ran were
neutral at best and harmful in one or two cases. Do you have
suggestions on how and when to use them?

Good point, we should be more specific here. My, admittedly limited, experience with these is that they're most useful when their properties are not dynamic -- which perhaps means that they post-dominate the entry, and are applied to allocas in the entry block -- and the larger the objects in question, the more the potential stack-space savings, etc.
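
As a rough sketch of the kind of pattern I mean (untested, names invented): two entry-block allocas whose lifetime ranges don't overlap, so the backend's stack coloring is free to give them the same slot:

  declare void @llvm.lifetime.start(i64, i8* nocapture)
  declare void @llvm.lifetime.end(i64, i8* nocapture)
  declare void @use(i8*)

  define void @two_buffers() {
  entry:
    %a = alloca [4096 x i8], align 16
    %b = alloca [4096 x i8], align 16
    %a.i8 = bitcast [4096 x i8]* %a to i8*
    %b.i8 = bitcast [4096 x i8]* %b to i8*

    call void @llvm.lifetime.start(i64 4096, i8* %a.i8)
    call void @use(i8* %a.i8)
    call void @llvm.lifetime.end(i64 4096, i8* %a.i8)

    ; %b only becomes live after %a is dead, so the two 4K buffers
    ; need not both be allocated in the frame at once.
    call void @llvm.lifetime.start(i64 4096, i8* %b.i8)
    call void @use(i8* %b.i8)
    call void @llvm.lifetime.end(i64 4096, i8* %b.i8)
    ret void
  }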

I am using invariant.load, TBAA's is-constant flag, and a custom hook
for zero-initialized memory from my allocation routines.

We should discuss this custom hook -- perhaps we'd also benefit from something similar upstream (from calloc, etc.). [Different thread however, I suppose].

-Hal

Hi,

> From: "Philip Reames" <listmail@philipreames.com>

> > 6. Use the lifetime.start/lifetime.end and
> > invariant.start/invariant.end intrinsics where possible
> Do you find these help in practice? The few experiments I ran were
> neutral at best and harmful in one or two cases. Do you have
> suggestions on how and when to use them?

Good point, we should be more specific here. My, admittedly limited,
experience with these is that they're most useful when their
properties are not dynamic -- which perhaps means that they
post-dominate the entry, and are applied to allocas in the entry block
-- and the larger the objects in question, the more the potential
stack-space savings, etc.

My experience adding support for the lifetime intrinsics to the Rust
compiler is largely positive (because our code is very stack heavy at
the moment), but we still suffer from missed memcpy optimizations.
That happens because I made the lifetime regions as small as possible,
and sometimes an alloca starts its lifetime too late for the optimization
to happen. My new (but not yet implemented) approach is to "align" the
calls to lifetime.start for allocas with overlapping lifetimes unless
there's actually a possibility for stack slot sharing.

For example we currently translate:

    let a = [0; 1000000]; // Array of 1000000 zeros
    {
      let b = a;
    }
    let c = something;

to roughly this:

    lifetime.start(a)
    memset(a, 0, 1000000)
    
    lifetime.start(b)
    memcpy(b, a)
    lifetime.end(b)
    
    lifetime.start(c)
    lifetime.end(c)
    
    lifetime.end(a)

The lifetime.start call for "b" stops the call-slot optimization (I
think) from being applied. So instead this should be translated to
something like:

    lifetime.start(a)
    lifetime.start(b)
    memset(a, 0, 1000000)
    
    memcpy(b, a)
    lifetime.end(b)
    
    lifetime.start(c)
    lifetime.end(c)
    
    lifetime.end(a)

extending the lifetime of "b" because it overlaps with that of "a"
anyway. The lifetime of "c" still starts after the end of "b"'s lifetime
because there's actually a possibility for stack slot sharing.

Björn

I'd be interested in seeing the IR for this that you're currently generating. Unless I'm misreading your example, everything in this is completely dead. We should be able to reduce this to nothing, and if we can't, it's clearly a missed optimization. I'm particularly interested in how the difference in placement of the lifetime start for 'b' affects optimization. I really wouldn't expect that.

Philip

I should have clarified that that was a reduced, incomplete example, the
actual code looks like this (after optimizations):

  define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
  entry-block:
    %x = alloca [100000 x i32], align 4
    %1 = bitcast [100000 x i32]* %x to i8*
    %arg = alloca [100000 x i32], align 4
    call void @llvm.lifetime.start(i64 400000, i8* %1)
    call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
    %2 = bitcast [100000 x i32]* %arg to i8*
    call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
    call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
    call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
    call void @llvm.lifetime.end(i64 400000, i8* %2)
    call void @llvm.lifetime.end(i64 400000, i8* %1)
    ret void
  }

If the lifetime start for %arg is moved up, before the memset, the
call-slot optimization can take place and the %x alloca is eliminated,
but with the lifetime starting after the memset, that isn't possible.

Björn

This bit of IR actually seems pretty reasonable given the inline asm. The only thing I really see is that the memcpy could be a memset. Are you expecting something else?

Philip

[This time without dropping the list, sorry]

Hi,

From: "Philip Reames" <listmail@philipreames.com>

6. Use the lifetime.start/lifetime.end and
invariant.start/invariant.end intrinsics where possible

Do you find these help in practice? The few experiments I ran were
neutral at best and harmful in one or two cases. Do you have
suggestions on how and when to use them?

Good point, we should be more specific here. My, admittedly limited,
experience with these is that they're most useful when their
properties are not dynamic -- which perhaps means that they
post-dominate the entry, and are applied to allocas in the entry block
-- and the larger the objects in question, the more the potential
stack-space savings, etc.

my experience adding support for the lifetime intrinsics to the rust
compiler is largely positive (because our code is very stack heavy at
the moment), but we still suffer from missed memcpy optimizations.
That happens because I made the lifetime regions as small as possible,
and sometimes an alloca starts its lifetime too late for the optimization
to happen. My new (but not yet implemented) approach to to "align" the
calls to lifetime.start for allocas with overlapping lifetimes unless
there's actually a possibility for stack slot sharing.

For example we currently translate:

   let a = [0; 1000000]; // Array of 1000000 zeros
   {
     let b = a;
   }
   let c = something;

to roughly this:

   lifetime.start(a)
   memset(a, 0, 1000000)
   lifetime.start(b)
   memcpy(b, a)
   lifetime.end(b)
   lifetime.start(c)
   lifetime.end(c)
   lifetime.end(a)

The lifetime.start call for "b" stops the call-slot (I think)
optimization from being applied. So instead this should be translated to
something like:

   lifetime.start(a)
   lifetime.start(b)
   memset(a, 0, 1000000)
   memcpy(b, a)
   lifetime.end(b)
   lifetime.start(c)
   lifetime.end(c)
   lifetime.end(a)

extending the lifetime of "b" because it overlaps with that of "a"
anyway. The lifetime of "c" still starts after the end of "b"'s lifetime
because there's actually a possibility for stack slot sharing.

Björn

I'd be interested in seeing the IR for this that you're currently
generating. Unless I'm misreading your example, everything in this is
completely dead. We should be able to reduce this to nothing and if we
can't, it's clearly a missed optimization. I'm particularly interested in
how the difference in placement of the lifetime start for 'b' effects
optimization. I really wouldn't expect that.

I should have clarified that that was a reduced, incomplete example, the
actual code looks like this (after optimizations):

define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
entry-block:
   %x = alloca [100000 x i32], align 4
   %1 = bitcast [100000 x i32]* %x to i8*
   %arg = alloca [100000 x i32], align 4
   call void @llvm.lifetime.start(i64 400000, i8* %1)
   call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
   %2 = bitcast [100000 x i32]* %arg to i8*
   call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
   call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
   call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
   call void @llvm.lifetime.end(i64 400000, i8* %2)
   call void @llvm.lifetime.end(i64 400000, i8* %1)
   ret void
}

If the lifetime start for %arg is moved up, before the memset, the
callslot optimization can take place and the %c alloca is eliminated,
but with the lifetime starting after the memset, that isn't possible.

This bit of ir actually seems pretty reasonable given the inline asm. The only thing I really see is that the memcpy could be a memset. Are you expecting something else?

The only improvement to be had is that the memset should write
directly to %arg, and %x should then be removed because it is dead.
This happens when there are no lifetime intrinsics, or when the
call to lifetime.start is moved before the call to memset. The latter
is what my first mail was about: it is usually better to have
overlapping lifetimes all start at the same point, instead of starting
them as late as possible.
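
In other words, the result I'm hoping for is roughly the following (hand-edited from the IR above, so the details may be off):

  define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
  entry-block:
    %arg = alloca [100000 x i32], align 4
    %1 = bitcast [100000 x i32]* %arg to i8*
    call void @llvm.lifetime.start(i64 400000, i8* %1)
    ; the memset now initializes %arg directly; %x is gone
    call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
    call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
    call void @llvm.lifetime.end(i64 400000, i8* %1)
    ret void
  }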

Björn

Honestly, this sounds like a clear optimizer bug, not something a frontend should work around.

Can you file a bug with the four sets of IR (both schedules, and no intrinsics, before and after)? This should hopefully be easy to fix.

Do you know of other cases like this with the lifetime intrinsics?

Philip

Do we have a mechanism for specifying an address for a global? The places where I use inttoptr for runtime structures are all addresses outside of the JIT environment that I want to specify. Being able to create anonymous globals at a specified address would be very helpful.
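
To make the question concrete, this is roughly the pattern I have today versus what I'd like to express (the type, names, and address are invented for the example):

  %RuntimeState = type { i32, i8* }

  define i32 @read_flag() {
  entry:
    ; today: the runtime block lives at an address fixed outside the JIT,
    ; so I go through inttoptr and the optimizer learns nothing about
    ; dereferenceability or aliasing of the result.
    %state = inttoptr i64 274877906944 to %RuntimeState*
    %flag.ptr = getelementptr inbounds %RuntimeState* %state, i64 0, i32 0
    %flag = load i32* %flag.ptr
    ret i32 %flag
  }

  ; what I'd prefer: a global, which does carry dereferenceability and
  ; aliasing facts -- but then I have no way to say where it lives.
  @runtime_state = external global %RuntimeState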

David

+1 also interested in this answer.

-Josh