Lambdas and the ABI?

Hi Richard, Chandler, John, et al.,

"Quick" question: What aspects, if any, of a C++ lambda (e.g. size, layout, alignment) leak into the ABI or are (potentially) semantically visible?

Context...

We're investigating late lowering/outlining schemes to improve code generation for OpenMP and other parallel programming models. One important advantage of outlining late is that the IR-level optimizer still has access to the pointer-aliasing and loop information from the containing function. There are natural benefits to C++ lambdas as well. Lambdas that are used "in place" (i.e. only have one call site) should always be inlined, so the issues don't come up there, but for lambdas that have multiple call sites, or worse, escape (via std::function or some other type-erasure mechanism), we can get suboptimal optimization results for the body of the lambda. It would seem sad to fix this for OpenMP but not for lambdas.

However, optimizing before outlining could mean changes in what variables need to be captured (not semantically, but at the IR level), and so I'd like to think through what would constrain the optimizer's freedom to act in this regard.

John, I'm curious about how this might apply to Objective C blocks as well.

Thanks again,

Hal

This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

– HT

Very interesting, thanks! So the conclusion was that "The ABI will need to specify the layout of closure types with weak linkage." In some cases (enumerated here: ) the layout must be fixed (because it must be consistent across translation units). I imagine that size/alignment might also need to be fixed in cases where the source program explicitly depends on them (e.g. by doing sizeof(lambda) or new auto(lambda)). -Hal
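To make the "explicitly depends on them" part concrete, here is a minimal sketch (the function name is illustrative); both sizeof and new auto make ABI properties of the closure type observable to the program:

```cpp
#include <cstdio>

int observe() {
  int x = 1, y = 2;
  auto lam = [x, y] { return x + y; };
  // Both of these expose ABI properties of the closure type: sizeof(lam)
  // is its size, and new auto(lam) heap-allocates an object of the closure
  // type using its alignment.
  auto *p = new auto(lam);
  int r = (*p)();
  std::printf("sizeof(closure) = %zu, result = %d\n", sizeof(lam), r);
  delete p;
  return r;
}
```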

Are you basing that on what the document does or the discussion in the thread? Because I don’t really see a conclusion in the thread. I see John pointing out three options and suggesting one, which is not what you claim is the conclusion here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

I don’t see any real disagreement but I also don’t see any real discussion past that email and brief follow-ups…


I expect John's #1 is most likely to be the eventual outcome; #2 and #3 are
too restrictive. Plus the language is moving in a direction that makes the
capture set much easier to reliably compute, so the cost of #1 is going to
become much more like "don't do those optimisations you weren't really
doing much anyway" and much less "accurately implement these arcane and
formally unobservable rules or you violate the ABI".

I don’t think that’s necessary for lambda types that do not escape a single TU. But obviously a portable program should avoid depending on the size or alignment of a closure type.

Hi Richard, Chandler, John, et al.,

"Quick" question: What aspects, if any, of a C++ lambda (e.g. size, layout, alignment) leak into the ABI or are (potentially) semantically visible?

C++ lambdas are anonymous local types that are constrained to appear in function bodies (including implicit "function" bodies, e.g. the initializers of global variables). The main semantic rule for local types like lambdas is that they're supposed to be the "same type" in all translation units that see the same function body, meaning (among many other things) that the ABI properties of the type must be the same. Now, most function bodies can appear in only a single translation unit, so this rule is trivially satisfied and the compiler has total flexibility. But the body of an inline or template function can usually appear in multiple translation units, and so the ABI has to be consistent.
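A minimal sketch of the cross-TU case (names are illustrative): an inline function's closure type is nameable in every TU that includes the header, so its size, layout, and alignment must agree everywhere.

```cpp
// Imagine this lives in a header included by many translation units: its
// body, and therefore its lambda's closure type, appears in each of them,
// and the closure's ABI properties must agree across all of them.
inline auto make_counter(int start) {
  return [start]() mutable { return start++; };
}

// Any TU that sees the header can even name the closure type:
using Counter = decltype(make_counter(0));

int demo() {
  Counter c = make_counter(10);
  int a = c();          // 10
  int b = c();          // 11: the by-value capture is mutated in place
  return a * 100 + b;   // 1011
}
```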

Context...

We're investigating late lowering/outlining schemes to improve code generation for OpenMP and other parallel programming models. One important advantage of outlining late is that the IR-level optimizer still has access to the pointer-aliasing and loop information from the containing function. There are natural benefits to C++ lambdas as well. Lambdas that are used "in place" (i.e. only have one call site) should always be inlined, so the issues don't come up there, but for lambdas that have multiple call sites, or worse, escape (via std::function or some other type-erasure mechanism), we can get suboptimal optimization results for the body of the lambda. It would seem sad to fix this for OpenMP but not for lambdas.

However, optimizing before outlining could mean changes in what variables need to be captured (not semantically, but at the IR level), and so I'd like to think through what would constrain the optimizer's freedom to act in this regard.

"Subfunction" representations are very useful for coroutine-esque situations where an inner function's control flow interactions with the outer function are essentially well-understood. That is, to be useful, you really want the blocks of the inner function to fit into the CFG of the outer function at a unique and appropriate place; otherwise you're just contorting the representation of the outer function in a way that will probably massively hamper optimization.

I can definitely see how there might be many situations in OpenMP that have this kind of CFG knowledge. Swift has some situations like this as well. Unfortunately, C++ lambdas and ObjC blocks are not a case where we have this knowledge, because the closure can be passed around and invoked arbitrarily many times and at arbitrary later points, which is not something that can ever fit cleanly into the CFG; pre-emptive "outlining" is just a superior representation for such cases.

John.

The follow-up message is not linked properly in the web archive. Hubert wrote: "CWG has taken the position that issue CA 2 on this topic in N3771 is NAD. The ABI will need to specify the layout of closure types with weak linkage." -Hal


This is a very good point. We only know that the lambda will execute, and perhaps execute multiple times, after some given point. The same is true (to some extent) for some OpenMP features. For example, we can generate OpenMP tasks in a function, and we know only that they'll run sometime before the end of the containing parallel region, taskwait directive, etc. This might be outside the scope of the function.

One interesting question is: To what extent does this affect our ability to use analysis (AA, SCEV, etc.) of the variables at the point of capture to optimize the body of the "subfunction"? The aliasing properties of pointers, for example, and the bounds of captured induction variables, should be fine. However, if there's anything that might be rendered invalid by some later change in state (e.g. a write to memory), that wouldn't be usable. We also need to make sure that pointers appropriately appear to be captured (just as they would be if we used an outlined function representation). Luckily, I suppose, we need to do that anyway.

Thanks again,
Hal

Right. I would say that there are three interesting levels of closure-capture optimization:

1. We don't know when the inner function will be called; the outer function's frame may not still be intact. This is generally true of blocks and lambdas because they can be copied off the stack. If the block/lambda captures a reference to something in the outer function, U.B. might let us assume that the frame still exists, but I think only if the block/lambda actually uses that captured reference. Richard, let us know if there's some special rule that would help us here.

It's hard to see what optimizations would be possible here. Maybe, if we controlled the copying process, we could delay copying captured values into the block/lambda temporary, and instead just access them directly in the inner function? But the inner function would have to be compiled two ways, one for when the variables have been copied and one for when they're still in place. This is a lot of extra complexity.

2. We don't know exactly when or how often the inner function will be called, but there's a definite point in the outer function past which the inner function will no longer be used. This is true of some OpenMP features, "noescape" blocks in ObjC, and noescape blocks/closures in Swift.

We can optimize this by eliminating copies and indirections: instead of aggressively copying values and local addresses into the block/lambda, we can just store the enclosing frame pointer. If a capture is semantically "by reference", i.e. changes to it in the inner function are reflected in the outer and vice-versa, then we have to temporarily pin it into memory and treat its address as having escaped; we can then just access it relative to the captured frame pointer. If a capture is semantically "by value", i.e. it copies the value of a variable at a specific point, then we just have to keep that current value live for a while in the outer frame; in many cases, we can prove that neither function will modify it, so this can safely degrade to the by-reference implementation.

3. We know exactly when the inner function will be called and it can be fit into the outer function's CFG. This is true of some OpenMP features and some very specific Swift features.

We can still do normal data-flow and control-flow analyses between the outer and inner functions; the only constraint is that we (probably) have to treat the beginning and end of the inner function as opaque barriers. So the unlowered function looks something like:

  %initial_result = call %initial_result_t @begin_subfunction(%initial_args_t %initial_args)
  ; subfunction body goes here
  %final_result = call %final_result_t @end_subfunction(%final_args_t %final_args)

and eventually that gets lowered into something like:

  %final_result = call %final_result_t @do_with_subfunction(%initial_args_t %initial_args, @subfunction)

  define %final_args_t @subfunction(%initial_result_t %initial_result) {
    ; subfunction body goes here
    ret %final_args_t %final_args
  }

with data flow in and out handled basically like in case #2. This is one class of thing that people want to be able to do with the coroutine proposal.
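The level-2 scheme above can be sketched in source form (illustrative names only, not an actual lowering): when the closure provably does not outlive a region of the outer function, its captures can collapse to a single frame pointer.

```cpp
// Stand-in for the outer function's frame: by-reference captures are
// pinned into it, by-value captures are just kept live in it.
struct Frame {
  int by_ref;   // semantically by-reference: writes must be visible outside
  int by_val;   // semantically by-value: current value kept live in the frame
};

// The "outlined" inner function accesses everything relative to the frame.
static int inner(Frame *fp) {
  fp->by_ref += 1;      // reflected in the outer function, as required
  return fp->by_val;    // reads the still-live by-value capture
}

int outer() {
  Frame f{0, 10};
  int a = inner(&f);    // the closure is only invoked before this point...
  int b = inner(&f);
  // ...so handing out &f instead of copying captures is safe here.
  return a + b + f.by_ref;   // 10 + 10 + 2
}
```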

John.

I'd also throw in that our implementation of MSVC EH is a good example of
this type of inner function outlining in LLVM. The token-producing block
leader and terminator instructions create single-entry single-exit regions
that we can outline into funclets as part of CodeGen.

By keeping the funclets in the normal CFG of the parent function, we are
able to SROA aggregates with destructors and do things like heap-to-stack
conversion of std::unique_ptr.

#include <memory>
void may_throw(int);
void foo() {
  // If may_throw throws, ~unique_ptr runs in an EH funclet. Keeping that
  // funclet inside the parent's CFG is what lets heap-to-stack conversion
  // (and SROA of aggregates with destructors) fire on p.
  std::unique_ptr<int> p(new int(42));
  may_throw(*p);
}

Getting back to Hal's original question, I think that these kinds of
optimizations are impossible for C++ lambdas. They frequently appear in
headers and it's really easy to leak the class type with decltype and auto,
and that class definition had better be the same across all TUs.

Very good point.

It’s okay for optimizations to only apply to “special cases”, especially when those special cases are likely to be dominant in practice. Optimizing lambdas would absolutely be worthwhile even if we could not apply the optimizations to lambdas in inline functions.

John.


Any use of a non-reference captured by reference would require the captured
variable to still exist, and so we can in principle use a pointer to the
outer function's stack frame for those captures (but not by-copy captures
nor reference captures of references). But we can't assume that the outer
function's stack frame will be active when such a lambda is called, only
when such a capture is used.


One possibly-interesting optimization would be constant propagation of
captures: we statically know at the point where we emit the construction of
the lambda and its body what all the uses of each capture are, so this
isn't the normal somewhat-unbounded problem of constant propagation through
class members.
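A sketch of what that could buy (illustrative): every use of the capture is visible at the point the lambda is formed, so a provably constant capture could be folded into the body and dropped from the closure entirely.

```cpp
auto make_scaler() {
  const int k = 42;   // provably constant at the point of capture
  // All uses of `k` in the body are statically known here, so a compiler
  // could in principle rewrite the body to `x * 42` and shrink the closure
  // to empty, without the usual whole-program analysis class members need.
  return [k](int x) { return x * k; };
}
```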

Right, that’s what I was afraid of. That does add an extra obstacle for optimizing lambda representations; I wonder how much of a problem it would be in practice.

Yeah. There are also a number of possible minor optimizations where one value is known to be re-derivable from another.

John.