RFC: Converting byref captures into bycopy

Hi all,

I have been looking at how to convert by-reference captures into by-copy captures for captured statements, and possibly for C++ lambdas, and I am looking for feedback on my approach. The motivation is to avoid the extra loads that by-reference captures otherwise require inside the outlined function. This matters when the outlined function is the body of a loop and cannot be inlined, as with cilk_for or a library-based parallel for that takes a lambda.
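As a rough source-level illustration of the difference (not taken from the prototype), the sketch below uses a hypothetical for_each_index as a stand-in for a library-based parallel for, and assumes its body is not inlined. The function names and the serial loop are assumptions made only for this example.

```cpp
#include <cassert>

// Hypothetical stand-in for a library-based parallel for. It runs serially
// here; the point is only that the body is an opaque callable that is
// assumed not to be inlined into the loop.
template <typename Body>
void for_each_index(int n, Body body) {
  for (int i = 0; i < n; ++i)
    body(i);
}

// By-reference capture of `scale`: the closure stores the address of
// `scale`, so the body loads the address and then the value on each call.
long sum_scaled_byref(const int *data, int n, int scale) {
  long sum = 0;
  for_each_index(n, [&sum, &scale, data](int i) { sum += data[i] * scale; });
  return sum;
}

// By-copy capture of `scale`: the closure stores the value itself, which
// removes one level of indirection inside the body.
long sum_scaled_bycopy(const int *data, int n, int scale) {
  long sum = 0;
  for_each_index(n, [&sum, scale, data](int i) { sum += data[i] * scale; });
  return sum;
}
```

Both versions compute the same result; the proposed pass would in effect turn the first form into the second when it can prove that doing so is safe.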

I have been prototyping an LLVM IR pass that can move loads out of a captured statement body when possible. The approach:

Use named metadata to find the captured statement helpers (or lambda functions), as well as the “kind” of captured region they represent.

e.g.

!capturedstmt.helper = !{!0, !1}
!0 = metadata !{, metadata !"cilk_for"}
!1 = metadata !{, metadata !"default"}

For each field in the implicit capture-struct parameter, determine whether it can be “promoted” to a by-copy capture. This involves:

  1. checking the type of the field: only pointers to pointers or pointers to primitive types with size <= the original pointer can currently be promoted – this skips types that might require a copy constructor, and ensures that they are cheap to pass by value.

  2. looking at the uses of the field in the helper: if the field is used in any operation other than a load, then assume it cannot be promoted

  3. looking at the uses of the field in the call-site(s): if the pointer stored in the field may also be passed into the helper in another way, then it cannot be promoted. I have not implemented anything for this, but I imagine there are existing passes that would be useful here, such as alias analysis. For captured statements, there is only one call-site to worry about, but in lambdas all call-sites would need to be considered.
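The three checks above can be sketched as a single predicate. This is only a model under stated assumptions: CaptureField, UseKind, and can_promote are hypothetical names, not LLVM APIs, and check 3 is reduced to a boolean that a real pass would have to obtain from something like alias analysis.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one capture-struct field; none of these names come
// from LLVM or from the prototype pass.
enum class UseKind { Load, Store, Other };

struct CaptureField {
  bool pointee_is_pointer_or_primitive; // check 1: no class types
  unsigned pointee_size;                // size in bytes of the pointed-to type
  unsigned pointer_size;                // size in bytes of the original pointer
  std::vector<UseKind> helper_uses;     // check 2: uses inside the helper
  bool may_alias_at_call_site;          // check 3: conservative alias answer
};

bool can_promote(const CaptureField &f) {
  // 1. Only pointers or primitives no larger than the original pointer.
  if (!f.pointee_is_pointer_or_primitive || f.pointee_size > f.pointer_size)
    return false;
  // 2. Any use other than a load blocks promotion.
  for (UseKind u : f.helper_uses)
    if (u != UseKind::Load)
      return false;
  // 3. The pointer must not reach the helper by another route.
  return !f.may_alias_at_call_site;
}
```

Each check is conservative: a field is promoted only if all three pass, so an unknown answer for any of them should be treated as a failure.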

If any fields can be promoted, clone the original function with a new capture-struct parameter, e.g. {i32**, i32*} → {i32*, i32}, and replace the loads of each promoted field inside the outlined function with the value now stored in the struct. The call-site is then updated to call the new function and to load each promoted value before storing it into the new struct. These loads may be removed by later optimizations.

e.g.

%a = alloca i32
%context = alloca {i32*}
%field = getelementptr inbounds {i32*}* %context, i32 0, i32 0
store i32* %a, i32** %field
call void @__captured_stmt_helper({i32*}* %context)

define void @__captured_stmt_helper({i32*}* %context) {
  %field = getelementptr inbounds {i32*}* %context, i32 0, i32 0
  %load.field = load i32** %field
  %a = load i32* %load.field
}

Becomes something like

%a = alloca i32
%context = alloca {i32}
%field = getelementptr inbounds {i32}* %context, i32 0, i32 0
%load.a = load i32* %a
store i32 %load.a, i32* %field
call void @__captured_stmt_helper_new({i32}* %context)

define void @__captured_stmt_helper_new({i32}* %context) {
  %field = getelementptr inbounds {i32}* %context, i32 0, i32 0
  %a = load i32* %field
}
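In source-level terms, the IR rewrite above corresponds roughly to the following C++ sketch. The struct and function names are hypothetical and chosen only to mirror the IR. The last two helpers are an added illustration of why the call-site check matters: if the helper can also reach the captured variable through another pointer, the promoted copy no longer observes writes made through that pointer.

```cpp
#include <cassert>

// Original form: the context holds the address of `a` ({i32*} in the IR).
struct ContextByRef { int *a; };

int helper_byref(ContextByRef *context) {
  int *load_field = context->a; // first load: the captured pointer
  return *load_field;           // second load: the value of `a`
}

// Promoted form: the context holds the value itself ({i32} in the IR).
struct ContextByCopy { int a; };

int helper_bycopy(ContextByCopy *context) {
  return context->a; // a single load inside the helper
}

int call_site_byref(int a) {
  ContextByRef context{&a};
  return helper_byref(&context);
}

int call_site_bycopy(int a) {
  ContextByCopy context{a}; // the load of `a` now happens at the call-site
  return helper_bycopy(&context);
}

// Hazard: `other` may point at the captured variable. The by-reference
// helper sees the write, while the promoted copy keeps the old value.
int helper_byref_aliased(ContextByRef *context, int *other) {
  *other = 42;
  return *context->a;
}

int helper_bycopy_aliased(ContextByCopy *context, int *other) {
  *other = 42;
  return context->a;
}
```

With `a` initially 1 and `other` pointing at `a`, the by-reference helper returns 42 and the promoted one returns 1, so promotion would change behavior in that case and has to be rejected.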

I’d love to hear any feedback on this approach, particularly since I’m not yet convinced that it shouldn’t be done on Clang’s AST instead. Thanks,

Ben