About OpenCL 2.x Dynamic Parallelism

Hi,
Our lab is working on compiling OpenCL 2.x to NVPTX,
but we have run into some problems with dynamic parallelism, blocks in particular.
In short, clang uses Objective-C’s approach to compile blocks in CL code.
It generates a block_literal struct to hold block information and a block_descriptor to hold captured variables.

My first question is:
There is a field in block_literal called “isa” that represents the type of the block. But this field always seems to be set to external symbols starting with “_NS”, which are Cocoa symbols. Is this necessary?

Second, clang puts captured variables in block_descriptor and passes the entire block_literal, which also contains a field holding a pointer to a block_descriptor instance, as an implicit “0th” argument to the real invoke function. But it seems that this approach doesn’t work well with the CL language, since it’s hard to handle the address space of block_literal instances.
We have an idea: pass all of the captured variables to the invoke function as function arguments, by value. What do you folks think?

Last but not least, is anyone working on implementing dynamic parallelism? There seems to be no discussion of it on this mailing list.

Best Regards,

Hi Bekket,

> There is a field in block_literal called “isa” that represents the type of the block. But this field always seems to be set to external symbols starting with “_NS”, which are Cocoa symbols. Is this necessary?

This is not required for OpenCL, but we reuse the ObjC implementation as much as possible at the moment. Of course, that doesn’t mean it can’t be changed for OpenCL.

> But it seems that this approach doesn’t work well with the CL language, since it’s hard to handle the address space of block_literal instances.

Could you elaborate here what your problem is exactly? Would adding address spaces in generated IR help?

> Last but not least, is anyone working on implementing dynamic parallelism? There seems to be no discussion of it on this mailing list.

We are working on this now. There were two commits to blocks diagnostics in the past weeks, and I am planning to set up the review of the enqueue_kernel builtin in the upcoming weeks. I am not aware of any work on blocks codegen to IR though. Is there anything in particular you need?

Also if you have any code you think might be useful to open source Clang that adds new functionality or improves existing work on this topic and you are happy to share, do let me know.

Cheers,

Anastasia

On March 20, 2016, at 5:37 PM, Anastasia Stulova <Anastasia.Stulova@arm.com> wrote:

> Hi Bekket,

>> There is a field in block_literal called "isa" that represents the type of the block. But this field always seems to be set to external symbols starting with "_NS", which are Cocoa symbols. Is this necessary?

> This is not required for OpenCL, but we reuse the ObjC implementation as much as possible at the moment. Of course, that doesn’t mean it can’t be changed for OpenCL.

>> But it seems that this approach doesn't work well with the CL language, since it's hard to handle the address space of block_literal instances.

> Could you elaborate here what your problem is exactly? Would adding address spaces in generated IR help?

The current approach transforms a block into a block_literal struct; one of its fields, block_descriptor, contains the captured variables.
Another field in block_literal is the function pointer to the “real” invoke function. Its first parameter is always the pointer to the block_literal instance, which the invoke function uses to retrieve the captured variables.
Here’s the problem: clang uses the alloca IR instruction to allocate space for the block_literal instance, so by default it is placed on the stack.
In an OpenCL kernel, alloca puts the block_literal instance into the private address space by default. The block we pass to enqueue_kernel then needs the pointer to the block_literal, which is passed as the first argument of the invoke function, to access the captured variables, but this fails because the pointer refers to the private address space.

>> Last but not least, is anyone working on implementing dynamic parallelism? There seems to be no discussion of it on this mailing list.

> We are working on this now. There were two commits to blocks diagnostics in the past weeks, and I am planning to set up the review of the enqueue_kernel builtin in the upcoming weeks.

One of our ideas is to “flatten” all of the captured variables. That is, at the point where a block variable is defined, copy the values of all captured variables into the invoke function. We can do this because those values can be determined at that moment (“captured by the Block as const copies”, as the spec says).

OpenCL C blocks are slightly different from normal ones. First, every block variable is const, so each must be defined at its declaration. Second, the “capture” (binding) is performed at the time the block variable is defined. This would cause a difference in behavior:
Here is a code fragment:

typedef void (^Block_t)(void);

int x = 1;
Block_t myBlock = ^(void) {
  printf("%d\n", x + 1);
};
x = 2;
myBlock();

Under normal circumstances, it would print 3. But in OpenCL C, since we bind x as a constant at definition, it would print 2. That’s why we think our “flatten” approach could work: copying the captured variables as constants just once would make things a lot easier.

We came up with this idea just a few days ago, so we haven’t produced any useful code yet. We’re also considering using builtins to implement it.
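
As a rough illustration of the “flatten” idea, here is a minimal sketch in plain C (no blocks extension), assuming a single int capture. The names my_block_invoke_flat and flatten_example are hypothetical and only model the idea; they are not what clang generates.

```c
#include <stdio.h>

/* Hypothetical flattened invoke function for a block like ^{ return x + 1; }.
 * The capture arrives as a by-value parameter instead of being read
 * through a pointer to a block_literal. */
static int my_block_invoke_flat(int captured_x) {
    return captured_x + 1;
}

/* Models the function that defines and later calls the block. */
static int flatten_example(void) {
    int x = 1;
    const int captured_x = x;  /* snapshot taken where the block is defined */
    x = 2;                     /* later write; the snapshot is unaffected */
    printf("x = %d, captured_x = %d\n", x, captured_x);
    return my_block_invoke_flat(captured_x);
}
```

Because the invoke function no longer takes a block_literal pointer, the question of which address space that pointer lives in does not come up for the captures themselves.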

> I am not aware of any work on blocks codegen to IR though. Is there anything in particular you need?

> Also if you have any code you think might be useful to open source Clang that adds new functionality or improves existing work on this topic and you are happy to share, do let me know.

Thank you very much.
We’re also interested in your approach. Perhaps we can work on this topic together?

Cheers,

McClane

In [Objective-]C, if the block is expected to persist beyond the lifetime of the caller, then the callee is expected to call _Block_copy to promote it to the heap. The compiler emits copy helpers (and descriptors for captured variables that have trivial copy semantics) that allow this to work with a little bit of support from the blocks runtime library.

For OpenCL, you may want to generalise this slightly to provide different target address spaces for the copy, but note that for __block to work correctly the target address space must be readable (and writeable) in the context of the caller. If you do not support __block variables then this is not an issue.

It sounds as if OpenCL’s requirements are much simpler than [Objective-]C’s. I looked at implementing flattening for blocks a few years ago, but it becomes quite complex when a single variable is bound to multiple blocks and the potential performance improvements did not justify the increased complexity. This is not an issue for you though.

It sounds as if OpenCL’s blocks are actually far closer to C++ lambdas with a default copy capture than they are to [Objective-]C blocks. It might be cleaner and simpler to treat them as special syntax for lambdas than as special semantics for blocks.

David

> In [Objective-]C, if the block is expected to persist beyond the lifetime of the caller, then the callee is expected to call _Block_copy to promote it to the heap. The compiler emits copy helpers (and descriptors for captured variables that have trivial copy semantics) that allow this to work with a little bit of support from the blocks runtime library.

> For OpenCL, you may want to generalise this slightly to provide different target address spaces for the copy,

I like this idea, but it might not be so easy to allocate space in the global or local address space and tell a sub-kernel to access it without exhausting the storage. Maybe a memory manager is required.

> but note that for __block to work correctly the target address space must be readable (and writeable) in the context of the caller. If you do not support __block variables then this is not an issue.

Fortunately, OpenCL-C 2.x doesn’t allow __block attribute : )

Cheers,

McClane

Hi Bekket,

> Under normal circumstances, it would print 3. But in OpenCL C, since we bind x as a constant at definition, it would print 2. That’s why we think our “flatten” approach could work: copying the captured variables as constants just once would make things a lot easier.

I am just not sure what you mean by “normal” here, because looking at ObjC code generation, it does the following for an example similar to yours:

typedef int (^const Block_t)(void);

void foo(){
  int x = 1;
  Block_t myBlock = ^(void){
    return x+1;
  };
  x = 2;
  myBlock();
}

  1. Allocate and initialise x:

%x = alloca i32
store i32 1, i32* %x

  2. Allocate block on the stack:

%block = alloca <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32 }>

  3. Store captures into the block fields:

%block.captured = getelementptr inbounds <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32 }>, <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32 }>* %block, i32 0, i32 5
%0 = load i32, i32* %x
store i32 %0, i32* %block.captured

  4. Assign the new value to x and the rest of the code:

store i32 2, i32* %x

  5. When we are calling the block with:

%call = call i32 %6(i8* %4)

The value of x will be used from the passed in capture field of the %block variable on the stack and its value is 1 (and not 2). I don’t see any difference between ObjC and OpenCL in this case.
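
To make the steps above concrete, here is a minimal sketch in plain C that mimics what the IR does. The names block_literal_model, block_descriptor_model, my_block_invoke and foo_model are hypothetical stand-ins, not clang's actual symbols, and the struct only approximates the packed IR type (in real ObjC codegen the isa field would point at a runtime symbol such as _NSConcreteStackBlock rather than NULL).

```c
#include <stddef.h>

/* Approximates %struct.__block_descriptor (a reserved word plus the block size). */
struct block_descriptor_model {
    unsigned long reserved;
    unsigned long size;
};

/* Approximates <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32 }>. */
struct block_literal_model {
    void *isa;                                          /* field 0 */
    int flags;                                          /* field 1 */
    int reserved;                                       /* field 2 */
    int (*invoke)(const struct block_literal_model *);  /* field 3 */
    const struct block_descriptor_model *descriptor;    /* field 4 */
    int captured_x;                                     /* field 5: copy of x */
};

/* The invoke function reads the capture through its first argument. */
static int my_block_invoke(const struct block_literal_model *self) {
    return self->captured_x + 1;
}

static const struct block_descriptor_model my_block_descriptor = {
    0, sizeof(struct block_literal_model)
};

static int foo_model(void) {
    int x = 1;                                  /* step 1 */
    struct block_literal_model block = {        /* step 2: block on the stack */
        NULL, 0, 0, my_block_invoke, &my_block_descriptor, 0
    };
    block.captured_x = x;                       /* step 3: store the capture */
    x = 2;                                      /* step 4 */
    (void)x;
    return block.invoke(&block);                /* step 5: sees captured_x == 1 */
}
```

In this model foo_model() returns 2, because the invoke function only ever sees the copy made in step 3.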

Regarding enqueue_kernel, for the moment we plan to just pass the local stack block descriptor variable into the generated IR call. The implementation of enqueue_kernel builtin can decide how to proceed with the values of the stack variable containing all the useful flags and captures. This implementation would however imply it has to be copied in some global buffer to be accessible outside, which isn’t ideal. It would be nicer to avoid memory copies overall. However, it doesn’t seem easy without significant modification to IR codegen or adding some sort of simplified global heap allocation supported on device.

However, you are right that, considering the restrictions on blocks in the OpenCL spec, we could modify the generated IR to make it more efficient. If you could provide more details on your “flatten” approach, or perhaps create a review or share some patches, we could look into extending the upstream implementation, provided that everyone finds it suitable.

Thanks,

Anastasia

Hi David,

> It sounds as if OpenCL’s requirements are much simpler than [Objective-]C’s. I looked at implementing flattening for blocks a few years ago, but it becomes quite complex when a single variable is bound to multiple blocks and the potential performance improvements did not justify the increased complexity. This is not an issue for you though.

We can still have some sort of multiple binding via function parameters though, for example:

typedef int (^block_t)();

void f1(block_t bl) {
  int i = bl(); // this calls bl1, bl2 and anonymous blocks expr below
}

void f2(int i) {
  const block_t bl1 = ^{return i+1;};
  const block_t bl2 = ^{return i+2;};
  f1(bl1);
  f1(bl2);
  f1(^{return i+3;});
}

However, they are all known (can be deduced) at compile time. It still creates some complications. I am not sure how flattening would work for dynamic parallelism, but note that we can also have captures that are not known at compile time (see i in this example). And the semantics are to be able to spawn the enqueued blocks anywhere on the device, with potentially isolated memory areas.

> It sounds as if OpenCL’s blocks are actually far closer to C++ lambdas with a default copy capture than they are to [Objective-]C blocks. It might be cleaner and simpler to treat them as special syntax for lambdas than as special semantics for blocks.

Yes, the use of blocks in OpenCL is semantically closer to C++ lambdas. I wish there would be easier ways to share C and C++ parts in the compiler. We might still look into it later though.

Thanks,
Anastasia

> I like this idea, but it might not be so easy to allocate space in the global or local address space and tell a sub-kernel to access it without exhausting the storage. Maybe a memory manager is required.

Yes, as I see it, some sort of simplified global/local memory heap support would be required in the implementation of enqueue_kernel on the device.
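
For illustration only, a very simplified heap of this kind could look like the following sketch in plain C. Everything here is an assumption: a static array stands in for a device-visible global buffer, captured_block is a hypothetical two-field block, and the allocator never frees, which is exactly the storage-exhaustion concern raised above. A real device implementation would also need address-space qualifiers and some way to refer to the invoke function other than a host function pointer.

```c
#include <stddef.h>
#include <string.h>

#define ARENA_BYTES 256

/* Stand-in for a global buffer; the union members only force alignment. */
static union {
    unsigned char bytes[ARENA_BYTES];
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} arena;
static size_t arena_used = 0;

/* Bump allocation in 16-byte steps; returns NULL when the arena is full. */
static void *arena_alloc(size_t size) {
    size_t rounded;
    void *p;
    if (size == 0 || size > ARENA_BYTES) return NULL;
    rounded = (size + 15u) & ~(size_t)15u;
    if (rounded > ARENA_BYTES - arena_used) return NULL;
    p = arena.bytes + arena_used;
    arena_used += rounded;
    return p;
}

/* Hypothetical block with one by-value capture. */
struct captured_block {
    int (*invoke)(const struct captured_block *);
    int captured_i;
};

static int add_one_invoke(const struct captured_block *self) {
    return self->captured_i + 1;
}

/* Copies a stack block into the arena so it outlives the caller's frame. */
static const struct captured_block *copy_block_to_arena(
        const struct captured_block *src) {
    void *dst = arena_alloc(sizeof *src);
    if (dst == NULL) return NULL;
    memcpy(dst, src, sizeof *src);
    return dst;
}

/* Builds the block on the stack, then hands back only the arena copy. */
static const struct captured_block *make_enqueued_copy(int i) {
    struct captured_block on_stack = { add_one_invoke, i };
    return copy_block_to_arena(&on_stack);
}
```

A caller would invoke the copy with b->invoke(b) after make_enqueued_copy has returned, which is the property the stack-allocated block lacks.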