[OpenCL] Ideas for hierarchical/dynamic parallelism - enqueue_kernel in OpenCL 2.0


I am planning to propose an implementation of device side enqueue (DSE) - enqueue_kernel and related BIFs from OpenCL v2.0 s6.13.17.

It would be useful to get some feedback beforehand on whether the general direction makes sense. We can of course discuss and tweak the details later on during code reviews.

The general steps would be as follows:

  1. Add enqueue_kernel, get_kernel_work_group_size and get_kernel_preferred_work_group_size_multiple as Clang builtins with custom checking in the Builtins.def file.

Example: enqueue_kernel(… /* omitted params */, block, /* optional sizes of passed block args, if any */)

This will allow us to diagnose the parameters of the passed block variable (the spec requires each of them to be of type ‘local void*’) and to check the different overloads too (Table 6.31).

  2. Generate an internal library call in IR for each new builtin used in the CL code, reusing ObjC block generation.

For the following example of CL code:

kernel void device_side_enqueue(…) {

… /* declare default_queue, flags, ndrange, a, b here */

enqueue_kernel(default_queue, flags, ndrange, ^(void) { a + b; });

}


The generated IR could be:

; from ObjC block CodeGen (the second field contains the size of the block literal record)

@__block_descriptor_tmp = internal constant { i64, i64, i8*, i8* } { i64 0, i64 52, i8* getelementptr inbounds ([35 x i8]* @.str, i32 0, i32 0), i8* null }

define void @device_side_enqueue() {

; from ObjC block CodeGen (block literal record with a capture)

%block = alloca <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i32}>

; from ObjC block CodeGen - store block descriptor and block captures below

; from ObjC block CodeGen (set pointer to block definition code)

%block.invoke = getelementptr inbounds <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i32}>* %block, i64 0, i32 3

store i8* bitcast (void (i8*)* @__device_side_enqueue_block_invoke to i8*), i8** %block.invoke

; potential impl of OpenCL CodeGen (cast from block literal record ptr to void ptr)

%1 = bitcast <{ i8*, i32, i32, i8*, %struct.__block_descriptor*, i32, i32}>* %block to i8*

; potential impl of OpenCL CodeGen (this function will have additional integer params at the end if the block takes any parameters)

… call i32 @__enqueue_kernel_impl(…, i8* %1)

}

define internal void @__device_side_enqueue_block_invoke(i8* nocapture readonly %.block_descriptor) { ; from ObjC block CodeGen (this can have more params of local void* type)

; from ObjC block CodeGen - load captures below

; from ObjC block CodeGen - original body of block

}

Note that __enqueue_kernel_impl will have to be implemented as part of an OpenCL runtime library. It will receive the block literal data structure (allocated locally, as in this example, if captures are present, or as a global variable otherwise), the size of each block literal parameter (from the ‘local void*’ list), and the other arguments omitted at the beginning (mainly opaque objects), and will perform the steps necessary to enqueue the work specified by the block.

The block literal record itself contains all the important bits needed for a basic implementation of DSE: a pointer to the block function definition, the captured fields, and the size of the block literal record. We can also discuss and implement some optimisations later on or as part of this work.

The implementation of __enqueue_kernel_impl will have to take care of (1) initiating execution of the block invoke code pointed to by the block literal record (%block.invoke in the example above), (2) copying captured variables to an accessible memory location, and (3) performing some sort of memory management to allocate space for the ‘local void*’ parameters passed to the block, if any.
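To make the three responsibilities more concrete, here is a minimal host-side sketch in plain C. It is only a simulation under stated assumptions, not a proposed runtime: the names block_literal_sketch, add_block_sketch, add_block_invoke_sketch and enqueue_kernel_impl_sketch, the trimmed record layout, the extra captured `out` pointer, and the 0 / -1 return values are all hypothetical. The queue, flags and ndrange arguments are left out, and the block runs synchronously instead of being scheduled on a device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical trimmed block literal header: invoke pointer plus the size of
   the whole record. Captured fields follow the header in the same record. */
typedef struct block_literal_sketch {
    void (*invoke)(void *block, void **local_args);
    size_t size;
} block_literal_sketch;

/* Hypothetical record for a block capturing a and b. The `out` capture is an
   assumption added here so that the effect of the block is observable. */
typedef struct add_block_sketch {
    block_literal_sketch header;
    int a;
    int b;
    int *out;
} add_block_sketch;

/* Stand-in for the generated block invoke function: loads the captures and
   runs the block body. */
void add_block_invoke_sketch(void *block, void **local_args) {
    add_block_sketch *self = (add_block_sketch *)block;
    (void)local_args; /* this block takes no 'local void*' parameters */
    *self->out = self->a + self->b;
}

/* Hypothetical runtime entry point. Returns 0 on success, -1 on failure. */
int enqueue_kernel_impl_sketch(const void *block, unsigned num_local_args,
                               const size_t *local_sizes) {
    const block_literal_sketch *header = (const block_literal_sketch *)block;
    int status = -1;
    unsigned allocated = 0;
    void **local_args = NULL;

    assert(header->size >= sizeof *header);

    /* (2) Copy the record, captures included, to memory the child can use. */
    void *copy = malloc(header->size);
    if (copy == NULL)
        return -1;
    memcpy(copy, block, header->size);

    /* (3) Allocate one buffer per 'local void*' parameter of the block. */
    if (num_local_args > 0) {
        local_args = (void **)calloc(num_local_args, sizeof *local_args);
        if (local_args == NULL)
            goto cleanup;
        for (; allocated < num_local_args; ++allocated) {
            size_t n = local_sizes[allocated] ? local_sizes[allocated] : 1;
            local_args[allocated] = malloc(n);
            if (local_args[allocated] == NULL)
                goto cleanup;
        }
    }

    /* (1) Start execution through the invoke pointer stored in the record. */
    ((block_literal_sketch *)copy)->invoke(copy, local_args);
    status = 0;

cleanup:
    for (unsigned i = 0; i < allocated; ++i)
        free(local_args[i]);
    free(local_args);
    free(copy);
    return status;
}
```

A caller would fill an add_block_sketch with the invoke pointer, sizeof(add_block_sketch) and the captures, then pass its address to enqueue_kernel_impl_sketch. A real device runtime would replace malloc with its own allocation of global and local memory and would defer the invoke call to the scheduler.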

  3. Modify ObjC blocks IR generation. A block literal record currently contains a number of fields that are not needed for OpenCL, i.e. isa, flags, and the copy and dispose helpers. They can be removed when compiling in OpenCL mode. We might potentially add extra fields to enable more efficient support of DSE or to facilitate compiler optimisations. Ideas are welcome! I expect some places might require taking care of address spaces too.
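As a rough illustration of what trimming could look like, the C sketch below puts the generic block literal header (following the layout used by ObjC block CodeGen and visible in the IR above) next to one possible reduced header for OpenCL. The struct names and the choice to inline the size next to the invoke pointer are assumptions made for this example only; the final layout is exactly what needs discussing.

```c
#include <assert.h>
#include <stddef.h>

/* Descriptor referenced by the generic block literal header. */
struct generic_block_descriptor_sketch {
    unsigned long reserved;
    unsigned long size;                        /* size of the literal record */
    void (*copy_helper)(void *dst, void *src); /* not needed for OpenCL */
    void (*dispose_helper)(void *src);         /* not needed for OpenCL */
};

/* Generic block literal header; captured fields follow it in the record. */
struct generic_block_literal_sketch {
    void *isa;                                 /* not needed for OpenCL */
    int flags;                                 /* not needed for OpenCL */
    int reserved;
    void (*invoke)(void *block);
    struct generic_block_descriptor_sketch *descriptor;
};

/* Hypothetical reduced header for OpenCL mode: only the record size and the
   invoke pointer survive; captured fields would still follow the header. */
struct opencl_block_literal_sketch {
    size_t size;
    void (*invoke)(void *block);
};

/* The reduced header is expected to be strictly smaller than the generic one. */
static_assert(sizeof(struct opencl_block_literal_sketch) <
                  sizeof(struct generic_block_literal_sketch),
              "trimmed header should be smaller");
```

On a typical 64-bit target the generic header takes 32 bytes before any captures, while this reduced header takes 16, and the separate descriptor global is no longer needed.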

  4. Potentially change existing OpenCL types. At the very least, it seems we might need to handle the ndrange_t type differently than we do currently. It’s an opaque type now, but we need it to be allocated on the stack because a local variable of that type can be declared in CL code.
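One way to make ndrange_t stack-allocatable would be to represent it as a plain struct. The C sketch below shows a possible layout together with a helper modelled on the ndrange_1D builtin. The struct name, field names, the helper names and the limit of three dimensions are assumptions for illustration, not a settled design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_WORK_DIM 3 /* assumed maximum number of dimensions */

/* Hypothetical non-opaque ndrange_t: a plain aggregate, so a local variable
   of this type can simply live on the stack. */
typedef struct ndrange_sketch_t {
    unsigned work_dim;
    size_t global_work_offset[SKETCH_MAX_WORK_DIM];
    size_t global_work_size[SKETCH_MAX_WORK_DIM];
    size_t local_work_size[SKETCH_MAX_WORK_DIM]; /* 0 = let the runtime pick */
} ndrange_sketch_t;

/* Modelled on ndrange_1D(global_work_size): one dimension, zero offset,
   local size left for the runtime to choose. */
ndrange_sketch_t ndrange_1D_sketch(size_t global_work_size) {
    ndrange_sketch_t r;
    memset(&r, 0, sizeof r);
    r.work_dim = 1;
    r.global_work_size[0] = global_work_size;
    return r;
}

/* Total number of work-items described by the range. */
size_t ndrange_total_work_items_sketch(const ndrange_sketch_t *r) {
    assert(r->work_dim >= 1 && r->work_dim <= SKETCH_MAX_WORK_DIM);
    size_t total = 1;
    for (unsigned i = 0; i < r->work_dim; ++i)
        total *= r->global_work_size[i];
    return total;
}
```

With a layout like this, `ndrange_sketch_t nd = ndrange_1D_sketch(64);` is an ordinary local variable, and the runtime side of enqueue_kernel can read the dimensions directly from the struct passed to it.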