OpenCL support

Peter Collingbourne peter at pcc.me.uk
Fri Nov 26 12:21:18 CST 2010

tagging with metadata the alloca instructions for __local variables

The way I planned to handle this was to give __local variables with
function scope a static storage-class, so they would be codegen'd into
global variables rather than alloca instructions. This makes the
implementation easier on the LLVM side as we don't yet have support
for address space attributes on alloca instructions (remember that
pointers to __local variables must also be __local). I implemented
a patch [2] for this, but decided to wait until the OpenCL semantic
support in Clang was more mature.

Thanks,
--
Peter

[1] http://lists.cs.uiuc.edu/pipermail/cfe-dev/attachments/20101125/b306471a/attachment-0006.bin
[2] http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20101018/035558.html

Peter, I believe it is incorrect to make __local variables static and
therefore codegen'd into global variables. The reason is that the
storage for a __local variable is shared between different work items
in the same group, but should be different for work items in different
groups.

Consider this:

  __kernel void foo(__global int *A) {

     __local int x;

     ...

  }

If this kernel is executed over an NDRange with more than one work
group running in parallel, then the storage for "x" should not be
shared between the different work groups, but it should be shared
within each work group.

I don't know how existing OpenCL implementations handle this case. It
seems it would be a good idea to transform the code so that uses of x
become loads and stores from memory, and the address for that memory
is returned by a builtin function that itself is dependent on work
group ids.
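To make that concrete, here is a rough sketch of the source-level result
I have in mind for the example above (the builtin name
__get_local_x_addr is made up purely for illustration):

extern __local int *__get_local_x_addr(void); /* hypothetical builtin: returns
                                                  a work-group-dependent
                                                  address for x */

__kernel void foo(__global int *A) {
   __local int *px = __get_local_x_addr(); /* same address for every work item
                                              in a group, different address
                                              for different groups */
   *px = A[0];    /* a write to x becomes a store through px */
   A[0] = *px;    /* a read of x becomes a load through px */
}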

I'm just learning Clang now, so I'm not prepared to say how that would
be done. Is it okay to transform the AST before semantic analysis?
Where should I start looking? (I would guess lib/Sema...)

regards,

David Neto

Hi David,

Peter, I believe it is incorrect to make __local variables static and
therefore codegen'd into global variables. The reason is that the
storage for a __local variable is shared between different work items
in the same group, but should be different for work items in different
groups.

...

I don't know how existing OpenCL implementations handle this case.

Most GPU architectures have a separate address space for memory shared
within a work group, where a given logical memory address corresponds
to a different physical address dependent on the work group. Existing
OpenCL implementations for GPUs handle this case simply by allocating
__local variables as global variables within this address space.
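As a sketch only (the variable name is invented, and address space 2 for
__local is used here purely as an example), the earlier kernel is lowered
roughly as if it had been written:

/* the function-scope __local variable becomes a module-scope variable
   placed in the local address space; the hardware maps each logical
   address in that space to per-work-group physical storage */
__attribute__((address_space(2))) int foo_x;

__kernel void foo(__global int *A) {
   foo_x = A[0];   /* every work item in a group sees the same foo_x, */
   A[0] = foo_x;   /* but each work group gets its own physical copy  */
}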

It
seems it would be a good idea to transform the code so that uses of x
become loads and stores from memory, and the address for that memory
is returned by a builtin function that itself is dependent on work
group ids.

I'm just learning Clang now, so I'm not prepared to say how that would
be done. Is it okay to transform the AST before semantic analysis?
Where should I start looking? (I would guess lib/Sema...)

This transformation may be useful for a CPU based OpenCL
implementation, but would not be appropriate in Sema for a few
reasons. The first is that the AST should at all times be an accurate
representation of the input source code.

The second is that such a transformation would be specific to the
OpenCL implementation -- not only would it be inappropriate for
GPUs but there are a number of feasible CPU based implementation
techniques which we shouldn't have to teach Sema or in fact any part
of Clang about.

The best place to do this transformation would be at the LLVM level
with an implementation specific transformation pass.

Thanks,

Peter,

Thanks for your informative reply. I appreciate the advice about the
high level intent of Sema.

Hi David,

Peter, I believe it is incorrect to make __local variables static and
therefore codegen'd into global variables. The reason is that the
storage for a __local variable is shared between different work items
in the same group, but should be different for work items in different
groups.

...

I don't know how existing OpenCL implementations handle this case.

Most GPU architectures have a separate address space for memory shared
within a work group, where a given logical memory address corresponds
to a different physical address dependent on the work group. Existing
OpenCL implementations for GPUs handle this case simply by allocating
__local variables as global variables within this address space.

Ah. Thanks for this tidbit about GPUs.

It
seems it would be a good idea to transform the code so that uses of x
become loads and stores from memory, and the address for that memory
is returned by a builtin function that itself is dependent on work
group ids.

I'm just learning Clang now, so I'm not prepared to say how that would
be done. Is it okay to transform the AST before semantic analysis?
Where should I start looking? (I would guess lib/Sema...)

This transformation may be useful for a CPU based OpenCL
implementation, but would not be appropriate in Sema for a few
reasons. The first is that the AST should at all times be an accurate
representation of the input source code.

The second is that such a transformation would be specific to the
OpenCL implementation -- not only would it be inappropriate for
GPUs but there are a number of feasible CPU based implementation
techniques which we shouldn't have to teach Sema or in fact any part
of Clang about.

The best place to do this transformation would be at the LLVM level
with an implementation specific transformation pass.

Ok. Now I'm even more convinced that your patch [1] is incorrect because:
(a) it's specific to GPU-style implementations of OpenCL, not the
generic semantics of OpenCL.
(b) it pushes target-specific assumptions into Sema. But you've just
argued that the AST should reflect the original source code as much as
possible.

On (a): I understand that ARM is preparing to contribute a more
complete OpenCL front-end to Clang. It would be great to nail down a
common front end with generic OpenCL semantics, and let later stages
(Clang's CodeGen? LLVM IR pass?) handle more target-specific
assumptions. E.g. it would be nice to standardize on how Clang
handles OpenCL's local, global, etc. etc. etc. E.g. just agreeing on
address space numbering would be a step forward. (e.g. global is 1,
local is 2...)

What do I think your patch should look like? It's true that the
diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
the __local address space.

But we need to implement the semantics somehow. Conceptually I think
of it as a CL source-to-source transformation that lowers
function-scope-local-address-space variables into a more primitive
form.

I think I disagree that Clang is an inappropriate spot for
implementing this type of transform: Clang "knows" the source language
semantics, and has a lot of the machinery required for the transform.
Clang also knows a lot about the target machine (e.g. type
sizes, builtins, more?).

So I believe the "auto var in different address space" case should be
allowed in the AST in the OpenCL case, and the local-lowering
transform should be applied in CodeGen. Perhaps the lowering is
target-specific, e.g. GPU-style, or more generic style as I proposed.

Thoughts?

Thanks,
--
Peter

[1] http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20101018/035558.html

Hi David,

>> It
>> seems it would be a good idea to transform the code so that uses of x
>> become loads and stores from memory, and the address for that memory
>> is returned by a builtin function that itself is dependent on work
>> group ids.
>>
>> I'm just learning Clang now, so I'm not prepared to say how that would
>> be done. Is it okay to transform the AST before semantic analysis?
>> Where should I start looking? (I would guess lib/Sema...)
>
> This transformation may be useful for a CPU based OpenCL
> implementation, but would not be appropriate in Sema for a few
> reasons. The first is that the AST should at all times be an accurate
> representation of the input source code.
>
> The second is that such a transformation would be specific to the
> OpenCL implementation -- not only would it be inappropriate for
> GPUs but there are a number of feasible CPU based implementation
> techniques which we shouldn't have to teach Sema or in fact any part
> of Clang about.
>
> The best place to do this transformation would be at the LLVM level
> with an implementation specific transformation pass.

Ok. Now I'm even more convinced that your patch [1] is incorrect because:
(a) it's specific to GPU-style implementations of OpenCL, not the
generic semantics of OpenCL.
(b) it pushes target-specific assumptions into Sema. But you've just
argued that the AST should reflect the original source code as much as
possible.

Yes, that's why I don't like the patch so much :-) It was really
designed to work with the current infrastructure, which isn't
very well suited to more exotic languages like OpenCL.

On (a): I understand that ARM is preparing to contribute a more
complete OpenCL front-end to Clang. It would be great to nail down a
common front end with generic OpenCL semantics, and let later stages
(Clang's CodeGen? LLVM IR pass?) handle more target-specific
assumptions. E.g. it would be nice to standardize on how Clang
handles OpenCL's local, global, etc. etc. etc. E.g. just agreeing on
address space numbering would be a step forward. (e.g. global is 1,
local is 2...)

+llvmdev, as this is also an LLVM-relevant issue.

I agree. We should set a standard for address spaces in LLVM - a low
range for 'standard' address spaces (with a defined semantics for each
value in that range) and a high range for target-specific spaces.
It looks like address spaces are already being used this way to a
certain extent in the targets (X86 uses 256 -> GS, 257 -> FS). And
I think 256 'standard' address spaces should be enough, but I'm happy
to be proven wrong :-)
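To make the proposal concrete, a front end could expose the 'standard'
numbering through something like the following (illustrative only; the
constant/private values are placeholders, and only global=1 and local=2
come from your example):

/* purely illustrative numbering for the 'standard' low range */
#define OCL_ADDR_SPACE_PRIVATE   0   /* the default address space        */
#define OCL_ADDR_SPACE_GLOBAL    1   /* per your example: global is 1    */
#define OCL_ADDR_SPACE_LOCAL     2   /*                   local is 2     */
#define OCL_ADDR_SPACE_CONSTANT  3   /* placeholder                      */

/* OpenCL qualifiers map onto Clang's address_space attribute */
#define __cl_global   __attribute__((address_space(OCL_ADDR_SPACE_GLOBAL)))
#define __cl_local    __attribute__((address_space(OCL_ADDR_SPACE_LOCAL)))
#define __cl_constant __attribute__((address_space(OCL_ADDR_SPACE_CONSTANT)))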

What do I think your patch should look like? It's true that the
diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
the __local address space.

But we need to implement the semantics somehow. Conceptually I think
of it as a CL source-to-source transformation that lowers
function-scope-local-address-space variables into a more primitive
form.

I think I disagree that Clang is an inappropriate spot for
implementing this type of transform: Clang "knows" the source language
semantics, and has a lot of the machinery required for the transform.
Clang also knows a lot about the target machine (e.g. type
sizes, builtins, more?).

So I believe the "auto var in different address space" case should be
allowed in the AST in the OpenCL case, and the local-lowering
transform should be applied in CodeGen. Perhaps the lowering is
target-specific, e.g. GPU-style, or more generic style as I proposed.

Thoughts?

I've been rethinking this and perhaps coming around to this way
of thinking. Allocating variables in the __local address space
is really something that can't be represented at the LLVM level,
at least in a standard form.

But to a certain extent both auto and static storage-classes are wrong
here. Auto implies that each invocation of the function gets its own
variable, while static implies that all invocations share a variable.

Perhaps the right thing to do here is to introduce a new storage-class
for __local variables (let's call it the 'wg-local' storage-class).
A variable cannot be made wg-local with a storage-class specifier but
function-scope-local-address-space variables would be made so in a
similar way to my original patch. The target would then be required
to define at CodeGen the semantics of declaring wg-local variables and
loading and storing from local address space in the way you propose.

A side effect of this is that we will require a mapping of
target-unsupported address spaces to supported address spaces in
CodeGen. For example, a CPU based implementation should map the
local address space to 0.
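As a sketch of the intended semantics only (all names below are
invented), on a CPU where the local address space maps to 0, a wg-local
variable could be lowered to per-work-group storage handed out by the
runtime:

/* wg-local int x;  -- each work group gets one copy, shared by its work items */
#define MAX_GROUPS_IN_FLIGHT 64          /* implementation-chosen limit */

static int wg_local_x[MAX_GROUPS_IN_FLIGHT];

static int *wg_local_addr_of_x(unsigned group_index) {
   /* every work item in group 'group_index' receives the same address */
   return &wg_local_x[group_index];
}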

Thanks,

From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu]
On Behalf Of Peter Collingbourne
Sent: Monday, December 06, 2010 2:56 PM
To: David Neto
Cc: cfe-dev@cs.uiuc.edu; llvmdev@cs.uiuc.edu
Subject: Re: [LLVMdev] [cfe-dev] OpenCL support

Hi David,

> >> It
> >> seems it would be a good idea to transform the code so that uses of x
> >> become loads and stores from memory, and the address for that memory
> >> is returned by a builtin function that itself is dependent on work
> >> group ids.
> >>
> >> I'm just learning Clang now, so I'm not prepared to say how that would
> >> be done. Is it okay to transform the AST before semantic analysis?
> >> Where should I start looking? (I would guess lib/Sema...)
> >
> > This transformation may be useful for a CPU based OpenCL
> > implementation, but would not be appropriate in Sema for a few
> > reasons. The first is that the AST should at all times be an accurate
> > representation of the input source code.
> >
> > The second is that such a transformation would be specific to the
> > OpenCL implementation -- not only would it be inappropriate for
> > GPUs but there are a number of feasible CPU based implementation
> > techniques which we shouldn't have to teach Sema or in fact any part
> > of Clang about.
> >
> > The best place to do this transformation would be at the LLVM level
> > with an implementation specific transformation pass.
>
> Ok. Now I'm even more convinced that your patch [1] is incorrect because:
> (a) it's specific to GPU-style implementations of OpenCL, not the
> generic semantics of OpenCL.
> (b) it pushes target-specific assumptions into Sema. But you've just
> argued that the AST should reflect the original source code as much as
> possible.

Yes, that's why I don't like the patch so much :-) It was really
designed to work with the current infrastructure, which isn't
very well suited to more exotic languages like OpenCL.

> On (a): I understand that ARM is preparing to contribute a more
> complete OpenCL front-end to Clang. It would be great to nail down a
> common front end with generic OpenCL semantics, and let later stages
> (Clang's CodeGen? LLVM IR pass?) handle more target-specific
> assumptions. E.g. it would be nice to standardize on how Clang
> handles OpenCL's local, global, etc. etc. etc. E.g. just agreeing on
> address space numbering would be a step forward. (e.g. global is 1,
> local is 2...)

+llvmdev, as this is also an LLVM-relevant issue.

I agree. We should set a standard for address spaces in LLVM - a low
range for 'standard' address spaces (with a defined semantics for each
value in that range) and a high range for target-specific spaces.
It looks like address spaces are already being used this way to a
certain extent in the targets (X86 uses 256 -> GS, 257 -> FS). And
I think 256 'standard' address spaces should be enough, but I'm happy
to be proven wrong :-)

[Villmow, Micah] It would be very beneficial to define these. The main issue is the
default address space: in OpenCL it is private, while in LLVM it is closer to global than private.

> What do I think your patch should look like? It's true that the
> diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
> the __local address space.
>
> But we need to implement the semantics somehow. Conceptually I think
> of it as a CL source-to-source transformation that lowers
> function-scope-local-address-space variables into a more primitive
> form.
>
> I think I disagree that Clang is an inappropriate spot for
> implementing this type of transform: Clang "knows" the source language
> semantics, and has a lot of the machinery required for the transform.
> Clang also knows a lot about the target machine (e.g. type
> sizes, builtins, more?).
>
> So I believe the "auto var in different address space" case should be
> allowed in the AST in the OpenCL case, and the local-lowering
> transform should be applied in CodeGen. Perhaps the lowering is
> target-specific, e.g. GPU-style, or more generic style as I proposed.
>
> Thoughts?

I've been rethinking this and perhaps coming around to this way
of thinking. Allocating variables in the __local address space
is really something that can't be represented at the LLVM level,
at least in a standard form.

[Villmow, Micah] We ran across this problem in our OpenCL implementation. However, you can create a global variable with an '__local' address space and it works fine. There is an issue with collision between auto-arrays in different kernels, but that can be solved with a little name mangling. There are other ways to do this, for example, by converting local auto-arrays into kernel local pointer arguments with a known size.

But to a certain extent both auto and static storage-classes are wrong
here. Auto implies that each invocation of the function gets its own
variable, while static implies that all invocations share a variable.

Perhaps the right thing to do here is to introduce a new storage-class
for __local variables (let's call it the 'wg-local' storage-class).
A variable cannot be made wg-local with a storage-class specifier but
function-scope-local-address-space variables would be made so in a
similar way to my original patch. The target would then be required
to define at CodeGen the semantics of declaring wg-local variables and
loading and storing from local address space in the way you propose.

[Villmow, Micah] I'd move away from adding a new storage class as using the address space alone is sufficient to handle OpenCL's __local address space.

Here's a little example to show the direction I was heading, with an
illustration as a CL-to-C translation. I believe there are no
namespace issues, but otherwise it is essentially the same as the global
variable solution.

The idea is that the func scope local addr variables are like a stack
frame that is shared between the different work items in a group. So
collect all those variables in an anonymous struct, and then create a
function scope private variable to point to the one copy of that
struct. The pointer is returned by a system-defined intrinsic
function dependent on the current work item. (The system knows what
work groups are in flight, which is why you need a system-defined
intrinsic.)

So a kernel function like this:

void foo(__global int*A) {
   __local int vint;
   __local int *vpint;
   __local int const *vcpint;
   __local int volatile vvint;
   int a = A[0];
   vint = a;
   vvint = a;
   int a2 = vint;
   int va2 = vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

is translated to this, which does pass through Clang, with __local
meaning __attribute__((address_space(2))):

extern __local void * __get_work_group_local_base_addr(void); // intrinsic
void foo(__global int*A) {
   __local struct __local_vars_s {
      int vint;
      int *vpint;
      int const *vcpint;
      int volatile vvint;
   } * const __local_vars
            // this is a *private* variable, pointing to *local* addresses.
            // it's a const pointer because it shouldn't change; and
            // being const may expose optimizations
       = __get_work_group_local_base_addr(); // the new intrinsic
   int a = A[0];
   __local_vars->vint = a; // l-values are translated as memory stores.
   __local_vars->vvint = a;
   int a2 = __local_vars->vint; // r-values are translated as memory loads
   int va2 = __local_vars->vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

As an extension, the backend ought to be able to use some smarts to
simplify this down in simple cases. For example if the system only
ever allows one work group at a time, then the intrinsic could boil
down to returning a constant, and then link time optimization can
scrub away unneeded work. Similarly if you have a GPU style
environment where (as Peter described) the "local" addresses are the
same integer value but in different groups point to different storage,
then again the intrinsic returns a constant and again LTO optimizes
the result.
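For instance (a sketch only, assuming a CPU runtime that runs one work
group at a time and maps __local to the default address space), the
intrinsic could just hand back a fixed buffer, which LTO can then
constant-fold:

/* one work group in flight at a time, so one block of "local" storage
   suffices; the size is whatever the implementation reserves per group */
static char wg_local_pool[4096];

void *__get_work_group_local_base_addr(void) {
   return wg_local_pool;
}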

I haven't thought through the implications of a kernel having such
vars calling another kernel having such variables. At least the
OpenCL spec says that the behaviour is implementation-defined for such
a case. It would be nice to be able to represent any of the sane
possibilities.

@Anton: Regarding ARM's open-sourcing: I'm glad to see the
reaffirmation, and I look forward to the contribution. Yes, I
understand the virtues of patience. :-)
I assume you plan to commit a document describing how OpenCL is
supported. (e.g. how details like the above are handled.)

thanks,
david

I would reconsider Micah’s suggestion. The simple solution is to tag the variable with an address space and turn it into a global. You can do that with a simple change in CodeGenFunction::CreateStaticBlockVarDecl. It would give all the benefits you describe in that the target decides how to lower the code but do it using concepts that LLVM and some targets may already support. Kernels that call kernels with locals will also work.

Perhaps an example is useful? Our OpenCL implementation, given the code above, generates this bitcode (after optimizations that have eliminated the dead vars):

target datalayout = "e-p:32:32:32-f64:64:64-i64:64:64"
target triple = "zms-ziilabs-opencl10"

@foo.auto.vint = internal addrspace(2) global i32 0, align 4
@foo.auto.vvint = internal addrspace(2) global i32 0, align 4

define void @foo(i32 addrspace(1)* %A) nounwind {
entry:
  %tmp1 = load i32 addrspace(1)* %A, align 4
  store i32 %tmp1, i32 addrspace(2)* @foo.auto.vint, align 4
  volatile store i32 %tmp1, i32 addrspace(2)* @foo.auto.vvint, align 4
  %tmp7 = volatile load i32 addrspace(2)* @foo.auto.vvint, align 4
  tail call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 false)
  %add = add nsw i32 %tmp7, %tmp1
  store i32 %add, i32 addrspace(1)* %A, align 4
  ret void
}

Krister

Thanks for the real compilation output.

But it does not support an environment in which two work groups are
being executed at the same time. The work groups should be isolated
from each other, and so should have different storage for each of
those variables.

For example if we have two work groups with 4 work items each and
everything is run in parallel, then we should have two storage
locations for "vint". The first copy will be shared between the 4
items/threads in work group 0, and the second copy will be shared
between the 4 items/threads in group 1.

Now, it's fine for a particular implementation to decide it only wants
to ever run one work group at a time. So this is an ok choice inside
CodeGen if you know what target you're compiling for.
My original point was that making such a lowering decision in the AST
is overly restrictive.

(I hope I'm not being unnecessarily picky.)

I see that Peter's proposed patch has not made it into SVN, so I won't
file a bug. Instead I'll wait and monitor ARM's patches. (No rush,
honest!)

cheers,
david

From: David Neto [mailto:dneto.llvm@gmail.com]
Sent: Tuesday, December 07, 2010 1:03 PM
To: Villmow, Micah
Cc: cfe-dev@cs.uiuc.edu; llvmdev@cs.uiuc.edu
Subject: Re: [cfe-dev] [LLVMdev] OpenCL support

>> From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu]
>> On Behalf Of Peter Collingbourne
>> Sent: Monday, December 06, 2010 2:56 PM
>> To: David Neto
>> Cc: cfe-dev@cs.uiuc.edu; llvmdev@cs.uiuc.edu
>> Subject: Re: [LLVMdev] [cfe-dev] OpenCL support
>>
>> Hi David,
>>
>> > What do I think your patch should look like? It's true that the
>> > diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
>> > the __local address space.
>> >
>> > But we need to implement the semantics somehow. Conceptually I think
>> > of it as a CL source-to-source transformation that lowers
>> > function-scope-local-address-space variables into a more primitive
>> > form.
>> >
>> > I think I disagree that Clang is an inappropriate spot for
>> > implementing this type of transform: Clang "knows" the source language
>> > semantics, and has a lot of the machinery required for the transform.
>> > Clang also knows a lot about the target machine (e.g. type
>> > sizes, builtins, more?).
>> >
>> > So I believe the "auto var in different address space" case should be
>> > allowed in the AST in the OpenCL case, and the local-lowering
>> > transform should be applied in CodeGen. Perhaps the lowering is
>> > target-specific, e.g. GPU-style, or more generic style as I proposed.
>> >
>> > Thoughts?
>>
>> I've been rethinking this and perhaps coming around to this way
>> of thinking. Allocating variables in the __local address space
>> is really something that can't be represented at the LLVM level,
>> at least in a standard form.
> [Villmow, Micah] We ran across this problem in our OpenCL
> implementation. However, you can create a global variable with an
> '__local' address space and it works fine. There is an issue with
> collision between auto-arrays in different kernels, but that can be
> solved with a little name mangling. There are other ways to do this,
> for example, by converting local auto-arrays into kernel local pointer
> arguments with a known size.

Here's a little example to show the direction I was heading, with an
illustration as a CL-to-C translation. I believe there are no
namespace issues, but otherwise it is essentially the same as the global
variable solution.

The idea is that the func scope local addr variables are like a stack
frame that is shared between the different work items in a group. So
collect all those variables in an anonymous struct, and then create a
function scope private variable to point to the one copy of that
struct. The pointer is returned by a system-defined intrinsic
function dependent on the current work item. (The system knows what
work groups are in flight, which is why you need a system-defined
intrinsic.)

So a kernel function like this:

void foo(__global int*A) {
   __local int vint;
   __local int *vpint;
   __local int const *vcpint;
   __local int volatile vvint;
   int a = A[0];
   vint = a;
   vvint = a;
   int a2 = vint;
   int va2 = vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

[Villmow, Micah] This example is incorrect. There is a race condition between the writes to vint and vvint and the reads from vint/vvint. The reason is that all threads in a work-group share the memory that vint is allocated in. So if you have two work-items in a work-group, both work-items are writing to the same memory location and you don't know which thread is writing to that location. Also, the barrier is in the wrong location: you need to ensure that all threads' writes to the local variable have completed before you can safely start reading from it. A quick example using two work items running in parallel:
Cycle   WI 1          WI 2
  1     read A[0]
  2                   read A[0]
  3     write vint
  4                   write vint
  5     write vvint
  6     read vint
  7                   write vvint
  8                   read vint
 10     read vvint
 11                   read vvint
 12     barrier
 13                   barrier

It is pretty easy to see why it is a race condition: first, there is no synchronization between the stores and the loads, and second, both work items are writing to the same memory. This is even more troublesome on AMD GPUs, where we run up to 256 work items in parallel. I would keep this in mind when attempting to define some standard way to do this.
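For what it's worth, a race-free version of that example (illustrative
only) would have a single work item do the stores and put the barrier
between the writes and the reads, e.g.:

__kernel void foo(__global int *A) {
   __local int vint;
   int a = A[0];
   if (get_local_id(0) == 0)
      vint = a;                      /* one writer per work group        */
   barrier(CLK_LOCAL_MEM_FENCE);     /* all writes complete before reads */
   int a2 = vint;                    /* now every work item reads safely */
   if (get_local_id(0) == 0)
      A[0] = a2;                     /* likewise, a single writer to A[0] */
}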

is translated to this, which does pass through Clang, with __local
meaning __attribute__((address_space(2))):

extern __local void * __get_work_group_local_base_addr(void); // intrinsic
void foo(__global int*A) {
   __local struct __local_vars_s {
      int vint;
      int *vpint;
      int const *vcpint;
      int volatile vvint;
   } * const __local_vars
            // this is a *private* variable, pointing to *local* addresses.
            // it's a const pointer because it shouldn't change; and
            // being const may expose optimizations
       = __get_work_group_local_base_addr(); // the new intrinsic
   int a = A[0];
   __local_vars->vint = a; // l-values are translated as memory stores.
   __local_vars->vvint = a;
   int a2 = __local_vars->vint; // r-values are translated as memory loads
   int va2 = __local_vars->vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

[Villmow, Micah] I'd prefer not to embed them in a structure for the simple reason that it isn't the way we do it. What we do for OpenCL at AMD on the GPU is we turn it into something like the following:
@cllocal_foo_vint int* addressspace(2) [1xi32]; <-- forgive my pseudo LLVM-IR.
@cllocal_foo_vvint int* addressspace(2) volatile [1xi32];
void foo(__global int*A) {
    __local int *vint = &cllocal_foo_vint;
    __local int *vpint; <-- this is just a pointer and requires no modification
    __local int const *vcpint; <-- same here, only arrays or variables require modification
    __local int volatile *vvint = &cllocal_foo_vvint;
    int a = A[0];
    *vint = a;
    *vvint = a;
    int a2 = *vint;
    int va2 = *vvint;
    A[0] = a2 + va2;
}

Also, don't forget there is a difference between local int a and local int *a: one is just a pointer to some pre-allocated memory and the other actually
requires memory to be allocated. Any implementation should ignore local pointer variables and only deal with local scalar, vector and array variables.
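To illustrate that distinction (kernel and variable names here are
arbitrary):

__kernel void example(void) {
   __local int a;        /* needs group-shared storage allocated by the implementation */
   __local int buf[64];  /* likewise: an array in local memory that must be allocated  */
   __local int *p;       /* just a private pointer into local memory; no local storage */
   p = &a;               /* it can simply point at storage allocated elsewhere         */
}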

As an extension, the backend ought to be able to use some smarts to
simplify this down in simple cases. For example if the system only
ever allows one work group at a time, then the intrinsic could boil
down to returning a constant, and then link time optimization can
scrub away unneeded work. Similarly if you have a GPU style
environment where (as Peter described) the "local" addresses are the
same integer value but in different groups point to different storage,
then again the intrinsic returns a constant and again LTO optimizes
the result.

I haven't thought through the implications of a kernel having such
vars calling another kernel having such variables. At least the
OpenCL spec says that the behaviour is implementation-defined for such
a case. It would be nice to be able to represent any of the sane
possibilities.

[Villmow, Micah] A kernel with locally defined variables cannot call another kernel with local variables. This is illegal in OpenCL because when a kernel calls another kernel, that kernel is treated as a normal function. A normal function is not allowed to have local variables declared inside the body. Basically this restricts locally defined variables in a kernel body to the top level kernel function only.

Micah

This is a bit of a side issue for this thread. Yes, I understand the
race conditions, but thank you for being clear.

My point was to show a translation scheme for code, in particular how
storage is allocated, and how reads and writes are handled. The
original code was stupid. The job of the compiler is not to fix the
semantics of the code it is translating, but to faithfully translate
it. :-)

cheers,
david