RFC: Representation of OpenCL Memory Spaces

The problem I want to address in this discussion is the representation of OpenCL/CUDA memory spaces in LLVM IR. As support for OpenCL and CUDA matures within Clang, it is important that we represent memory spaces in a way that is (1) sufficiently generic that other language front-ends can easily emit the needed annotations, and (2) sufficiently specific that LLVM optimization passes can perform aggressive optimizations.

1. Introduction

Support for OpenCL/CUDA, and potentially future language extensions, requires the compiler to differentiate between different types of memory. For example, OpenCL has a “__global” memory space, which corresponds to globally-accessible data and is usually backed by off-chip memory on GPUs, and a “__local” memory space, which corresponds to work-group data (not accessible by work-items outside of the current work-group) and is usually backed by on-chip scratchpad memory on GPUs. This information is currently represented in Clang/LLVM using the addrspace() attribute on pointer types, where the mapping from OpenCL memory space to target address space is defined by the requested target (e.g. PTX, X86, etc.).

This leads to a few issues. First, some existing targets already use LLVM address spaces for other purposes, so supporting OpenCL (as currently implemented in Clang) on these targets would require significant restructuring in the back-end. Second, LLVM address spaces do not provide enough semantic knowledge for optimization passes. For example, consider pointer aliasing in the following kernel:

__kernel void foo(__global float* a, __local float* b) {
  b[0] = a[0];
}

If we compile this with Clang targeting PTX, the resulting LLVM IR will be:

target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ptx_kernel void @foo(float* nocapture %a, float addrspace(4)* nocapture %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4, !tbaa !1
  store float %0, float addrspace(4)* %b, align 4, !tbaa !1
  ret void
}

!opencl.kernels = !{!0}

!0 = metadata !{void (float*, float addrspace(4)*)* @foo}
!1 = metadata !{metadata !"float", metadata !2}
!2 = metadata !{metadata !"omnipotent char", metadata !3}
!3 = metadata !{metadata !"Simple C/C++ TBAA", null}

Does the load from %a alias the store to %b? Using the semantics of OpenCL, they cannot alias since they correspond to two different memory spaces. However, if we just look at the information in the LLVM IR, then basic alias analysis cannot determine if aliasing occurs because disjoint memory is not a property of LLVM address spaces. Therefore, we are not able to optimize as much as we could.

It is becoming increasingly clear to me that LLVM address spaces are not the general solution to OpenCL/CUDA memory spaces. They are a convenient hack to get things working in the short term, but I think a more long-term approach should be discussed and decided upon now before the OpenCL and CUDA implementations in Clang/LLVM get too mature. To be clear, I am not advocating that targets change to a different method for representing device memory spaces. The current use of address spaces to represent different types of device memory is perfectly valid, IMHO. However, this knowledge should not be encoded in front-ends and pre-SelectionDAG optimization passes.

2. Solutions

A couple of solutions to this problem are presented here, with the hope that the Clang/LLVM community will offer a constructive discussion on how best to proceed with OpenCL/CUDA support in Clang/LLVM. The following list is in no way meant to be exhaustive; it merely serves as a starting basis for discussion.

2A. Extend TBAA

In theory, the type-based alias analysis pass could be extended to (properly) support aliasing queries for pointers in OpenCL kernels. Currently, it has no way of knowing if two pointers in different address spaces can alias, and in fact cannot know if this is the case given the definition of LLVM address spaces. Instead of programming it with target-specific knowledge, it can be extended with language-specific knowledge. Instead of considering address spaces, the Clang portion of TBAA can be programmed to use OpenCL attributes to extend its pointer metadata. Specifically, pointers to different memory spaces are in essence different types and cannot alias. For the kernel shown above, the resulting LLVM IR could be:

; ModuleID = 'test1.cl'
target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ptx_kernel void @foo(float* nocapture %a, float addrspace(4)* nocapture %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4, !tbaa !1
  store float %0, float addrspace(4)* %b, align 4, !tbaa !2
  ret void
}

!opencl.kernels = !{!0}

!0 = metadata !{void (float*, float addrspace(4)*)* @foo}
!1 = metadata !{metadata !"float$__global", metadata !3}
!2 = metadata !{metadata !"float$__local", metadata !3}
!3 = metadata !{metadata !"omnipotent char", metadata !4}
!4 = metadata !{metadata !"Simple C/C++ TBAA", null}

The differences from the previous IR are the !tbaa tags (!1 and !2). Here, the TBAA pass would be able to identify that the loads and stores do not alias. Of course, when compiling in non-OpenCL/CUDA mode, TBAA would work just as before.

Pros:

Relatively easy to implement

Cons:

Does not solve the full problem; for example, it does not address how to represent OpenCL memory spaces in other back-ends, such as X86, which uses LLVM address spaces for different purposes.

I see this solution as more of a short-term hack to solve the pointer aliasing issue without actually addressing the larger issues.

2B. Emit OpenCL/CUDA-specific Metadata or Attributes

Instead of using LLVM address spaces to represent OpenCL/CUDA memory spaces, language-specific annotations can be provided on types. This can take the form of metadata, or additional LLVM IR attributes on types and parameters, such as:

; ModuleID = 'test1.cl'
target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ocl_kernel void @foo(float* nocapture ocl_global %a, float* nocapture ocl_local %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4
  store float %0, float* %b, align 4
  ret void
}

Instead of extending the LLVM IR language, this information could also be encoded as metadata by either (1) emitting some global metadata that binds useful properties to globals and parameters, or (2) extending LLVM IR to allow attributes on parameters and globals.

Optimization passes can make use of these additional attributes to derive useful properties, such as %a cannot alias %b. Then, back-ends can use these attributes to emit proper code sequences based on the pointer attributes.
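As a rough sketch of the metadata variant (option (1) above), and using an entirely hypothetical metadata name and encoding, the memory space of each pointer argument could be recorded in a named metadata node that refers back to the kernel, leaving the pointer types themselves untouched:

define void @foo(float* nocapture %a, float* nocapture %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4
  store float %0, float* %b, align 4
  ret void
}

; hypothetical encoding: one node per kernel holding the function, then one
; memory-space name per pointer parameter, in declaration order
!opencl.kernel.memspaces = !{!0}
!0 = metadata !{void (float*, float*)* @foo, metadata !"__global", metadata !"__local"}

An alias-analysis or lowering pass would then look up a pointer argument's memory space by its position in this node rather than by its LLVM address space.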

Pros:

If done right, would solve the general problem

Cons:

Large implementation commitment; could potentially touch many parts of LLVM.

Any comments?

Hi Justin,

Thanks for bringing this up, I think it's important to discuss
these issues here.

It is becoming increasingly clear to me that LLVM address spaces are not the
general solution to OpenCL/CUDA memory spaces. They are a convenient hack to
get things working in the short term, but I think a more long-term approach
should be discussed and decided upon now before the OpenCL and CUDA
implementations in Clang/LLVM get too mature. To be clear, I am not
advocating that *targets* change to a different method for representing
device memory spaces. The current use of address spaces to represent
different types of device memory is perfectly valid, IMHO. However, this
knowledge should not be encoded in front-ends and pre-SelectionDAG
optimization passes.

I disagree. The targets should expose all the address spaces they
provide, and the frontend should know about the various address spaces
it needs to know about. It is incumbent on the frontend to deliver
a valid IR for a particular language implementation, and part of
that involves knowing about the ABI requirements for the language
implementation (which may involve using specific address spaces)
and the capabilities of each target (including the capabilities of
the target's address spaces), together with the language semantics.
It is not the job of the optimisers or backend to know the semantics
for a specific language, a specific implementation of that language
or a specific ABI.

*2. Solutions*

A couple of solutions to this problem are presented here, with the hope that
the Clang/LLVM community will offer a constructive discussion on how best to
proceed with OpenCL/CUDA support in Clang/LLVM. The following list is in no
way meant to be exhaustive; it merely serves as a starting basis for
discussion.

*2A. Extend TBAA*

In theory, the type-based alias analysis pass could be extended to
(properly) support aliasing queries for pointers in OpenCL kernels.
Currently, it has no way of knowing if two pointers in different address
spaces can alias, and in fact cannot know if this is the case given the
definition of LLVM address spaces. Instead of programming it with
target-specific knowledge, it can be extended with language-specific
knowledge. Instead of considering address spaces, the Clang portion of TBAA
can be programmed to use OpenCL attributes to extend its pointer metadata.
Specifically, pointers to different memory spaces are in essence different
types and cannot alias. For the kernel shown above, the resulting LLVM IR
could be:

; ModuleID = 'test1.cl'
target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ptx_kernel void @foo(float* nocapture %a, float addrspace(4)*
nocapture %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4, !tbaa !1
  store float %0, float addrspace(4)* %b, align 4, !tbaa !2
  ret void
}

!opencl.kernels = !{!0}

!0 = metadata !{void (float*, float addrspace(4)*)* @foo}
!1 = metadata !{metadata !"float$__global", metadata !3}
!2 = metadata !{metadata !"float$__local", metadata !3}
!3 = metadata !{metadata !"omnipotent char", metadata !4}
!4 = metadata !{metadata !"Simple C/C++ TBAA", null}

The differences from the previous IR are the !tbaa tags (!1 and !2). Here, the TBAA pass would be able to identify that
the loads and stores do not alias. Of course, when compiling in
non-OpenCL/CUDA mode, TBAA would work just as before.

I have to say that I much prefer the TBAA solution, as it encodes the
language semantics using the existing metadata for language semantics.

*Pros:*

Relatively easy to implement

*Cons:*

Does not solve the full problem, such as how to represent OpenCL memory
spaces in other backends, such as X86 which uses LLVM address spaces for
different purposes.

This presupposes that we need a way of representing OpenCL address
spaces in IR targeting X86 (and targets which lack GPU-like address
spaces). As far as I can tell, the only real representations of
OpenCL address spaces on such targets that we need are a way of
distinguishing the different address spaces for alias analysis
and a representation for __local variables allocated on the stack.
TBAA metadata would solve the first problem, and we already have
mechanisms in the frontend that could be used to solve the second.

I see this solution as more of a short-term hack to solve the pointer
aliasing issue without actually addressing the larger issues.

I remain to be persuaded that there are any "larger issues" to solve.

*2B. Emit OpenCL/CUDA-specific Metadata or Attributes*

Instead of using LLVM address spaces to represent OpenCL/CUDA memory spaces,
language-specific annotations can be provided on types. This can take the
form of metadata, or additional LLVM IR attributes on types and parameters,
such as:

; ModuleID = 'test1.cl'
target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ocl_kernel void @foo(float* nocapture ocl_global %a, float*
nocapture ocl_local %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4
  store float %0, float* %b, align 4
  ret void
}

Instead of extending the LLVM IR language, this information could also be
encoded as metadata by either (1) emitting some global metadata that binds
useful properties to globals and parameters, or (2) extending LLVM IR to
allow attributes on parameters and globals.

Optimization passes can make use of these additional attributes to derive
useful properties, such as %a cannot alias %b. Then, back-ends can use these
attributes to emit proper code sequences based on the pointer attributes.

*Pros:*
If done right, would solve the general problem

*Cons:*
Large implementation commitment; could potentially touch many parts of LLVM.

You are being vague about what is required here. A complete solution
following 2B would involve allowing these attributes on all pointer
types. It would be very expensive to allow custom attributes or
metadata on pointer types, since they are used frequently in the IR,
and the common case is not to have attributes or metadata. Also,
depending on how this is implemented, this would encode far too much
language specific information in the IR.

Thanks,

Justin,
Out of these options, I would take the metadata approach for AA support.

This doesn't solve the problem of different frontend/backends choosing different
address space representations for the same language, but is the correct
approach for providing extra information to the optimizations.

The issue about memory spaces in general is a little different. For example, based on
the code you posted below, address space 0 (the default) is global in CUDA, but
in OpenCL, the default address space is private. So, how does the ptx backend
handle the differences? I think this is problematic as address spaces
are language constructs and hardcoded at the frontend, but the backend needs to be
able to interpret them differently based on the source language.

One way this could be done is to have the backends have options, but then
each backend would need to implement this. I think a better approach is
to have some way to represent address spaces generically in the module.

Micah

Hi Justin,

Thanks for bringing this up, I think it’s important to discuss
these issues here.

It is becoming increasingly clear to me that LLVM address spaces are not the
general solution to OpenCL/CUDA memory spaces. They are a convenient hack to
get things working in the short term, but I think a more long-term approach
should be discussed and decided upon now before the OpenCL and CUDA
implementations in Clang/LLVM get too mature. To be clear, I am not

advocating that targets change to a different method for representing

device memory spaces. The current use of address spaces to represent
different types of device memory is perfectly valid, IMHO. However, this
knowledge should not be encoded in front-ends and pre-SelectionDAG
optimization passes.

I disagree. The targets should expose all the address spaces they
provide, and the frontend should know about the various address spaces
it needs to know about. It is incumbent on the frontend to deliver
a valid IR for a particular language implementation, and part of
that involves knowing about the ABI requirements for the language
implementation (which may involve using specific address spaces)
and the capabilities of each target (including the capabilities of
the target’s address spaces), together with the language semantics.
It is not the job of the optimisers or backend to know the semantics
for a specific language, a specific implementation of that language
or a specific ABI.

But this assumes that there is a valid 1-to-1 mapping between OpenCL memory spaces and a target’s address spaces. What happens for a target such as x86? Do we introduce pseudo address spaces into the back-end just to satisfy the front-end OpenCL requirements?

2. Solutions

A couple of solutions to this problem are presented here, with the hope that
the Clang/LLVM community will offer a constructive discussion on how best to
proceed with OpenCL/CUDA support in Clang/LLVM. The following list is in no
way meant to be exhaustive; it merely serves as a starting basis for
discussion.

2A. Extend TBAA

In theory, the type-based alias analysis pass could be extended to
(properly) support aliasing queries for pointers in OpenCL kernels.
Currently, it has no way of knowing if two pointers in different address
spaces can alias, and in fact cannot know if this is the case given the
definition of LLVM address spaces. Instead of programming it with
target-specific knowledge, it can be extended with language-specific
knowledge. Instead of considering address spaces, the Clang portion of TBAA
can be programmed to use OpenCL attributes to extend its pointer metadata.
Specifically, pointers to different memory spaces are in essence different
types and cannot alias. For the kernel shown above, the resulting LLVM IR
could be:

; ModuleID = 'test1.cl'
target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ptx_kernel void @foo(float* nocapture %a, float addrspace(4)*
nocapture %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4, !tbaa !1
  store float %0, float addrspace(4)* %b, align 4, !tbaa !2
  ret void
}

!opencl.kernels = !{!0}

!0 = metadata !{void (float*, float addrspace(4)*)* @foo}
!1 = metadata !{metadata !"float$__global", metadata !3}
!2 = metadata !{metadata !"float$__local", metadata !3}
!3 = metadata !{metadata !"omnipotent char", metadata !4}
!4 = metadata !{metadata !"Simple C/C++ TBAA", null}

The differences from the previous IR are the !tbaa tags (!1 and !2). Here, the TBAA pass would be able to identify that
the loads and stores do not alias. Of course, when compiling in
non-OpenCL/CUDA mode, TBAA would work just as before.

I have to say that I much prefer the TBAA solution, as it encodes the
language semantics using the existing metadata for language semantics.

It’s certainly the easiest to implement and would have the least impact (practically zero) on existing passes.

Pros:

Relatively easy to implement

Cons:

Does not solve the full problem, such as how to represent OpenCL memory
spaces in other backends, such as X86 which uses LLVM address spaces for
different purposes.

This presupposes that we need a way of representing OpenCL address
spaces in IR targeting X86 (and targets which lack GPU-like address
spaces). As far as I can tell, the only real representations of
OpenCL address spaces on such targets that we need are a way of
distinguishing the different address spaces for alias analysis
and a representation for __local variables allocated on the stack.
TBAA metadata would solve the first problem, and we already have
mechanisms in the frontend that could be used to solve the second.

Which mechanisms could be used to differentiate between thread-private and __local data?

I see this solution as more of a short-term hack to solve the pointer
aliasing issue without actually addressing the larger issues.

I remain to be persuaded that there are any “larger issues” to solve.

2B. Emit OpenCL/CUDA-specific Metadata or Attributes

Instead of using LLVM address spaces to represent OpenCL/CUDA memory spaces,
language-specific annotations can be provided on types. This can take the
form of metadata, or additional LLVM IR attributes on types and parameters,
such as:

; ModuleID = 'test1.cl'
target datalayout = "e-p:32:32-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx32--"

define ocl_kernel void @foo(float* nocapture ocl_global %a, float*
nocapture ocl_local %b) nounwind noinline {
entry:
  %0 = load float* %a, align 4
  store float %0, float* %b, align 4
  ret void
}

Instead of extending the LLVM IR language, this information could also be
encoded as metadata by either (1) emitting some global metadata that binds
useful properties to globals and parameters, or (2) extending LLVM IR to
allow attributes on parameters and globals.

Optimization passes can make use of these additional attributes to derive
useful properties, such as %a cannot alias %b. Then, back-ends can use these
attributes to emit proper code sequences based on the pointer attributes.

Pros:

If done right, would solve the general problem

Cons:

Large implementation commitment; could potentially touch many parts of LLVM.

You are being vague about what is required here. A complete solution
following 2B would involve allowing these attributes on all pointer
types. It would be very expensive to allow custom attributes or
metadata on pointer types, since they are used frequently in the IR,
and the common case is not to have attributes or metadata. Also,
depending on how this is implemented, this would encode far too much
language specific information in the IR.

I agree that this would be expensive, and I’m not necessarily advocating it. If the consensus is that TBAA extensions are sufficient for all cases, then I’m fine with that. It’s much less work. :-)

I just want to make sure we’re covering all of our bases before we proceed too far with this.

Address space 0 (i.e. the default address space) should always be the
address space on which the stack resides. This is a requirement for
alloca to work correctly. So for PTX, I think that address space 0
should be the local state space (but I noticed that at the moment it
is the global state space, which seems wrong IMHO).

As I mentioned in my previous email, I don't think that the backend
should interpret address spaces for the source language, as this
places too much language-specific functionality in the backend.

The situation regarding default address spaces in CUDA is more
complex, but suffice it to say that there is usually no such thing
as a "default" address space in CUDA, because the language does not
contain support for address space qualified pointer types (only address
space qualified declarations). NVIDIA's CUDA compiler, nvopencc,
determines the correct address space for each pointer using type
inference (there is an explanation of nvopencc's algorithm in the
src/doc/ssa_memory_space.txt file in the nvopencc distribution).
Our compiler should eventually contain a similar algorithm.

Thanks,

Justin,
Out of these options, I would take the metadata approach for AA support.

This doesn’t solve the problem of different frontend/backends choosing different
address space representations for the same language, but is the correct
approach for providing extra information to the optimizations.

The issue about memory spaces in general is a little different. For example, based on
the code you posted below, address space 0 (the default) is global in CUDA, but
in OpenCL, the default address space is private. So, how does the ptx backend
handle the differences? I think this is problematic as address spaces
are language constructs and hardcoded at the frontend, but the backend needs to be
able to interpret them differently based on the source language.

One way this could be done is to have the backends have options, but then
each backend would need to implement this. I think a better approach is
to have some way to represent address spaces generically in the module.

That’s sort of where I was trying to go with this. I’m thinking of some sort of annotation like address spaces, but with semantic properties associated with them instead of leaving the definitions solely up to the target. Then again, this may be too high-level for LLVM IR, which is target dependent to begin with.
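A purely hypothetical sketch of that idea: the module could declare its memory spaces and their semantic properties up front, so that optimization passes read those properties from the module instead of relying on target-specific knowledge (the metadata name and encoding below are invented for illustration only):

; each entry: language-level space name, the address-space number the target
; chose for it, and a flag stating whether the space is disjoint from the others
!llvm.memory.spaces = !{!10, !11, !12}
!10 = metadata !{metadata !"__global", i32 1, i1 true}
!11 = metadata !{metadata !"__local", i32 4, i1 true}
!12 = metadata !{metadata !"__private", i32 0, i1 true}

With something along these lines, alias analysis could derive disjointness from the module itself, while each back-end remains free to pick the numeric address spaces.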

Justin,
Out of these options, I would take the metadata approach for AA support.

This doesn’t solve the problem of different frontend/backends choosing different
address space representations for the same language, but is the correct
approach for providing extra information to the optimizations.

The issue about memory spaces in general is a little different. For example, based on
the code you posted below, address space 0 (the default) is global in CUDA, but
in OpenCL, the default address space is private. So, how does the ptx backend
handle the differences? I think this is problematic as address spaces
are language constructs and hardcoded at the frontend, but the backend needs to be
able to interpret them differently based on the source language.

One way this could be done is to have the backends have options, but then
each backend would need to implement this. I think a better approach is
to have some way to represent address spaces generically in the module.

Address space 0 (i.e. the default address space) should always be the
address space on which the stack resides. This is a requirement for
alloca to work correctly. So for PTX, I think that address space 0
should be the local state space (but I noticed that at the moment it
is the global state space, which seems wrong IMHO).

This is a bit hacky in the back-end at the moment. When I started working with the back-end, address space 0 was already defined as global, and I have not broken that convention yet.

Then again, the issue is not really that big of a deal, since we need to specially handle all “stack” accesses anyway. It doesn’t really matter much what address space is used.

Hi,

Tanya and I also prefer the extended TBAA solution, as it naturally fits with LLVM. From my understanding of TBAA, it seems to provide the power to describe the relationship between address spaces for alias analysis, i.e., it can describe whether two address spaces are disjoint or whether one may nest within another. For OpenCL, it is most useful to indicate that address spaces are disjoint from the point of view of alias analysis, even though the underlying memory may be the same, as on x86. The question is: is there something missing in TBAA that prevents it from properly describing the semantics we want for an address space?

– Mon Ping

Hi,

Tanya and I also prefer the extended TBAA solution, as it naturally fits with LLVM. From my understanding of TBAA, it seems to provide the power to describe the relationship between address spaces for alias analysis, i.e., it can describe whether two address spaces are disjoint or whether one may nest within another. For OpenCL, it is most useful to indicate that address spaces are disjoint from the point of view of alias analysis, even though the underlying memory may be the same, as on x86. The question is: is there something missing in TBAA that prevents it from properly describing the semantics we want for an address space?

From what I can tell, extending TBAA is perfectly fine for the alias problem. I really just want to make sure we’re providing enough hooks in the front-end and IR so that any back-end can be used for OpenCL code gen.

What kind of special handling would be required? And how can you
always tell whether or not an access through address space 0 would
be a stack access? For example, consider the attached .ll file,
which compiles to a global store here.

Thanks,

localglobal.ll (514 Bytes)

> Hi Justin,
>
> Thanks for bringing this up, I think it's important to discuss
> these issues here.
>
> > It is becoming increasingly clear to me that LLVM address spaces are not
> the
> > general solution to OpenCL/CUDA memory spaces. They are a convenient hack
> to
> > get things working in the short term, but I think a more long-term
> approach
> > should be discussed and decided upon now before the OpenCL and CUDA
> > implementations in Clang/LLVM get too mature. To be clear, I am not
> > advocating that *targets* change to a different method for representing
> > device memory spaces. The current use of address spaces to represent
> > different types of device memory is perfectly valid, IMHO. However, this
> > knowledge should not be encoded in front-ends and pre-SelectionDAG
> > optimization passes.
>
> I disagree. The targets should expose all the address spaces they
> provide, and the frontend should know about the various address spaces
> it needs to know about. It is incumbent on the frontend to deliver
> a valid IR for a particular language implementation, and part of
> that involves knowing about the ABI requirements for the language
> implementation (which may involve using specific address spaces)
> and the capabilities of each target (including the capabilities of
> the target's address spaces), together with the language semantics.
> It is not the job of the optimisers or backend to know the semantics
> for a specific language, a specific implementation of that language
> or a specific ABI.
>

But this is assuming that a target's address spaces have a valid 1 to 1
mapping between OpenCL memory spaces and back-end address spaces. What
happens for a target such as x86? Do we introduce pseudo address spaces
into the back-end just to satisfy the front-end OpenCL requirements?

I don't see how anything I wrote implies that. For x86, there would
presumably be a many-to-one mapping.

> This presupposes that we need a way of representing OpenCL address
> spaces in IR targeting X86 (and targets which lack GPU-like address
> spaces). As far as I can tell, the only real representations of
> OpenCL address spaces on such targets that we need are a way of
> distinguishing the different address spaces for alias analysis
> and a representation for __local variables allocated on the stack.
> TBAA metadata would solve the first problem, and we already have
> mechanisms in the frontend that could be used to solve the second.
>

Which mechanisms could be used to differentiate between thread-private and
__local data?

In OpenCL C, it is illegal to declare a variable with static storage
duration in the __private address space (section 6.5: "All program
scope variables must be declared in the __constant address space.";
section 6.8g: "The extern, static, auto and register storage-class
specifiers are not supported."). This implies that there is no way
for pointers to the __private address space to be usefully shared
between work-items without invoking undefined behaviour, so the
question is moot (i.e. __private does not need to be implemented using
thread-local storage).

It is possible to write OpenCL C code which shares pointers to
__private memory using barrier synchronisation, but since there is no
way to queue a memory fence across __private memory (only __local and
__global), any access to that memory would invoke undefined behaviour.
For example, consider the following (2 work-items in a work-group):

__kernel void foo() {
  int x = 0;
  int *__local p;
  if (get_local_id(0) == 0) p = &x;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (get_local_id(0) == 1) *p = 1;
  barrier(CLK_LOCAL_MEM_FENCE);
  // what is the value of x in work-item 0 here?
}
  
The value of x at the comment is undefined, because no fence across
__private memory was queued.

Perhaps more straightforwardly, referring to the following passage
in section 3.3 ("memory model") of the OpenCL specification:

"Private Memory: A region of memory private to a work-item. Variables
defined in one work-item's private memory are not visible to another
work-item."

We can interpret the term "not visible" here as meaning that accesses
across work-items invoke undefined behaviour, so in the example above,
the write to x via p would itself be undefined.

Thanks,

From: Peter Collingbourne [mailto:peter@pcc.me.uk]
Sent: Friday, October 14, 2011 9:55 AM
To: Justin Holewinski
Cc: Villmow, Micah; LLVM Developers Mailing List
Subject: Re: [LLVMdev] [cfe-dev] RFC: Representation of OpenCL Memory
Spaces

>
> > > Justin,
> > > Out of these options, I would take the metadata approach for AA
support.
> > >
> > > This doesn't solve the problem of different frontend/backends
> > > choosing
> > different
> > > address space representations for the same language, but is the
> > > correct approach for providing extra information to the
optimizations.
> > >
> > > The issue about memory spaces in general is a little different.
> > > For
> > example, based on
> > > the code you posted below, address space 0(default) is global in
> > > CUDA,
> > but
> > > in OpenCL, the default address space is private. So, how does the
> > > ptx
> > backend
> > > handle the differences? I think this is problematic as address
> > > spaces are language constructs and hardcoded at the frontend, but
> > > the backend
> > needs to be
> > > able to interpret them differently based on the source language.
> > >
> > > One way this could be done is to have the backends have options,
> > > but then each backend would need to implement this. I think a
> > > better approach is to have some way to represent address spaces
generically in the module.
> >
> > Address space 0 (i.e. the default address space) should always be
> > the address space on which the stack resides. This is a
requirement
> > for alloca to work correctly. So for PTX, I think that address
> > space 0 should be the local state space (but I noticed that at the
> > moment it is the global state space, which seems wrong IMHO).
> >
>
> This is a bit hacky in the back-end at the moment. When I started
> working with the back-end, address space 0 was already defined as
> global, and I have not broken that convention yet.
>
> Then again, the issue is not really that big of a deal, since we need
> to specially handle all "stack" accesses anyway. It doesn't really
> matter much what address space is used.

What kind of special handling would be required? And how can you
always tell whether or not an access through address space 0 would be a
stack access? For example, consider the attached .ll file, which
compiles to a global store here.

[Villmow, Micah] If this was generated from OpenCL, then it is an invalid program as the default address space is private to the thread and you cannot have global variables in the private address space.

Indeed, but it is (at present) valid LLVM IR for PTX. The .ll file
illustrates the issue with having address space 0 map to the global
state space, as it does in the current PTX backend.

Thanks,

Justin,
Out of these options, I would take the metadata approach for AA support.

This doesn’t solve the problem of different frontend/backends choosing
different
address space representations for the same language, but is the correct
approach for providing extra information to the optimizations.

The issue about memory spaces in general is a little different. For
example, based on
the code you posted below, address space 0(default) is global in CUDA,
but
in OpenCL, the default address space is private. So, how does the ptx
backend
handle the differences? I think this is problematic as address spaces
are language constructs and hardcoded at the frontend, but the backend
needs to be
able to interpret them differently based on the source language.

One way this could be done is to have the backends have options, but then
each backend would need to implement this. I think a better approach is
to have some way to represent address spaces generically in the module.

Address space 0 (i.e. the default address space) should always be the
address space on which the stack resides. This is a requirement for
alloca to work correctly. So for PTX, I think that address space 0
should be the local state space (but I noticed that at the moment it
is the global state space, which seems wrong IMHO).

This is a bit hacky in the back-end at the moment. When I started working
with the back-end, address space 0 was already defined as global, and I have
not broken that convention yet.

Then again, the issue is not really that big of a deal, since we need to
specially handle all “stack” accesses anyway. It doesn’t really matter much
what address space is used.

What kind of special handling would be required? And how can you
always tell whether or not an access through address space 0 would
be a stack access? For example, consider the attached .ll file,
which compiles to a global store here.

Yes, this is currently an issue with the back-end. The handling of stack space is definitely a hack at the moment, but I have not had the time to address it since it currently works in the typical use case.

Hi Justin,

Thanks for bringing this up, I think it’s important to discuss
these issues here.

It is becoming increasingly clear to me that LLVM address spaces are not
the
general solution to OpenCL/CUDA memory spaces. They are a convenient hack
to
get things working in the short term, but I think a more long-term
approach
should be discussed and decided upon now before the OpenCL and CUDA
implementations in Clang/LLVM get too mature. To be clear, I am not
advocating that targets change to a different method for representing
device memory spaces. The current use of address spaces to represent
different types of device memory is perfectly valid, IMHO. However, this
knowledge should not be encoded in front-ends and pre-SelectionDAG
optimization passes.

I disagree. The targets should expose all the address spaces they
provide, and the frontend should know about the various address spaces
it needs to know about. It is incumbent on the frontend to deliver
a valid IR for a particular language implementation, and part of
that involves knowing about the ABI requirements for the language
implementation (which may involve using specific address spaces)
and the capabilities of each target (including the capabilities of
the target’s address spaces), together with the language semantics.
It is not the job of the optimisers or backend to know the semantics
for a specific language, a specific implementation of that language
or a specific ABI.

But this is assuming that a target’s address spaces have a valid 1 to 1
mapping between OpenCL memory spaces and back-end address spaces. What
happens for a target such as x86? Do we introduce pseudo address spaces
into the back-end just to satisfy the front-end OpenCL requirements?

I don’t see how anything I wrote implies that. For x86, there would
presumably be a many-to-one mapping.

This presupposes that we need a way of representing OpenCL address
spaces in IR targeting X86 (and targets which lack GPU-like address
spaces). As far as I can tell, the only real representations of
OpenCL address spaces on such targets that we need are a way of
distinguishing the different address spaces for alias analysis
and a representation for __local variables allocated on the stack.
TBAA metadata would solve the first problem, and we already have
mechanisms in the frontend that could be used to solve the second.

Which mechanisms could be used to differentiate between thread-private and
__local data?

In OpenCL C, it is illegal to declare a variable with static storage
duration in the __private address space (section 6.5: “All program
scope variables must be declared in the __constant address space.”;
section 6.8g: “The extern, static, auto and register storage-class
specifiers are not supported.”). This implies that there is no way
for pointers to the __private address space to be usefully shared
between work-items without invoking undefined behaviour, so the
question is moot (i.e. __private does not need to be implemented using
thread-local storage).

It is possible to write OpenCL C code which shares pointers to
__private memory using barrier synchronisation, but since there is no
way to queue a memory fence across __private memory (only __local and
__global), any access to that memory would invoke undefined behaviour.
For example, consider the following (2 work-items in a work-group):

__kernel void foo() {
  int x = 0;
  int *__local p;
  if (get_local_id(0) == 0) p = &x;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (get_local_id(0) == 1) *p = 1;
  barrier(CLK_LOCAL_MEM_FENCE);
  // what is the value of x in work-item 0 here?
}

The value of x at the comment is undefined, because no fence across
__private memory was queued.

Perhaps more straightforwardly, referring to the following passage
in section 3.3 (“memory model”) of the OpenCL specification:

“Private Memory: A region of memory private to a work-item. Variables
defined in one work-item’s private memory are not visible to another
work-item.”

We can interpret the term “not visible” here as meaning that accesses
across work-items invoke undefined behaviour, so in the example above,
the write to x via p would itself be undefined.

I was referring more to the front-end aspects here. Let’s say we have:

__kernel void foo() {
  float privateBuffer[8];
  __local float localBuffer[8];
}

What mechanisms, other than address spaces, can we use to tell the X86 back-end that privateBuffer is private to the thread, and localBuffer is shared among all threads in a work-group?

There is no need to tell the x86 backend that privateBuffer is
private to the thread. For the reasons I explained, there is no
way for work-items to usefully get pointers to other work-items'
privateBuffer objects, so as long as privateBuffer is allocated
as an automatic variable (i.e. on the stack), there is no other
special treatment required.

As for localBuffer, the IR generator would emit accesses to __local
variables in an implementation-specific way, and the IR generator
already contains a mechanism for doing so. In this mailing list
post I explained in more detail the CGOpenCLRuntime class that is
used to do this:

http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20110815/045187.html

The "hidden pointer argument" technique is the one most suited to x86,
but this has not actually been implemented.
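For illustration only, and assuming a completely made-up lowering (nothing like this is implemented), the foo() example above might end up looking roughly like this on a CPU target, with the runtime allocating one __local block per work-group and passing it to the kernel as a hidden pointer argument:

; source: __kernel void foo() { float privateBuffer[8]; __local float localBuffer[8]; }
define void @foo(float* %local.block) nounwind {
entry:
  ; privateBuffer stays an ordinary stack allocation
  %privateBuffer = alloca [8 x float], align 4
  ; localBuffer is carved out of the hidden per-work-group __local block
  %localBuffer = bitcast float* %local.block to [8 x float]*
  ret void
}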

Thanks,