Loads moving across barriers

Hi,

For a long time we've been having a problem we've been working around in OpenCL where loads are moving across an intrinsic used for a barrier. Attached is the testcase, and the result of opt -S -basicaa -gvn on it.

This example is essentially this:

void foo(global float2* result, local float2* restrict data0, ...)
{
     int id = get_local_id(0);
     // ...
     data0[id] = ...;
     barrier();
     if (id < N)
     {
         float2 x = data0[idx];
         int other_index = ...;
         data0[other_index] = x;
      }
      barrier();
      result[id] = data0[id];
}

This is transformed so that the load from data0 after the second barrier never occurs, but it is necessary. The final value written is replaced with a phi, so the value isn't reloaded for the threads that don't take the id < N branch. The threads that did take the branch did write to the same index, so the load needs to occur after the barrier. This transform does not occur if noalias is removed from the %data0 parameter. The basic question ends up being if this is the intended behavior of noalias / restrict or just a bug.

Here are 2 previous threads about attempts at fixing this problem:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-June/062895.html
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-August/064594.html

Both of these I think sort of went in the wrong direction and talked specifically about the semantics of the atomic instructions (fence in particular), which isn't the real question. Is noalias supposed to mean that no other thread can also have a copy of the pointer it also modifies? My guess at what was happening is that since the parameter is noalias, the assumption is there is no possible way for the side-effecting function to modify the pointer. The second thread brings up an ambiguity in the C spec about how restrict is supposed to be interpreted in the presense of multiple threads. OpenCL still has restrict, but unless this is supposed to work, it is pretty close to useless.

Right now we are working around this with a custom alias analysis pass inserted that reports calls to the barrier intrinsics modify anything with the right address spaces. Is a new intrinsic necessary to get the right semantics for this case? Even if this worked correctly as it is now, I don't believe there is any way to truly specify the fence per address space so that other memory accesses to non-fenced ones could move across it.

Thanks for any help

after_gvn.ll (1.69 KB)

before_gvn.ll (2.04 KB)

Hi Matt,

Hi Matt,

> Both of these I think sort of went in the wrong direction and
> talked specifically about the semantics of the atomic instructions
> (fence in particular), which isn't the real question. Is noalias
> supposed to mean that no other thread can also have a copy of the
> pointer it also modifies? My guess at what was happening is that
> since the parameter is noalias, the assumption is there is no
> possible way for the side-effecting function to modify the
> pointer. The second thread brings up an ambiguity in the C spec
> about how restrict is supposed to be interpreted in the presense
> of multiple threads. OpenCL still has restrict, but unless this is
> supposed to work, it is pretty close to useless.

I checked the OpenCL specification, and it doesn’t give any clear
definition of restrict beyond implicitly importing what C99 says.
That said, I think it’s is pretty clearly undesirable behavior for
CL, even if it may (or may not) be technically permitted by the C
specification. I’d be in favor of clarifying our definition of
noalias to disallow this transformation.

So we're specifically talking about intrinsics tagged as having unmodeled side effects? Or arbitrary functions?

-Hal

Hi Matt,

Both of these I think sort of went in the wrong direction and
talked specifically about the semantics of the atomic instructions
(fence in particular), which isn't the real question. Is noalias
supposed to mean that no other thread can also have a copy of the
pointer it also modifies? My guess at what was happening is that
since the parameter is noalias, the assumption is there is no
possible way for the side-effecting function to modify the
pointer. The second thread brings up an ambiguity in the C spec
about how restrict is supposed to be interpreted in the presense
of multiple threads. OpenCL still has restrict, but unless this is
supposed to work, it is pretty close to useless.

I checked the OpenCL specification, and it doesn’t give any clear
definition of restrict beyond implicitly importing what C99 says.
That said, I think it’s is pretty clearly undesirable behavior for
CL, even if it may (or may not) be technically permitted by the C
specification. I’d be in favor of clarifying our definition of
noalias to disallow this transformation.

So we're specifically talking about intrinsics tagged as having unmodeled side effects? Or arbitrary functions?

We don't have such a tag for intrinsics to my knowledge.

-Owen

Hi Matt,

Both of these I think sort of went in the wrong direction and
talked specifically about the semantics of the atomic instructions
(fence in particular), which isn't the real question. Is noalias
supposed to mean that no other thread can also have a copy of the
pointer it also modifies? My guess at what was happening is that
since the parameter is noalias, the assumption is there is no
possible way for the side-effecting function to modify the
pointer. The second thread brings up an ambiguity in the C spec
about how restrict is supposed to be interpreted in the presense
of multiple threads. OpenCL still has restrict, but unless this is
supposed to work, it is pretty close to useless.

I checked the OpenCL specification, and it doesn’t give any clear
definition of restrict beyond implicitly importing what C99 says.
That said, I think it’s is pretty clearly undesirable behavior for
CL, even if it may (or may not) be technically permitted by the C
specification. I’d be in favor of clarifying our definition of
noalias to disallow this transformation.

So we're specifically talking about intrinsics tagged as having unmodeled side effects? Or arbitrary functions?

Arbitrary functions calls can contain barriers ( in CL) or thread joins (in C), so I think the same reasoning had to apply to them as well.

-Owen

I don’t think think outright disallowing this transform is the right solution. This would be valid for OpenCL private or constant address spaces, it’s just global or local where this would be a problem. This comes back to the questions about how to handle address space alias information which there was a long thread about a few months ago. There was debate over address spaces as a language vs. a target concept, and I don’t remember right now what the consensuses were there. The more I think about it, the more an OpenCL specific alias analysis makes sense, which could then use a men_fence intrinsic per address space.

So, I don't have a horse in the OpenCL race, and I'd like to mostly keep it
that way, but I want to point out that some of these things have much
broader implications.

I don’t think think outright disallowing this transform is the right
solution. This would be valid for OpenCL private or constant address
spaces, it’s just global or local where this would be a problem.

Unless I've misunderstood (which is a possibility, I've not invested proper
time thinking about this, which I'll do if you indicate I'm missing
something), this would be tantamount to tying certain address to specific
rules regarding potential data races in the memory model. I'm not very
comfortable with that. Perhaps you're instead trying to say that with
certain address spaces "noalias" (and by inference, "restrict" at the
language level) has a different semantic model than other address spaces?
While it's less worrisome than the first interpretation, I still don't
really like it.

This comes back to the questions about how to handle address space alias
information which there was a long thread about a few months ago. There was
debate over address spaces as a language vs. a target concept, and I don’t
remember right now what the consensuses were there.

I think if OpenCL has special aliasing properties for specific address
spaces it would make much more sense (IMO) to either have a specialized
alias analysis that implements this, or to work out a proper metadata-based
system for expressing these constraints from the frontend.

>
>> Hi Matt,
>>
>>
>>> Both of these I think sort of went in the wrong direction and
>>> talked specifically about the semantics of the atomic
>>> instructions
>>> (fence in particular), which isn't the real question. Is noalias
>>> supposed to mean that no other thread can also have a copy of the
>>> pointer it also modifies? My guess at what was happening is that
>>> since the parameter is noalias, the assumption is there is no
>>> possible way for the side-effecting function to modify the
>>> pointer. The second thread brings up an ambiguity in the C spec
>>> about how restrict is supposed to be interpreted in the presense
>>> of multiple threads. OpenCL still has restrict, but unless this
>>> is
>>> supposed to work, it is pretty close to useless.
>>
>> I checked the OpenCL specification, and it doesn’t give any clear
>> definition of restrict beyond implicitly importing what C99 says.
>> That said, I think it’s is pretty clearly undesirable behavior for
>> CL, even if it may (or may not) be technically permitted by the C
>> specification. I’d be in favor of clarifying our definition of
>> noalias to disallow this transformation.
>
> So we're specifically talking about intrinsics tagged as having
> unmodeled side effects? Or arbitrary functions?

We don't have such a tag for intrinsics to my knowledge.

For intrinsics, it is the lack of any other tag that indicates unmodeled side effects (or arbitrary writes -- an argument could be made for separating these).

-Hal

Sent from my iPhone

>
>> Hi Matt,
>>
>>
>>> Both of these I think sort of went in the wrong direction and
>>> talked specifically about the semantics of the atomic
>>> instructions
>>> (fence in particular), which isn't the real question. Is noalias
>>> supposed to mean that no other thread can also have a copy of the
>>> pointer it also modifies? My guess at what was happening is that
>>> since the parameter is noalias, the assumption is there is no
>>> possible way for the side-effecting function to modify the
>>> pointer. The second thread brings up an ambiguity in the C spec
>>> about how restrict is supposed to be interpreted in the presense
>>> of multiple threads. OpenCL still has restrict, but unless this
>>> is
>>> supposed to work, it is pretty close to useless.
>>
>> I checked the OpenCL specification, and it doesn’t give any clear
>> definition of restrict beyond implicitly importing what C99 says.
>> That said, I think it’s is pretty clearly undesirable behavior for
>> CL, even if it may (or may not) be technically permitted by the C
>> specification. I’d be in favor of clarifying our definition of
>> noalias to disallow this transformation.
>
> So we're specifically talking about intrinsics tagged as having
> unmodeled side effects? Or arbitrary functions?

Arbitrary functions calls can contain barriers ( in CL) or thread
joins (in C), so I think the same reasoning had to apply to them as
well.

Right; in general, I think this is a bad idea. I certainly would not want to apply this logic in general because it will really hurt performance. The transformation really should be safe -- at least in the context of C semantics -- the barrier should not need to synchronize data pointed to by the noalias pointer because no other thread should be accessing data though an aliasing pointer while the function in question is executing (the guarantee noalias makes to the compiler is that it can see all relevant aliasing pointers).

OpenCL is a bit of a different story, perhaps, because of its implicit data-parallel semantics. However, it is also a closed environment (as far as I know). To deal with the problem of arbitrary functions containing barriers, you could add a function attribute and then propagate it in the FunctionAttrs pass.

-Hal

Hi Matt,

Both of these I think sort of went in the wrong direction and talked specifically about the semantics of the atomic instructions (fence in particular), which isn't the real question. Is noalias supposed to mean that no other thread can also have a copy of the pointer it also modifies? My guess at what was happening is that since the parameter is noalias, the assumption is there is no possible way for the side-effecting function to modify the pointer. The second thread brings up an ambiguity in the C spec about how restrict is supposed to be interpreted in the presense of multiple threads. OpenCL still has restrict, but unless this is supposed to work, it is pretty close to useless.

I checked the OpenCL specification, and it doesn’t give any clear definition of restrict beyond implicitly importing what C99 says. That said, I think it’s is pretty clearly undesirable behavior for CL, even if it may (or may not) be technically permitted by the C specification. I’d be in favor of clarifying our definition of noalias to disallow this transformation.

—Owen

I don’t think think outright disallowing this transform is the right solution. This would be valid for OpenCL private or constant address spaces, it’s just global or local where this would be a problem. This comes back to the questions about how to handle address space alias information which there was a long thread about a few months ago. There was debate over address spaces as a language vs. a target concept, and I don’t remember right now what the consensuses were there. The more I think about it, the more an OpenCL specific alias analysis makes sense, which could then use a men_fence intrinsic per address space.

From that discussion we reached a proposal based on metadata similar to TBAA that a frontend must generate describing logical address space and their relationship (disjointness or overlap, and constantness), and like TBAA these would be attached to load/store instructions. Based on these metadata it would be straightforward to build an alias analysis that disallow aliasing between logical disjoint address space and based on disjoint physical address spaces (this would require an hook in TargetTransformInfo).

I haven't had yet the time to implement this grouping the handling of metadata for alias-analysis (TBAA would be a component of this).

-Michele

This sounds right. With the constant address space, anything you do is OK since it’s constant. Private address space is supposed to be totally inaccessible from other workitems, so parallel modifications aren’t a concern. The others require explicit synchronization which noalias would need to be aware of.

FWIW, it seems generally useful to me to have a nomemfence function attribute and intrinsic property. We should avoid memory optimization (and possibly other optimization) across these regardless of alias analysis.

-Andy

Perhaps you're instead trying to say that with certain address spaces "noalias" (and by inference, "restrict" at the language level) has a different semantic model than other address spaces? While it's less worrisome than the first interpretation, I still don't really like it.

This sounds right. With the constant address space, anything you do is OK since it’s constant. Private address space is supposed to be totally inaccessible from other workitems, so parallel modifications aren’t a concern. The others require explicit synchronization which noalias would need to be aware of.

FWIW, it seems generally useful to me to have a nomemfence function attribute and intrinsic property. We should avoid memory optimization (and possibly other optimization) across these regardless of alias analysis.

There are at least two other kinds of optimizations that I know of that are either invalid or can result in (sometimes significantly) slower code when running OpenCL-style SPMD kernels.

The first is tail duplication that takes something like:
  if (x) {
    …
  } else {
    …
  }
  barrier()
and duplicates the barrier into both sides. This can cause hangs, and perhaps other symptoms depending on exactly how this is compiled, and the particular architecture. This isn’t unique to barrier(). Any intrinsic which is effectively “horizontal”, working across work-items, can potentially result in problems and/or different behavior if duplicated or even moved. It looks like a function attribute “noduplicate" exists to block duplication, but I don’t see anything specifically built to block movement, although perhaps that isn’t happening in practice.

The second is loop unswitching:
  for (…) {
    ...
    if (x) { // x is some expression that is loop invariant
      …
    } else {
      …
    }
    ...
  }
where the resulting code has the if condition outside, and a copy of the loop (with possibly some code elided) in each of the ‘then’ and ‘else’ sides. This can result in running the loop twice, once for work-items where ‘x’ is true, and once for work-items where ‘x’ is false.

This is a fairly special case - offhand I cannot think of other transformations that would have similar effect so perhaps the answer here is just “don’t do that” if you’re compiling SPMD kernels.

Mark

I'm think I'll try implementing this. Ideally it would be parameterized over the address space, so it makes more sense for it to be a memfence attribute rather than a nomemfence. You would then have an arbitrary number of memfence(N) attributes for each required address space.

So for correctness, would we need to tag all functions with memfence(0..M) until we can prove otherwise? That seem heinous. Better to have an optional attribute that can be added to expose optimization. Is it important in practice to optimize the case of memfence(I) + nomemfence(J)? If so, is there a problem with nomemfence(N)?

-Andy

Perhaps you're instead trying to say that with certain address spaces "noalias" (and by inference, "restrict" at the language level) has a different semantic model than other address spaces? While it's less worrisome than the first interpretation, I still don't really like it.

This sounds right. With the constant address space, anything you do is OK since it’s constant. Private address space is supposed to be totally inaccessible from other workitems, so parallel modifications aren’t a concern. The others require explicit synchronization which noalias would need to be aware of.

FWIW, it seems generally useful to me to have a nomemfence function attribute and intrinsic property. We should avoid memory optimization (and possibly other optimization) across these regardless of alias analysis.

I'm think I'll try implementing this. Ideally it would be parameterized over the address space, so it makes more sense for it to be a memfence attribute rather than a nomemfence. You would then have an arbitrary number of memfence(N) attributes for each required address space.

So for correctness, would we need to tag all functions with memfence(0..M) until we can prove otherwise? That seem heinous.

I was thinking the absence of it would mean no memfence in any address space, which is the current behavior. This adds the option of fencing.

Better to have an optional attribute that can be added to expose optimization. Is it important in practice to optimize the case of memfence(I) + nomemfence(J)?

I think it would be important for the GPU case. You never need a memfence for private address space / addrspace 0, but you frequently want them for local or global. The local or global writes can't be reordered, but it could be very beneficial to move the private accesses across fences which might help reduce register usage.

  If so, is there a problem with nomemfence(N)?

nomemfence is the current assumption made on an arbitrary call, and it's the common case. Specifying the absence of a fence seems backwards of how this is used and more cumbersome to deal with. To match the current behavior, it would require littering nomemfence for any possible address space everywhere. In OpenCL you specify your fences, so it would be more straightforward to map that. If I have a memfence intrinsic, I just need to mark it with the fence attribute, and then propogate it to its callers. There would generally only be a few of them in any program compared to fenceless calls. To implement this with nomemfence, I would have to mark every function with at least 4 nomemfences, and remove them when encountering the memfence intrinsic.

Sure, but the program still needs to be correct if you skip attribute propagation.
-Andy

Is that actually a real concern? My main problem with nomemfence is how do you mark a function as not fencing any other address space you might care about around call sites? I suppose nomemfence without an address space could indicate nomemfence for any address space, but then that just restricts the problem to when you do have a few fenced address spaces. How do you know what other address spaces are relevant to be marked? Add a nomemfence for any address spaces encountered in functions with call sites? What if those in another module?

Is this a requirement for an attribute? This would be a problem for the already existing noduplicate. If a function has a call to a noduplicate function, the calling function could still be duplicated if the attribute isn’t propagated which isn’t allowed.

  • Matt

Others can weigh in here. This is just my understanding. Attribute propagation has to be optional because we can’t assume inter-procedural optimization runs for correct codegen. What if the memfence resides in a different module?

In the case of noduplicate, the only reason to propagate AFAICT would be to suppress inlining. It seems reasonable enough to expect attribute propagation to happen before inlining. So I don’t think noduplicate is an issue in practice.

I think “memfence” could be an issue if we use the attribute to summarize LLVM atomic load/store and fence instructions (in addition to OpenCL barriers).

If the semantics you are proposing won’t apply to general memory ordering constraints, then at least the name should be changed to specifically refer to OpenCL barriers.

-Andy