[PATCH 1/3] Implement wait_group_events builtin v2

This is a simple default implemetation which just calls barrier().

v2:
  - Only call barrier() once.

This is a simple implementation which just copies data synchronously.

v2:
  - Use size_t.

This is a simple implementation which just copies data synchronously.

v2:
  - Use size_t.

v3:
  - Fix possible race condition by splitting the copy among multiple
    work items.

LGTM.

Jeroen

LGTM

Jeroen

This looks good, but I think it suffices to use a barrier with an empty set of flags, as the function only requires the copy to have completed. Not that every memory operation arising from the copy has been committed, as I wrote before.

The above is provided that a barrier call is not removed completely in case of an empty set of flags, which I recall is currently the case for R600. I think there was a patch for that at some point; not sure what happened to that patch though.

Jeroen

This looks good, but I think it suffices to use a barrier with an empty
set of flags, as the function only requires the copy to have completed.
Not that every memory operation arising from the copy has been
committed, as I wrote before.

The above is provided that a barrier call is not removed completely in
case of an empty set of flags, which I recall is currently the case for
R600. I think there was a patch for that at some point; not sure what
happened to that patch though.

if you are referring to [0] and the discussion that stemmed from it, it
never got reviewed.

I agree with your conclusion that we need to extend fence with AS
argument to get this done properly. But it looked like too big of a task
at that time, so I did not push it further.

I still plan to return to it (unless someone beats me to it), but right
now there are things higher on my TODO list (Mostly to familiarize
myself with the memory code) before getting back to fences.

sorry that I left that discussion in limbo,
jan

[0] http://www.pcc.me.uk/pipermail/libclc-dev/2014-April/000284.html

This looks good, but I think it suffices to use a barrier with an empty set of flags, as the function only requires the copy to have completed. Not that every memory operation arising from the copy has been committed, as I wrote before.

I don't quite understand the difference here, can you give me an
example?

This looks good, but I think it suffices to use a barrier with an empty
set of flags, as the function only requires the copy to have completed.
Not that every memory operation arising from the copy has been
committed, as I wrote before.

The above is provided that a barrier call is not removed completely in
case of an empty set of flags, which I recall is currently the case for
R600. I think there was a patch for that at some point; not sure what
happened to that patch though.

if you are referring to [0] and the discussion that stemmed from it, it
never got reviewed.

I was referring to the more recent:

  http://www.pcc.me.uk/pipermail/libclc-dev/2014-August/000586.html

Although the also had some issues.

Jeroen

This looks good, but I think it suffices to use a barrier with an empty set of flags, as the function only requires the copy to have completed. Not that every memory operation arising from the copy has been committed, as I wrote before.

I don't quite understand the difference here, can you give me an
example?

See the top of this page:

https://www.khronos.org/message_boards/showthread.php/5875-Profiling-Code/page2

Thinking a bit on how this might impact the assembly generated for SI, I must admit I’m not quite sure. First I was thinking that this implied that I could write something like the following in pseudo-SI assembly when copying from local to global memory (for brevity I’m omitting any loop that might be needed):

  ds_read
  s_waitcnt lgkmcnt(0)
  buffer_store
  s_barrier

but you might argue that a

  s_waitcnt vm_cnt(0)

is needed before the s_barrier, because only then is the copy complete (in the sense that we can use the used VGPRs for something else). In any case barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); is definitely the safe thing to do.

Jeroen

This looks good, but I think it suffices to use a barrier with an empty
set of flags, as the function only requires the copy to have completed.
Not that every memory operation arising from the copy has been
committed, as I wrote before.

The above is provided that a barrier call is not removed completely in
case of an empty set of flags, which I recall is currently the case for
R600. I think there was a patch for that at some point; not sure what
happened to that patch though.

if you are referring to [0] and the discussion that stemmed from it, it
never got reviewed.

I was referring to the more recent:

   http://www.pcc.me.uk/pipermail/libclc-dev/2014-August/000586.html

Although the also had some issues.

Jeroen

Just to let you know, I had not much time to devote to this in the last two months because spare-time have been a really scarce resource (I moved to switzerland, finished an intership, started a new one, became a graduate student, etc: that kind of stuff...). But I don't give up, and hope that I will be able to continue my work on that because this is really important.

Regards,
Damien.

>>
>> This looks good, but I think it suffices to use a barrier with an empty set of flags, as the function only requires the copy to have completed. Not that every memory operation arising from the copy has been committed, as I wrote before.
>>
>
> I don't quite understand the difference here, can you give me an
> example?

See the top of this page:

https://www.khronos.org/message_boards/showthread.php/5875-Profiling-Code/page2

Thinking a bit on how this might impact the assembly generated for SI, I must admit I’m not quite sure. First I was thinking that this implied that I could write something like the following in pseudo-SI assembly when copying from local to global memory (for brevity I’m omitting any loop that might be needed):

  ds_read
  s_waitcnt lgkmcnt(0)
  buffer_store
  s_barrier

but you might argue that a

  s_waitcnt vm_cnt(0)

is needed before the s_barrier, because only then is the copy complete (in the sense that we can use the used VGPRs for something else). In any case barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); is definitely the safe thing to do.

Ok, thanks for the explanation. Since this is a generic implementation,
I think I'll keep the barrier() call as is, since it is the safest thing
to do. We can look at adding an optimized version for SI later.

-Tom