My comments are below.
*From:* email@example.com [mailto:mats.o.petersson@googlemail.com] *On Behalf Of* mats petersson
*Sent:* Friday, January 12, 2018 8:32 AM
*To:* Liu, Yaxun (Sam) <Yaxun.Liu@amd.com>
*Cc:* Anastasia Stulova <Anastasia.Stulova@arm.com>; Sumner, Brian <Brian.Sumner@amd.com>; cfe-dev (firstname.lastname@example.org) <email@example.com>; Bader, Alexey (firstname.lastname@example.org) <email@example.com>; nd <firstname.lastname@example.org>
*Subject:* Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel
The workgroup size is usually 64 or 128, and the number of workgroups can be quite large if the global size is large. If we waste 124 bytes for each local array, the total waste could be quite large, considering that local memory is a precious resource on a GPU.
It is, if you have a local argument that is just 4 bytes. But is that really a typical practical use case? Doing a local memory allocation in the first place just to store 4 bytes seems a bit excessive.
[Sam] The waste of memory can happen with an int array of any size, since it only needs to be aligned at 4 bytes. Aligning it to 128 bytes can waste up to 124 bytes.
Clearly not; in this case, the local argument is 40 bytes, and thus the wastage is at most 88 bytes (128 - 40). And I'm not arguing that this is not waste; I'm trying to understand what the use case is where the user uses local memory in such a small amount per workgroup.
Also, a simplification would be to do something like this:
alignment = min(round_to_nearest_power_of_2(size), max_alignment)
[Sam] We cannot predict how users will use local memory. In certain cases the above approach still wastes considerable local memory. I think it is better to allow users to fully utilize their local memory, considering that the implementation effort is moderate.
With the above suggestion, the implementation cost is nearly zero, and you CAN assume that the user will not access outside the range of the actually allocated space [that would be UB]. For the 40-byte int array above, the "loss" would be 24 bytes, because the rounding up would be to 64 bytes; the worst possible case for small buffers is an array just over a power of two, e.g. 68 bytes rounded up to 128, which would waste 60 bytes. For large buffers, the worst case can of course still be 124 bytes.
You could potentially do something like this (I have not validated it, and it still needs clamping to 128 or something, of course):
rounded_size = round_to_nearest_smaller_power_2(size);
if (size != rounded_size)
    alignment = size % rounded_size;
else
    alignment = rounded_size;
This will give you an alignment of 8 for the 40-byte array (40 % 32), and 4 for a 68-byte one (68 % 64). This does of course assume that someone doesn't try to load 16 of the ints in a vector instruction that requires an alignment of 64, and then load one element on its own. That wouldn't work well, but it would only matter if the user call supplied the alignment, which I don't think is the case.
Of course, if you have a bunch of different local arguments of varying sizes, this will still potentially lead to wasted space, but less so; for example, several int arrays of different sizes would lead to several gaps of varying sizes. If this is what is expected (and I don't really know what use cases out there combine local memory with device-side enqueue), then I would say it may be worth doing this.
Have you investigated some workloads with regard to how much space you gain from "the tightest possible packing", compared to my solution above, the one-line solution, and "round everything to 128"? Without revealing what the workloads are, perhaps you could show something like:
Kernel A: 12, 36, 18, 128 bytes
Kernel B: 116, 236, 240, 256 bytes
Kernel C: ...
[I just made those numbers up, and I don't really expect the numbers to
make any sense compared to real applications and numbers]
Not quite a single line, but still trivial compared to passing and handling an array of extra arguments, which requires modifying several different files, adding new test cases, etc. [Although you may want to add some test cases for this implementation, of course.]