[PATCH 1/3] amdgcn-amdhsa: Add get_global_size() implementation

I think you should name the labels+values, besides that LGTM

LGTM. I hope the % does get optimized out?

Why keep amdgc— around?

>
>
> +_CLC_DEF size_t get_num_groups(uint dim) {
> + size_t global_size = get_global_size(dim);
> + size_t local_size = get_local_size(dim);
> + size_t num_groups = global_size / local_size;
> + if (global_size % local_size != 0) {
> + num_groups++;
> + }
> + return num_groups;
LGTM. I hope the % does get optimized out?

AMDGPU implements DIVREM for 64bit integers, so / and % are computed at
the same time. It's still rather expensive (~400 instructions).

(global_size + local_size -1) / local_size

allows elimination of REM only parts of DIVREM (although the savings
are negligible).

Jan

>
> +available_targets['amdgcn-mesa-mesa3d'] = available_targets['amdgcn--']

Why keep amdgc— around?

I want to keep it around to make it easier to debug regressions.
I'll remove it once I remove the amdgcn-- support from LLVM.

-Tom

LGTM

> >
> >
> > +_CLC_DEF size_t get_num_groups(uint dim) {
> > + size_t global_size = get_global_size(dim);
> > + size_t local_size = get_local_size(dim);
> > + size_t num_groups = global_size / local_size;
> > + if (global_size % local_size != 0) {
> > + num_groups++;
> > + }
> > + return num_groups;
> LGTM. I hope the % does get optimized out?

AMDGPU implements DIVREM for 64bit integers, so / and % are computed at
the same time. It's still rather expensive (~400 instructions).

(global_size + local_size -1) / local_size

allows elimination of REM only parts of DIVREM (although the savings
are negligible).

I would really prefer to have the runtime compute this, so I think we
can replace this with something better in the future.

-Tom

>
> >
> > >
> > >
> > >
> > > +_CLC_DEF size_t get_num_groups(uint dim) {
> > > + size_t global_size = get_global_size(dim);
> > > + size_t local_size = get_local_size(dim);
> > > + size_t num_groups = global_size / local_size;
> > > + if (global_size % local_size != 0) {
> > > + num_groups++;
> > > + }
> > > + return num_groups;
> > LGTM. I hope the % does get optimized out?
>
> AMDGPU implements DIVREM for 64bit integers, so / and % are
> computed at
> the same time. It's still rather expensive (~400 instructions).
>
> (global_size + local_size -1) / local_size
>
> allows elimination of REM only parts of DIVREM (although the
> savings
> are negligible).
>

I would really prefer to have the runtime compute this, so I think we
can replace this with something better in the future.

what's the benefit/reason? even if the computation did not involve
division it would need more than 3 registers/parameters to compute.

Jan