[RFC] Unified offloading option for CUDA/HIP/OpenMP

[AMD Public Use]

Currently CUDA/HIP and OpenMP have different offloading options, e.g.

clang++ -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx900 test.cpp

clang++ -offload-arch=gfx906 test.hip

Our users have asked for a concise way to specify offloading options for OpenMP: ideally, a single option that conveys the offloading kind, offloading triple, and offloading device arch.

On the other hand, there are some limitations of the current offloading option for CUDA/HIP:

1. It does not specify the offloading kind; instead, it relies on the file type to infer it. If the input file is not CUDA/HIP source code (e.g. bundled LLVM bitcode), there needs to be a way to specify the offloading kind.

2. It does not specify the offloading target triple; instead, it relies on the device arch to infer it. As HIP is ported to different targets, there needs to be a way to specify the offloading target triple.

In summary, a unified offloading option is preferred, which conveys offloading kind, offloading target triple and offloading device arch.

I would like to propose either adding a new option or extending the existing -offload-arch option for this, using the format kind-triple-arch, e.g.

-offload=omp-amd-gfx900

-offload=hip-amd-gfx906

Here kind and triple can be abbreviated for conciseness, e.g. omp expands to openmp and amd expands to amdgcn-amd-amdhsa. The arch can be omitted, in which case clang will use the default arch for the triple.
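
To make the proposed format concrete, here is a minimal Python sketch of how a value such as omp-amd-gfx900 could be split and expanded. It is only an illustration, not how clang does or would implement it: the function name parse_offload and the alias tables are hypothetical, only the omp and amd abbreviations come from the proposal above, and it assumes the triple is always given in abbreviated form (a full triple contains hyphens and would need a smarter split).

```python
# Hypothetical alias tables; only "omp" and "amd" are taken from the proposal.
KIND_ALIASES = {"omp": "openmp"}
TRIPLE_ALIASES = {"amd": "amdgcn-amd-amdhsa"}


def parse_offload(value):
    """Split 'kind-triple[-arch]' into (kind, triple, arch); arch may be None."""
    parts = value.split("-")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected kind-triple[-arch], got {value!r}")
    kind = KIND_ALIASES.get(parts[0], parts[0])
    triple = TRIPLE_ALIASES.get(parts[1], parts[1])
    # None means "let the driver pick the default arch for this triple".
    arch = parts[2] if len(parts) == 3 else None
    return kind, triple, arch


print(parse_offload("omp-amd-gfx900"))
print(parse_offload("hip-amd"))
```

With the arch omitted, the sketch returns None for it and leaves the choice of a default to the caller.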

Your feedback is welcome.

Thanks.

Sam

Why do we need “kind” here? We already know the kind: if -fopenmp is used, the kind is omp; if a .hip file is compiled, the kind is hip.
Adding arch does not look good either. We may need to pass some extra params. Better to have something like `-offload=target1,target2,... -offload-target1="-march xxx -opt1" -offload-target2="-march yyy -opt2 -opt3" ...`.

Best regards,
Alexey Bataev

The offload kind is intended for situations where offload kind cannot be inferred from input file type, e.g. bundled LLVM bitcode or bundled assembly file. If it is redundant to specify it multiple times, we may consider a separate -offload-kind= option. I understand OpenMP can use -fopenmp to indicate OpenMP offload kind, but HIP may need an -offload-kind=hip option since it currently does not have one.

`-offload=target1,target2,... -offload-target1="-march xxx -opt1" -offload-target2="-march yyy -opt2 -opt3"` does not work well for CUDA/HIP, since CUDA/HIP can have multiple device archs for the same target.

How about -offload=target1,arch1,opt1a,opt1b -offload=target2,arch2,opt2a,opt2b ?

Thanks.

Sam

I still don’t understand why we need kind, but -offload-kind=kind looks much better to me.
As to archs, I think the arch can be included in the triple: <arch><sub>-<vendor>-<sys>-<abi>

You can specify the subtarget as amdgcngfx906-amd-amdhsa, but you would need to add subarchs to the triple class. How about this?

https://clang.llvm.org/docs/CrossCompilation.html

When I talk about device arch, I mean the option passed by -mcpu. My understanding is that <sub> is intended for a variation of the <arch> in the triple, which covers a set of CPUs. I don’t think it is proper to treat each -mcpu option as a variation of the <arch>.

Sam

Hi Sam, thanks for driving this, I really like the idea!

Here are some thoughts:

Make the "kind" optional if it can be deduced, as Alexey noted. So
-offload=amd-gfx900
should work fine if we have an -x c and -fopenmp set. Error out if
it is ambiguous.

Allow multiple -offload occurrences.

Keep the support of the old ways for now as well.

Allow passing the kind + triple + arch as the first part of the new -offload
flag and any options as a second part, so:
-offload=hip-amd-gfx906
works but also does
-offload="amd-gfx906 -fvectorize" -x hip
as well as
-offload="amd -march=gfx906 -fno-vectorize" -fopenmp
This will make it way easier to use.
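
The "deduce the kind, error out if ambiguous" rule could look roughly like the Python sketch below. This is an assumption-laden illustration rather than actual driver logic: deduce_kind and its parameters are hypothetical names, and the only inputs considered are an explicit kind, -fopenmp, and the -x language.

```python
def deduce_kind(explicit_kind=None, fopenmp=False, lang=None):
    """Return the offload kind, or raise if it is missing or ambiguous."""
    if explicit_kind is not None:
        return explicit_kind  # e.g. the "hip" in -offload=hip-amd-gfx906
    candidates = set()
    if fopenmp:
        candidates.add("openmp")
    if lang in ("hip", "cuda"):
        candidates.add(lang)
    if len(candidates) != 1:
        # Either nothing implies a kind, or several inputs disagree.
        raise ValueError(f"cannot deduce offload kind from {sorted(candidates)}")
    return candidates.pop()


print(deduce_kind(fopenmp=True, lang="c"))  # -x c -fopenmp
print(deduce_kind(lang="hip"))              # -x hip
```

In this sketch, combining -fopenmp with -x hip and no explicit kind raises an error, which corresponds to the ambiguous case mentioned above.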

I hope some of these make some sense :)

~ Johannes

Wouldn't having flags like `-Xoffload_amd` to contextualize the next
argument better match the behavior with multiple `-arch` flags using
`-Xarch_aarch64`?

--Ben

Hm, I find my syntax more user-friendly, which is not to say we could not
accept both for consistency.

~ Johannes

Naming things consistently is hard. We need to consider that we’ll need to pass an arbitrarily complex set of options for each offload instance, whatever it may be. There may be a lot of options per individual offload instance that would differ only minimally. Having to repeat all of them will be tedious at best. We may also need to pass options further down the compilation stack. E.g. we may want different ptxas options for each CUDA target.

Perhaps we could enhance the option parser to create a notion of argument scope? The “arguments in a string” approach sort of does this already in a limited way, but it would still need CLI parser changes to handle the options consistently.
If we implement one scope level, making it hierarchical should not be that much harder.

Having such a CLI model would allow us to do things like this:
--offload=hip-gfx9* --something-common-to-all-gfx9xx-targets
--offload=hip-gfx999 --something-for-gfx999-only
--offload-end 2 // pops two levels of CLI scopes; one level if no argument is given.

The identifier could be a regex/glob match on an arbitrary string. The patterns don’t need to carry any specific parameters themselves; they just need to be meaningful enough that the parts of the code that care about a particular scope can provide their ‘scope string’ to match against.
I.e. for the example above, the HIP toolchain would set the CLI scope(s) to hip-gfx999. There would be an implicit top-level ‘--offload=.*’ which always matches, and the parser would then reparse the options, taking into account only the matching scopes. This could allow us to specify both OMP and CUDA/HIP options for the same compilation; we could conceivably benefit from OMP offload to multiple threads in the host-side compilation.
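
A rough Python sketch of the scope idea, to show how the reparse could pick options for one scope string. Everything here is an assumption for illustration: options_for is a hypothetical helper, patterns are treated as shell-style globs (via fnmatch) rather than regexes, and an empty scope stack plays the role of the implicit top-level scope that always matches.

```python
from fnmatch import fnmatchcase


def options_for(argv, scope):
    """Return the options whose enclosing --offload= patterns all match `scope`."""
    stack = []      # patterns of the currently open scopes, outermost first
    selected = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("--offload="):
            stack.append(arg[len("--offload="):])   # open a nested scope
        elif arg == "--offload-end":
            levels = 1                              # default: pop one level
            if i + 1 < len(argv) and argv[i + 1].isdigit():
                levels = int(argv[i + 1])
                i += 1
            del stack[max(0, len(stack) - levels):]
        elif all(fnmatchcase(scope, pattern) for pattern in stack):
            selected.append(arg)                    # visible in this scope
        i += 1
    return selected


argv = ["-O2",
        "--offload=hip-gfx9*", "--common-to-gfx9",
        "--offload=hip-gfx999", "--only-gfx999",
        "--offload-end", "2",
        "-g"]
print(options_for(argv, "hip-gfx999"))
print(options_for(argv, "hip-gfx906"))
```

Options outside any scope (-O2 and -g here) are seen by every sub-compilation, while the gfx999-only option is seen only when every enclosing pattern matches the scope string.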

It all may be an utter overkill, too. WDYT?

--Artem

My primary concern is around the quoting messiness your approach is
likely to involve. POSIX shell rules make sense, but I can never
remember cmd.exe quoting rules. Add in trying to reliably get the
quotes plumbed through CMake or other build system code…

--Ben

Note that tools such as ccache and sccache generally need to be able to
understand what's going on (I believe distcc and other distributed
compilation tools also generally need to know too), so making it
sensible enough for interpretation based on just the flags to be
possible should be considered.

For prior art, there are scoping operators for Linux's `ld`:

  - `-(` and `-)` (`--start-group` and `--end-group`)
  - stateful flags like `--no-as-needed` and `--as-needed` that change
    how future arguments are interpreted

--Ben

I think this is somewhat orthogonal to how we specify per-target options. Such a tool almost never knows about all possible compiler options and has to pass unknown options through as-is. However, any form of ‘nested’ options specified on the command line has a chance of confusing such a tool. E.g. if I want to pass ‘-E’ to some sub-tool for a particular offload target, ccache, not being aware that it is not a top-level compilation option, may interpret it as an attempt to preprocess the TU.

I wonder if it would make sense to just move all this per-target option complexity into an external response file. As far as existing tools are concerned, it would look like --offload-options=target-opts.file, without affecting the tool’s general idea of what this compilation is about to do, and the external file would allow us to be as flexible as we need to be in specifying per-target options. It could be just a flat list of pairs -Xarch_... optA. Or we could use YAML.

That approach, however, has its own issues and would still need to be optional. If it’s the only way to specify offload options, that will complicate other use cases as now they would have to deal with temporary files.

Maybe a slightly modified variant of jdoefert@'s idea would work better:

-offload="amd -march=gfx906 -fno-vectorize" -fopenmp

Implement it in a way similar to -Wl,optA,optB,optC and extend it to match an offload scope glob/regex.
E.g. -offload=<offload-pattern>,optA,optB,optC.
As far as the external tools are concerned, it’s just one option to pass through. At the same time, it should be flexible enough to apply the options to a subset of offload targets in a human-manageable way.

Sorry for the delay.

Both Johannes’ and Artem’s proposals should satisfy the needs of users:

Option 1:

-offload=<offload-pattern> optA optB optC.

Option 2:

-offload=<offload-pattern>,optA,optB,optC.

Compared to the old options, they are more concise and more readable.

The main difference is the delimiter. To me, option 2 is more attractive since it does not need quoting in most cases.

Can we reach an agreement on option 2?

Thanks.

Sam

I’m fine with #2. We’re using something similar with our build tools and it works reasonably well.
However, it does have one annoying corner case. There’s no easy way to pass an option which has a comma in it, e.g. if I want to pass -Wl,something,something. Perhaps we could use a sed-like approach and allow changing the separator, e.g. s/a/b/ == s@a@b@.
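
To illustrate the corner case and one possible way around it, here is a small Python sketch. The concrete syntax is purely hypothetical and not something proposed in this thread: it assumes that if the value starts with one of a few reserved characters (ALT_SEPARATORS), that character replaces the comma as the separator, loosely mirroring how sed lets you pick the delimiter.

```python
# Hypothetical set of characters allowed as a custom separator.
ALT_SEPARATORS = "@|;"


def split_offload_value(value):
    """Split '<pattern>,optA,optB' on ',' or on a custom leading separator."""
    sep = ","
    if value and value[0] in ALT_SEPARATORS:
        # The first character names the separator for the rest of the value.
        sep, value = value[0], value[1:]
    pattern, *opts = value.split(sep)
    return pattern, opts


# Default separator: an embedded -Wl,a,b gets torn apart.
print(split_offload_value("gfx906,-Wl,a,b"))
# Custom separator: the comma-containing option survives intact.
print(split_offload_value("@gfx906@-Wl,a,b@-O3"))
```

A real design would also have to decide how such a leading character interacts with offload patterns that may themselves start with punctuation.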

--Artem

I'm OK with either.

There is another aspect we need to consider: how to modify the -target option by additional options?

For the existing --offload-arch option, we could use -Xarch_ to add specific options for it.

Assuming we have an -offload="amdgcn -mcpu=gfx906" option, then we want to add some options specific to it by an additional option, what should we do?

Thanks.

Sam

-Xarch_xxx as implemented right now is a rather limited hack. IIRC it only accepts options w/o arguments, which limits its usability.

I think we’ve been conflating telling the driver what to compile for and customizing individual sub-compilations.

We could explicitly separate the two tasks. E.g.:
--[no-]offload=target1,target2,target3...
--Xoffload=target_pattern target_options...

This way your example would be handled with:
"--offload=gfx906,gfx1010"
"--Xoffload=gfx* options common to all AMD GPUs"
"--Xoffload=gfx906 -mcpu=gfx906 --fsomething-specific-to-gfx906"

In the end -Xarch_xxx would become an alias for ‘-Xoffload=xxx’.
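
The split between "what to compile for" and "how to customize each sub-compilation" could be sketched in Python as follows. This is only an assumed model of the idea: plan_subcompilations is a hypothetical helper, target patterns are matched as shell-style globs, each --Xoffload= value is split on whitespace with the first token taken as the pattern, and the --no-offload form is not handled.

```python
import shlex
from fnmatch import fnmatchcase


def plan_subcompilations(argv):
    """Map each --offload target to the --Xoffload options whose pattern matches it."""
    targets, rules = [], []
    for arg in argv:
        if arg.startswith("--offload="):
            # What to compile for: a comma-separated target list.
            targets += arg[len("--offload="):].split(",")
        elif arg.startswith("--Xoffload="):
            # How to customize: "<target_pattern> <options...>" in one argument.
            pattern, *opts = shlex.split(arg[len("--Xoffload="):])
            rules.append((pattern, opts))
    return {
        target: [opt for pattern, opts in rules
                 if fnmatchcase(target, pattern) for opt in opts]
        for target in targets
    }


argv = ["--offload=gfx906,gfx1010",
        "--Xoffload=gfx* -O3",
        "--Xoffload=gfx906 -mcpu=gfx906 -ffast-math"]
print(plan_subcompilations(argv))
```

Here both targets receive the options attached to the gfx* pattern, and only gfx906 receives the options attached to its own name; -O3 and -ffast-math are placeholders chosen for the example.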

--Artem

+1

We need to distinguish target triples, since it may not always be possible to infer the target triple from the CPU name. So I guess it would look like:

"--offload=amdgcn-gfx906,amdgcn-gfx1010"
"--Xoffload=amdgcn-gfx* options common to all AMD GPUs"
"--Xoffload=amdgcn-gfx906 -mcpu=gfx906 --fsomething-specific-to-gfx906"

Sam

SGTM.
Do you expect the AMDGPU features (+xnack, -ecc, etc.) to be part of the offload target? Or would they be specified via -Xoffload arguments?

--Artem