Declare variant + Nvidia Device Offload

Hi all,

what's the current status of declare variant when compiling for Nvidia
GPUs?

In my code, I have declared a variant of a function that uses CUDA's
built-in atomicAdd (using the syntax from OpenMP TR8):

#pragma omp begin declare variant match(device={kind(nohost)})

void atom_add(double* address, double val) {
    atomicAdd(address, val);
}

#pragma omp end declare variant

When compiling with Clang from master, ptxas fails:

clang++ -fopenmp -O3 -std=c++11 -fopenmp
-fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_72 -v
[...]
ptxas kernel-openmp-nvptx64-nvidia-cuda.s, line 322; fatal : Parsing
error near '.ompvariant': syntax error
ptxas fatal : Ptx assembly aborted due to errors
[...]
clang-11: error: ptxas command failed with exit code 255 (use -v to
see invocation)

The line mentioned in the ptxas error looks like this:

    // .globl       _Z33atom_add.ompvariant.S2.s6.PnohostPdd

.visible .func _Z33atom_add.ompvariant.S2.s6.PnohostPdd(
.param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_0,
.param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_1
)
{

My guess is that ptxas stumbles over the ".ompvariant" part of the
mangled function name.

Is declare variant currently supported when compiling for Nvidia GPUs?
If not, is there a workaround (macro defined only for device
compilation, access to the atomic CUDA functions, ...)?

Thanks in advance,

Best

Lukas

Oh, I forgot about this one.

The math stuff works because all declare variant functions are static.

I think we need to replace the `.` with a symbol that the user cannot
use but the ptxas assembler is not upset about. We should also move
`getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
`llvm/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.

You should also be able to use the clang builtin atomics, and even
`omp atomic` should eventually resolve to the same thing (I hope).

Let me know if that helps,

Johannes

Hi Johannes,

thanks for your quick reply!

> The math stuff works because all declare variant functions are static.
>
> I think we need to replace the `.` with a symbol that the user cannot
> use but the ptxas assembler is not upset about. We should also move
> `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
> `llvm/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.

The `.` also seems to be part of the mangled context. Where does that mangling take place?

According to the PTX documentation [0], identifiers cannot contain dots, but $ and % are allowed in user-defined names (apart from a few predefined identifiers).

Should we replace the dot only for Nvidia devices or in general? Do any other parts of the code rely on the mangling of the variants with dots?

> You should also be able to use the clang builtin atomics

You were referring to
https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins,
weren't you? As far as I can see, those only work on atomic types.

> `omp atomic` should eventually resolve to the same thing (I hope).

From what I can see in the generated LLVM IR, this does not seem to be the case. Maybe that's due to the fact that I'm using update or structs (for more context, see [1]): there are a number of atomic loads and an atomicrmw in the end, but no calls to CUDA builtins.

Hi Johannes,

>
> thanks for your quick reply!
>
>> The math stuff works because all declare variant functions are static.
>>
>> I think we need to replace the `.` with a symbol that the user cannot
>> use but the ptxas assembler is not upset about. We should also move
>> `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
>> `llvm/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.
>>
> The `.` also seems to be part of the mangled context. Where does that
> mangling take place?

OMPTraitInfo::getMangledName() in OpenMPClause.{h,cpp} (clang)

I guess it doesn't live in OMPContext because it uses the user defined
structured trait set to create the mangling. Right now I don't see a
reason not to use the VariantMatchInfo and move everything into
OMPContext. Though, no need to do this now.

> According to the PTX documentation [0], identifiers cannot contain dots,
> but `$` and `%` are allowed in user-defined names (apart from a few
> predefined identifiers).
>
> Should we replace the dot only for Nvidia devices or in general? Do any
> other parts of the code rely on the mangling of the variants with dots?

Yes, we need to replace the dot with either of the symbols you
mentioned. If we can use the same symbols on the host, I'm fine with
changing it unconditionally.

Except for the test, I don't think we need to adapt anything else.

FWIW, OpenMP 6.0 will actually define the mangling.

>> You should also be able to use the clang builtin atomics
> You were referring to
> https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins,
> weren't you? As far as I can see, those only work on atomic types.

I meant:
http://llvm.org/docs/Atomics.html#libcalls-atomic

>> `omp atomic` should eventually resolve to the same thing (I hope).
> From what I can see in the generated LLVM IR, this does not seem to be
> the case. Maybe that's due to the fact that I'm using update or structs
> (for more context, see [1]):
>
>> #pragma omp atomic update
>> target_cells_[voxelIndex].mean[0] += (double) target_[id].data[0];
>> #pragma omp atomic update
>> target_cells_[voxelIndex].mean[1] += (double) target_[id].data[1];
>> #pragma omp atomic update
>> target_cells_[voxelIndex].mean[2] += (double) target_[id].data[2];
>> #pragma omp atomic update
>> target_cells_[voxelIndex].numberPoints += 1;
> In the generated LLVM IR, there are a number of atomic loads and an
> atomicrmw in the end, but no calls to CUDA builtins.
>
> The CUDA equivalent of this target region uses calls to atomicAdd and
> according to nvprof, this is ~10x faster than the code generated for the
> target region by Clang (although I'm not entirely sure the atomics are
> the only problem here).

I see. This is a performance bug that needs to be addressed. I'm just not sure where exactly.

> Best,
>
> Lukas
>
> [0]
> PTX ISA :: CUDA Toolkit Documentation
>
> [1]
> daphne-benchmark/kernel.cpp at 054bbd723dfdf65926ef3678138c6423d581b6e1 · esa-tu-darmstadt/daphne-benchmark · GitHub

Hi Johannes,

> Yes, we need to replace the dot with either of the symbols you
> mentioned. If we can use the same symbols on the host, I'm fine with
> changing it unconditionally.

I've prototyped this change locally, replacing the dot with a
dollar-sign in the places you mentioned. With this change, declaring
variants for offloading to CUDA devices works for my testcase in my
setup (ARM + Nvidia SM 7.2).

If you want, I could create a patch for that, but do you (or anyone else
on the list) know whether the other offloading targets (AMDGPU, NEC VE,
...) are all able to handle the dollar sign in the mangled name, so we
could make this change unconditionally? I do not have a setup for each
of these to test.

I was also wondering if legalizing the mangled name shouldn't be handled
by the backend (NVPTX in this case) instead of the OpenMP-specific parts.

> I see. This is a performance bug that needs to be addressed. I'm just
> not sure where exactly.

Now that I was able to use atomicAdd with declare variant, I can confirm
that the performance improves significantly with atomicAdd. I will try
to get some more detailed numbers from nvprof.

Best,

Lukas

Lukas Sommer, M.Sc.
TU Darmstadt
Embedded Systems and Applications Group (ESA)
Hochschulstr. 10, 64289 Darmstadt, Germany
Phone: +49 6151 1622429
www.esa.informatik.tu-darmstadt.de

Hi Lukas,

> Hi Johannes,

>
>> Yes, we need to replace the dot with either of the symbols you
>> mentioned. If we can use the same symbols on the host, I'm fine with
>> changing it unconditionally.
> I've prototyped this change locally, replacing the dot with a
> dollar-sign in the places you mentioned. With this change, declaring
> variants for offloading to CUDA devices works for my testcase in my
> setup (ARM + Nvidia SM 7.2).
>
> If you want, I could create a patch for that, but do you (or anyone else
> on the list) know whether the other offloading targets (AMDGPU, NEC VE,
> ...) are all able to handle the dollar sign in the mangled name, so we
> could make this change unconditionally? I do not have a setup for each
> of these to test.

Please prepare a patch. People can then test it, e.g., AMD folks, but I
expect it to work fine and us adopting it :)

> I was also wondering if legalizing the mangled name shouldn't be handled
> by the backend (NVPTX in this case) instead of the OpenMP-specific parts.

I doubt it is reasonable to change symbol names once they are defined.
It would work here "by accident", but with the OpenMP 6.0 mangling this
will in fact be illegal.

>> I see. This is a performance bug that needs to be addressed. I'm just
>> not sure where exactly.
> Now that I was able to use atomicAdd with declare variant, I can confirm
> that the performance improves significantly with atomicAdd. I will try
> to get some more detailed numbers from nvprof.
>

Cool! I'm glad the declare variant begin/end works and exposing the CUDA
intrinsics is useful/working too :)


For some context on this, see the following set of reviews:
https://reviews.llvm.org/D17738
https://reviews.llvm.org/D29883
https://reviews.llvm.org/D39005
https://reviews.llvm.org/D40573

My favorite statement from the last patch: "This is silly. This bug has
been open for so long that nvidia could've just fixed their toolchain
by now to accept dots in symbol names."

Jonas


Thanks for the links!

I think in those reviews people also concluded you cannot/should not
modify visible symbol names late.

I guess, since we have _$_ in other cases, we should use _$_ as the
separator in Clang too. WDYT?

Probably depending on the target architecture; see for example
CGOpenMPRuntime::getOutlinedHelperName() and the override in
CGOpenMPRuntimeNVPTX for precedent.

I would like to avoid using just __, for c++filt reasons. CGOpenMPRuntime
is going away; if we want to change something, this is the time.

Hi Johannes,


IMHO, if we can have a single solution that works on all (currently
supported) offloading targets, we should use that and avoid the additional
complexity of a target-dependent solution.

_$_ seems like a reasonable solution, as ISO C/C++ forbids dollar signs
in identifiers (even though most compilers still accept them or have
flags for it).

WDYT? Would you prefer a mechanism for a target-dependent separator?

> Please prepare a patch. People can then test it, e.g., AMD folks, but I
> expect it to work fine and us adopting it

Ok, will do after we agree on a solution.

Best,

Lukas


I'd really prefer something that works with c++filt if we don't have a
reason not to. If you pick a separator starting with "$" it should work
fine. Whether there is an underscore after it or not, I don't care much.

>> Please prepare a patch. People can then test it, e.g., AMD folks, but I
>> expect it to work fine and us adopting it
>
> Ok, will do after we agree on a solution.

I CC'ed Ravi and Jon to confirm that the Intel and AMD toolchains will be
fine with "$" in the symbol name. We already know it should work for
NVIDIA and NEC. I guess that would be all.


'$' works fine on amdgcn. We're using llvm-mc for the assembler and
(nearly?) standard ELF for the binary format, so it really should work.
A hello-world-style test passes too.

An embedded \0 in the middle of the symbol name was rejected in the IR,
but anything else seems to work.

Thanks!

Jon

I was not following this thread.

Is this for function names only?

I believe if it is function names only, it should be OK.

Thanks

Ravi

Function names only, yes.

Basically, the question is whether "$" as part of a function name is OK in the Intel pipeline.

We will use it to mangle declare variant functions because the "." we used so far upsets ptxas…

Ok, so the dollar sign should work on all architectures, and I will create a patch for that.

Best,

Lukas