[CUDA] Why "Disallow 'extern __shared__' variables"?

Hi,

I was wondering why Clang-CUDA now disallows adding 'extern' on __shared__ variables with this patch:
https://reviews.llvm.org/D25125?id=73133

I had been using it successfully, and I believe correctly, in both nvcc and Clang-CUDA to build multiple relocatable objects with extern shared memory references that nvlink resolved against these global definitions. Please let me know if you would like specific examples and I'll be happy to describe them.

Thanks,
Arpith

Clang will accept

  "extern __shared__ int x[];"

but it rejects

  "extern __shared__ int x;"

The CUDA programming guide is not precise, but as I read it, we're
doing the right thing here:

  CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Given that "extern __shared__" means "get me a pointer to the
dynamically-allocated shared memory for this kernel," using a
non-array / non-pointer type would be...odd?
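
For context, the documented way to use the array form is roughly the
following; the kernel and names here are just an illustrative sketch:

  __global__ void scale(float *out, const float *in, float s, int n) {
    // The unsized extern array aliases the dynamically allocated shared
    // memory for this launch; its size is the third <<<...>>> argument.
    extern __shared__ float dyn_smem[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dyn_smem[threadIdx.x] = (i < n) ? in[i] * s : 0.0f;  // stage in shared memory
    __syncthreads();
    if (i < n)
      out[i] = dyn_smem[threadIdx.x];
  }

  // Host side, passing the dynamic shared-memory size at launch:
  //   scale<<<blocks, threads, threads * sizeof(float)>>>(out, in, 2.0f, n);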

nvcc may accept it without the "[]", but this is a pretty low bar to
set for semantic correctness. :) clang already rejects plenty of
semantically-questionable code that nvcc accepts.

I agree the error should be better, though. I'd be happy to review a patch?

> I had been using it successfully, and I believe correctly, in both nvcc and Clang-CUDA to build multiple relocatable objects with extern shared memory references that nvlink resolved against these global definitions.

I am impressed and a little alarmed that you've gotten this to work
with clang. There is code in clang that's explicitly incompatible
with multi-translation-unit device-side compilation. Off the top of
my head: we mark all device functions as internal, which we need in
order to get some key optimizations in LLVM. But this means that a
device function that isn't used in the translation unit where it's
defined won't be emitted at all, so other translation units have
nothing to link against.
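
A minimal sketch of the failure mode that implies (the file names and
the helper function are hypothetical):

  // helper.cu: defines a device function intended to be called from
  // another translation unit.
  __device__ int twice(int x) { return 2 * x; }

  // kernel.cu: only declares it. Under whole-program CUDA compilation
  // the definition in helper.cu gets internal linkage (and is discarded
  // if unused there), so the call below has nothing to resolve against
  // at device link time.
  __device__ int twice(int x);

  __global__ void k(int *out) { *out = twice(21); }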

This is something we've wanted to fix, but it's going to be a bit
tricky to do in a way that doesn't regress performance for people
doing single-TU compilation. We'll probably need to add a flag. If
you're interested in making this work, I'd be happy to provide
guidance on the patches.

-Justin

Hi Justin,

Thanks for your response.

I am using a mix of our OpenMP NVPTX toolchain for OpenMP GPU programs and Clang-CUDA for the OpenMP runtime that we've written in CUDA. This may be the source of some of your surprises.

I translate the CUDA code to LLVM IR and pull it into the user's GPU program (with -mlink-cuda-bitcode, similar to how you pull in libdevice.compute.bc). We then use our toolchain to build relocatable objects with ptxas. I'll be happy to talk more about our use case and how we can make the improvements you suggest.

> Given that "extern __shared__" means "get me a pointer to the
> dynamically-allocated shared memory for this kernel," using a
> non-array / non-pointer type would be...odd?
>

I believe the difference is whether the CUDA code is being compiled in whole-program mode or in separate compilation mode. The following section covers the case I described for separate compilation mode, which is what I'm doing:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-qualifiers

"When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable)."

Can we add a flag in Clang-CUDA to indicate separate compilation mode?

Could you point me to patches/code that I can look at to understand the implications of separate compilation? What LLVM optimizations benefit from whole-program compilation mode? What, if anything, breaks if I use it for separate compilation?

Thanks in advance for your time.

Arpith

[+Samuel]

From: "Arpith C Jacob via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Justin Lebar" <jlebar@google.com>
Cc: "cfe-dev" <cfe-dev@lists.llvm.org>
Sent: Friday, October 28, 2016 11:58:17 AM
Subject: Re: [cfe-dev] [CUDA] Why "Disallow 'extern __shared__' variables"?

For some additional context, see also:

  PR30812 – Binds for OpenMP offloading actions are not yet implemented for NVPTX targets (https://llvm.org/PR30812)
  PR26343 – Cuda: Relocatable device code doesn't work (https://llvm.org/PR26343)

> Can we add a flag in Clang-CUDA to indicate separate compilation mode?

I'd definitely like to see this happen. I have users for whom this capability is important.

-Hal

> Can we add a flag in Clang-CUDA to indicate separate compilation mode?

Yes, I would be happy to take such a patch.

> Could you point me to patches/code that I can look at to understand the implications of separate compilation?

There's a TODO in NVPTXAsmPrinter, but that's an ABI compatibility
issue, which isn't a problem if you're compiling everything with
clang.

The "mark everything as internal" code is in
CodeGenModule::getLLVMLinkageForDeclarator -- that's the big one, off
the top of my head.

> What LLVM optimizations benefit from whole-program compilation mode?

Many interprocedural optimizations will not fire on externally-visible
ODR functions (basically, anything "inline" or a template that's not
"static" or in an anon namespace).

I believe, for CUDA specifically, there's an optimization for const
__restrict pointers that lets us translate reads into __ldg
instructions, but we can't do this when the function is not internal.
I am not sure, though.
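
For reference, this is the kind of kernel where that optimization can
apply (a minimal sketch):

  // With both pointers restrict-qualified and 'in' const, the compiler
  // can prove the data read through 'in' is read-only for the lifetime
  // of the kernel, so those loads may go through the read-only data
  // cache (ld.global.nc), as if __ldg(&in[i]) had been written.
  __global__ void copy(float *__restrict__ out,
                       const float *__restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] = in[i];
  }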

> What, if anything, breaks if I use it for separate compilation?

See above, and my previous email.

Regards,

-Justin

Thanks. I’ll take a look.

> There's a TODO in NVPTXAsmPrinter, but that's an ABI compatibility
> issue, which isn't a problem if you're compiling everything with
> clang.

Can you verify whether the bug in ptxas for which this patch ([NVPTX] Force minimum alignment of 4 for byval arguments of device-side functions, https://reviews.llvm.org/D22428) was submitted still exists in CUDA 8.0? If you have the buggy CUDA program, I can check.

Thanks,
Arpith

> Can you verify whether the bug in ptxas for which this patch ([NVPTX] Force minimum alignment of 4 for byval arguments of device-side functions, https://reviews.llvm.org/D22428) was submitted still exists in CUDA 8.0?

Even if it's fixed in CUDA 8, clang still supports old CUDA versions,
so I'm not sure there's much we could do, unless you want LLVM to
detect the CUDA version.

But I am 95% sure that Art tested on CUDA 8 and the bug still exists there.

The main impact is that the optimizer knows it sees every use of every variable and function. That means the ABI/calling convention can be changed, arguments can be eliminated, there is less of a trade-off when inlining a function that has a single use, global variables can sometimes be turned into local variables, alias analysis is a lot better for global variables, etc.
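
To make that concrete with a small (made-up) device-code example:

  // When the compiler sees every use of helper() -- i.e. it is internal
  // -- it can inline the single call site, drop the always-false 'debug'
  // argument, and then delete the function body entirely. If helper()
  // were externally visible, its ABI and body would have to be kept for
  // callers in other translation units.
  static __device__ float helper(float x, bool debug) {
    return debug ? x : x * x;
  }

  __global__ void square(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] = helper(in[i], false);
  }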