SPIR provisional specification is now available on the Khronos website

Hi James,

Some additional comments regarding a few of your questions:
Q: Is SPIR meant to be storage-only, or to allow optimizations to be done?

I agree with Micah that optimizing a SPIR module might make it less portable.
However, SPIR doesn't prohibit optimizations. It is up to the OpenCL optimizer to decide when to "materialize" SPIR to a device specific LLVM module or even convert it to another IR.
It would be useful if we could identify areas in the specification that might break this assumption and discuss the observed limitations.
I think SPIR's offering would be stronger if optimizations could be performed safely.
So the answer to your question is: no, it is not just a storage-only format.

Q: So what's the advantage of adding a semantic-less calling convention over metadata?

Metadata could be used as well; in fact, this is the approach clang uses today (a rough sketch of both approaches follows the list below).
1. However, since this is not a storage-only format, we felt it would be useful to differentiate these calling conventions from the existing ones in LLVM.
2. Also, since SPIR kernels & functions are device- and OS-agnostic, no existing calling convention is really suitable for SPIR. Hence, we chose to introduce new ones.
3. Another, smaller reason in favor of a new calling convention vs. metadata is that metadata can't be associated with functions in LLVM. This makes the metadata arrangement a bit more complex and less trivial for OpenCL optimizers to access.
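To make the comparison concrete, here is a rough sketch of both approaches (hypothetical function; syntax approximate):

    ; Metadata approach (roughly what clang emits today):
    define void @foo(float addrspace(1)* %out) {
      ret void
    }
    !opencl.kernels = !{!0}
    !0 = metadata !{void (float addrspace(1)*)* @foo}

    ; Dedicated calling conventions, as proposed for SPIR
    ; (spir_kernel for kernels, spir_func for non-kernel device functions):
    define spir_kernel void @bar(float addrspace(1)* %out) {
      ret void
    }
    define spir_func float @helper(float %x) {
      ret float %x
    }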

Does this make sense? Do you see an issue with adding the suggested calling conventions?

Thanks,
Boaz

Hi Boaz, Micah,

Thanks for the followup.

> I agree with Micah that optimizing a SPIR module might make it less portable.
> However, SPIR doesn't prohibit optimizations. It is up to the OpenCL optimizer to decide when to "materialize" SPIR to a device specific LLVM module or even convert it to another IR.
> It would be useful if we could identify areas in the specification that might break this assumption and discuss the observed limitations.
> I think SPIR's offering would be stronger if optimizations could be performed safely.
> So the answer to your question is: no, it is not just a storage-only format.

My only issue, then, is: if it is not specifically designed to be a
storage-only format, why restrict the allowable instructions to the
strict set that can be generated by an OpenCL C frontend? This would
make sense for a storage-only format, but as soon as you let
optimizers loose on it, they will by definition change the IR. The
transformation from scalar extractelement/fptoint/insertelement
sequences to a vector fptoint is one concrete example I noticed when
perusing the spec where a valid, target-independent optimisation could
result in invalid SPIR. I was wondering about the benefit of
restricting the allowable instructions so strongly.
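A rough sketch of the kind of IR I mean (hypothetical values; syntax approximate):

    ; Scalarised form an OpenCL C frontend might emit:
    %e0 = extractelement <2 x float> %v, i32 0
    %c0 = fptosi float %e0 to i32
    %e1 = extractelement <2 x float> %v, i32 1
    %c1 = fptosi float %e1 to i32
    %r0 = insertelement <2 x i32> undef, i32 %c0, i32 0
    %r1 = insertelement <2 x i32> %r0, i32 %c1, i32 1

    ; What a target-independent optimiser may legitimately turn it into:
    %r = fptosi <2 x float> %v to <2 x i32>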

> Q: So what's the advantage of adding a semantic-less calling convention over metadata?

> Metadata could be used as well; in fact, this is the approach clang uses today.
> 1. However, since this is not a storage-only format, we felt it would be useful to differentiate these calling conventions from the existing ones in LLVM.
> 2. Also, since SPIR kernels & functions are device- and OS-agnostic, no existing calling convention is really suitable for SPIR. Hence, we chose to introduce new ones.
> 3. Another, smaller reason in favor of a new calling convention vs. metadata is that metadata can't be associated with functions in LLVM. This makes the metadata arrangement a bit more complex and less trivial for OpenCL optimizers to access.

My worry here (and please take this with a pinch of salt, because I am
by no means one of the core LLVM developers) is that calling
conventions in the IR are primarily related to code generation. As
SPIR isn't going to be used for codegen directly (it is passed to some
backend / converter that is responsible for that), why worry about the
calling convention? It's just a wart; it doesn't affect mid-end phases
at all.

I should note that passing structs (byval or not) in the first
parameter as an sret argument is *independent of the LLVM calling
convention*, and is done at the Clang level. This is a part of
LLVM/Clang where responsibility for adhering to an ABI overlaps
between the IR generator and the IR itself. Adding a new calling
convention in IR and marking functions with it won't stop the IR
generator from having to make ABI decisions.
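To illustrate (a rough sketch with a hypothetical struct; this is the kind of decision the IR generator makes, regardless of which calling convention is on the function):

    %struct.S = type { <4 x float> }

    ; "struct S make(float4 x)" returned indirectly through an sret pointer,
    ; an ABI choice made by the frontend, not by the calling convention:
    define void @make(%struct.S* sret %agg.result, <4 x float> %x) {
      %p = getelementptr inbounds %struct.S* %agg.result, i32 0, i32 0
      store <4 x float> %x, <4 x float>* %p
      ret void
    }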

Not only that, but for valid codegen any backend is going to have to
remove those calling convention markers anyway and replace them with
their own, so why have them in the first place?

I've rambled slightly, sorry about that!

Hi James,

This is very good feedback.

1. Adding the new calling conventions - It seems like the appropriate thing to do vs. metadata. Some OpenCL backends can choose to implement this calling convention and use it during code generation of OpenCL functions/kernels. Can we agree on this item?
2. Restricting the allowable instructions - As Micah mentioned before, the restrictions are there because we are only looking at this for OpenCL at this time. Hence, we currently only map what OpenCL supports. However, I agree that this might make some optimizations incompatible with SPIR. Let me discuss this a bit further in Khronos and come back with additional feedback.

Thanks,
Boaz

Hi Boaz, David,

Thanks for taking my responses on board.

> 1. Adding the new calling conventions - It seems like the appropriate thing to do vs. metadata. Some OpenCL backends can choose to implement this calling convention and use it during code generation of OpenCL functions/kernels. Can we agree on this item?

Hmm, this is the one I was most shaky on. I still don't fully
understand what you're using this calling convention for, other than:

  1. A marker to distinguish kernels from regular functions, in which
case there are better ways to do that (metadata)
  2. A way to remove remnants of the platform or C calling convention.
In this case, I've mentioned that there are things such as
pass-struct-by-pointer that are lowered in the IR-generation stage
that you haven't addressed. Without addressing this (the
IR-generator's part of the ABI compliance contract), I'm not sure how
a new LLVM calling convention helps remove platform dependence.

Although I may have missed a/the point somewhere along the line, I've
been a bit ill in recent days and not fully engaged, brain-wise :)

Cheers,

James

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf
Of James Molloy
Sent: Wednesday, September 12, 2012 12:18 PM
To: Ouriel, Boaz
Cc: cfe-dev@cs.uiuc.edu; llvmdev@cs.uiuc.edu; Villmow, Micah
Subject: Re: [cfe-dev] [LLVMdev] SPIR provisional specification is now
available in the Khronos website

> Hi Boaz, David,
>
> Thanks for taking my responses on board.
>
> > 1. Adding the new calling conventions - It seems like the appropriate
> > thing to do vs. metadata. Some OpenCL backends can choose to implement
> > this calling convention and use it during code generation of OpenCL
> > functions/kernels. Can we agree on this item?
>
> Hmm, this is the one I was most shaky on. I still don't fully
> understand what you're using this calling convention for, other than:
>
>   1. A marker to distinguish kernels from regular functions, in which
> case there are better ways to do that (metadata)

[Villmow, Micah] I disagree: the 'kernel' keyword specifies a different calling convention from functions that don't have it, so we need a calling convention that maps to that. The ABI for a kernel is different from the ABI for a non-kernel function. The regular function is also different from the default calling convention, because it has restrictions that the normal calling convention does not (one being that it can only be called from a kernel or from another device function). A way to think about this is that it is more similar to the PTX_[Kernel|Device] calling conventions than to the default, fast or stdcall conventions. I don't think metadata is the right approach here, since we are specifying unique ways for how these functions can be called.
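In IR terms, the analogy is roughly the following (a sketch; the spir_kernel/spir_func spellings are how I would write the proposed conventions):

    ; PTX already distinguishes kernels from device functions by calling convention:
    define ptx_kernel void @k(float* %p) { ret void }
    define ptx_device float @f(float %x) { ret float %x }

    ; The SPIR proposal is analogous:
    define spir_kernel void @k2(float addrspace(1)* %p) { ret void }
    define spir_func float @f2(float %x) { ret float %x }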

>   2. A way to remove remnants of the platform or C calling convention.
> In this case, I've mentioned that there are things such as
> pass-struct-by-pointer that are lowered in the IR-generation stage
> that you haven't addressed. Without addressing this (the
> IR-generator's part of the ABI compliance contract), I'm not sure how
> a new LLVM calling convention helps remove platform dependence.

[Villmow, Micah] Ok, fair point, this is something that the specification should probably address.

For what it's worth, this issue manifests itself in an unsolved pocl
bug: #987905, "non-pointer struct kernel arguments fail due to var..."

It would be simpler to write a portable implementation of calling the
kernel from the host if we could assume the kernel calling convention
maps each OpenCL setArg argument to a single kernel function argument
(and preferably passes all the argument data in memory). For non-kernel
functions it should not matter and could be target-specific.
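Roughly what I have in mind (hypothetical kernel; argument names invented):

    ; OpenCL C:  kernel void k(global float *dst, float4 scale, int n)
    ; Desired: clSetKernelArg index i corresponds one-to-one to IR argument i,
    ; with no ABI-driven splitting or merging of arguments:
    define spir_kernel void @k(float addrspace(1)* %dst, <4 x float> %scale, i32 %n) {
      ret void
    }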

Hi all,

Another OpenCL C implementation issue I'm currently fighting with is how
best to implement the automatic __local variables. It seems SPIR enforces
the current Clang implementation, which converts the automatic locals to
C function-static variables (thus, in practice, module-level global variables).

Clearly, this is not a thread-safe "final implementation"; it works as-is
only when multiple work-groups of the same kernel are not executed in
parallel. Therefore, some other compiler pass is assumed to convert those
function-static (module-global) variables to some other storage where the
local buffers are allocated per work-group thread.

The pocl implementation is what was suggested some time ago on this list:
the locals are converted to __local arguments of the kernel function, which
are then passed per-thread buffers when the work-group is executed. Thus,
pocl needs to convert the references to these dummy globals into local
buffer pointers appended to the end of the kernel function's argument list.
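Roughly, for a hypothetical kernel (the exact IR differs in details):

    ; OpenCL C:
    ;   kernel void k(global float *out) { local float tmp[64]; ... }

    ; What Clang/SPIR produces today: the automatic __local becomes a
    ; module-scope addrspace(3) global, i.e. a "function static":
    @k.tmp = internal addrspace(3) global [64 x float] undef

    define spir_kernel void @k(float addrspace(1)* %out) {
      ; ... accesses @k.tmp directly ...
      ret void
    }

    ; What pocl rewrites it to: an extra __local pointer argument that the
    ; runtime binds to a per-work-group buffer:
    define spir_kernel void @k2(float addrspace(1)* %out,
                                [64 x float] addrspace(3)* %tmp) {
      ; ... accesses %tmp instead ...
      ret void
    }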

The problem with using the "semantically inadequate" 'function static'
variables for the local buffers is that LLVM/Clang thinks they are buffers
with a constant base, which they eventually won't be in a parallel WG
implementation. This triggers an issue I'm currently working on in pocl
(bug #1032203, "kernel compiler crash with a private constant arra...",
https://bugs.launchpad.net/pocl/+bug/1032203): Clang generates constant
GEPs for the local buffer accesses, even though in a parallel thread-safe
implementation the local variables cannot be stored at constant locations.
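For example, an access like tmp[5] gets folded into a ConstantExpr, because the global is a Constant (a sketch; syntax approximate):

    ; The address is a ConstantExpr GEP rather than a GEP instruction:
    store float %x, float addrspace(3)* getelementptr inbounds ([64 x float] addrspace(3)* @k.tmp, i32 0, i32 5)

    ; Once @k.tmp has to become a per-work-group pointer (a plain Value, not a
    ; Constant), this ConstantExpr cannot exist and must be rewritten as
    ; ordinary instructions.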

So, I wonder whether this piece of the SPIR spec might cause other similar
problems (LLVM optimizing incorrectly due to the slightly wrong semantics)
in the future, and whether it should be improved. The minimal fix would be
to add some kind of attribute to the function-static global that prevents
Clang/LLVM from assuming the address is constant and applying optimizations
that rely on that. Semantically, the local buffer is actually a thread-local
variable. Are thread locals somehow supported in LLVM IR?
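What I have in mind is something along these lines (just a sketch; I don't know whether the existing thread_local qualifier has suitable semantics for this):

    ; A per-"thread" (per work-group) local buffer instead of a plain global:
    @k.tmp = internal thread_local global [64 x float] undef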

Hi,

I don’t fully understand your problem description.

> …LLVM/Clang thinks they are buffers with a constant base, which they
> eventually won't be in a parallel WG implementation. This triggers an issue
> I'm currently working on in pocl (bug #1032203,
> https://bugs.launchpad.net/pocl/+bug/1032203): Clang generates constant GEPs
> for the local buffer accesses, even though in a parallel thread-safe
> implementation the local variables cannot be stored at constant locations.

Surely if you're passing in pointers to the kernel function that differ per work-group, then a GEP from those pointers by a constant offset is perfectly safe. Why would a constant GEP from a per-workgroup base be a problem?

I’m sure there’s something I’ve misunderstood about your solution…

Cheers,

James

Hi,

Sorry, with a prod from Silviu (cc'd) I now understand. I was interpreting your use of "constant GEP" as "a GEP instruction with constant operands", as opposed to a "ConstantGEP node" (a ConstantExpr), which of course can only take Constant* operands, not arbitrary Value* operands.
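In other words (a sketch, with a hypothetical per-work-group pointer %buf and global @k.tmp):

    ; A GEP *instruction* whose indices are constants: fine, because the base
    ; pointer can be any Value (e.g. a per-work-group kernel argument):
    %p = getelementptr inbounds [64 x float] addrspace(3)* %buf, i32 0, i32 5
    store float %x, float addrspace(3)* %p

    ; A ConstantExpr GEP: it is itself a Constant, so its base must also be a
    ; Constant (e.g. the GlobalVariable), and it gets folded into operands:
    store float %y, float addrspace(3)* getelementptr inbounds ([64 x float] addrspace(3)* @k.tmp, i32 0, i32 5)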

I now fully see the problem and realise that my solution is also prone to that problem :)

Cheers,

James

Well,

To be honest I'm not very comfortable with the whole constant GEP
idea. It's a new thing to me and I do not fully understand its
point in LLVM IR, so I probably wasn't very clear ;)

Anyway, I brought it up as an example of what can happen if one (mis)uses
the C function-static variable semantics for something that is really a
thread-local variable (in the usual thread-parallel implementations).

For the record, I just worked around it in pocl by borrowing the
BreakConstantGEPs code from SAFECode. But for the SPIR specs, IMHO, this
should be reconsidered.
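The effect of that workaround is roughly the following (a sketch):

    ; Before: the address is a ConstantExpr folded into the store:
    store float %x, float addrspace(3)* getelementptr inbounds ([64 x float] addrspace(3)* @k.tmp, i32 0, i32 5)

    ; After a BreakConstantGEPs-style pass: an ordinary GEP instruction, so the
    ; base @k.tmp can later be replaced with a non-constant __local pointer:
    %p = getelementptr inbounds [64 x float] addrspace(3)* @k.tmp, i32 0, i32 5
    store float %x, float addrspace(3)* %p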

> For the record, I just worked around it in pocl by borrowing the
> BreakConstantGEPs code from SAFECode. But for the SPIR specs, IMHO, this
> should be reconsidered.

Yes, I agree.

Micah, Boaz,

Do you guys have any ideas about how to fix this issue?

Cheers,

James

It is my view that this is an implementation detail and not an issue with the SPIR spec. As SPIR is just a representation of a program in a portable manner, it is up to the consumer of SPIR to correctly set up the kernels based on the device's calling convention/ABI when the SPIR binary is loaded for that specific device.

The question was not about implementing the automatic locals (which is
a device-specific detail, as you correctly state), but about the LLVM IR
that SPIR enforces for the automatic locals, which potentially leads to
illegal optimizations due to the inadequate semantics of global variables
for this use.

If SPIR enforces this type of bitcode for the automatic locals, it means
that when such optimizations do happen (the optimizations might be
beneficial in general, so they cannot simply be disabled because of the
SPIR flaw), implementers have to work around them with kludges to
implement the real automatic-local semantics. What's worse, at some point
there might be an optimization that is not easily worked around, which
would make this part of the SPIR specs look bad.

Of course it's possible that the ConstantGEP case is the only problem we
will ever get from this issue, but I wouldn't rely on that in an IR
standard specification if it's possible to avoid it.

BR,

Indeed; I agree it is not an implementation detail, as potentially valid SPIR could be almost untranslatable to valid code running on a target without dedicated workgroup-local memory.

Micah: the problem can be distilled down to: __local variables in SPIR are represented as Constants (GlobalVariable : public Constant), but they are not in fact constant, for a device with no workgroup-local memory.

So it is valid SPIR, as the specification stands, to manipulate __local variables as Constants in a way that is extremely difficult to undo. That is, in order to transform SPIR to code that can run on a CPU, the GlobalVariable (which is a subclass of Constant) must be replaced with a dynamically calculated Value (which is not a subclass of Constant).

The GlobalVariable can be used in ConstantExprs (many forms of which are valid), and converting ConstantExprs to their Instruction corollaries is very difficult in the general case.

Cheers,

James

Hi guys,

> So it is valid SPIR, as the specification stands, to manipulate __local
> variables as Constants in a way that is extremely difficult to undo. That
> is, in order to transform SPIR to code that can run on a CPU, the
> GlobalVariable (which is a subclass of Constant) must be replaced with a
> dynamically calculated Value (which is not a subclass of Constant).

What about translating automatic locals to function-scope pointers?
This would make the handling of automatic locals and __local pointer
arguments similar, which is desirable as they are just two ways to
describe the same thing (I understand automatic locals as simply a
simpler way to use local buffers than local arguments).

In fact, pocl converts automatic locals to implicit "extra" kernel
arguments and manages both cases the same way.

Carlos

Carlos,
AMD's OpenCL implementation (both CPU and GPU) has worked for years with the way SPIR represents locals. If there are problems with the representation, then it is an implementation issue. One of the issues with using extra kernel arguments is that it requires extra validation and complexity at the runtime level that is not needed if it is handled internally by the compiler. That being said, both ways of doing it are equally valid, but the choice of which way to do it is an implementation decision. I don't think it would be that difficult to lower global variables to function arguments given the SPIR representation.

Micah

Micah,

You're saying it works for you, but Clang doesn't currently generate anywhere near the range of horrible ConstantExpr constructs it is possible to create. You can "get by" at the moment with just handling ConstantGEPs, because of the way Clang works.

But SPIR isn't restricted to Clang, and the problem is that it is possible (although not probable, or nice, but that is irrelevant for corner conditions) to get valid SPIR that is very difficult to massage into a shape you can code-generate for CPUs.

The SAFECode snippet that Pekka noted doesn't even handle the case of ConstantExpr shufflevectors, for example.

You can easily simplify this problem with a restriction in SPIR: disallow ConstantExpr casts, i.e. no ptrtoint constant expressions. Because GlobalVariables have pointer type, if you disallow converting them to a non-pointer type in a ConstantExpr, the number of ConstantExpr kinds you have to deal with is drastically reduced (to essentially just bitcast and GEP).
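For example (contrived, but as far as I can tell currently valid; names hypothetical):

    ; Hard to undo once @k.tmp stops being a compile-time constant:
    %h = xor i64 %seed, ptrtoint ([64 x float] addrspace(3)* @k.tmp to i64)

    ; With ConstantExpr casts to non-pointer types disallowed, only forms like
    ; these remain, and they are straightforward to expand into instructions:
    ;   getelementptr inbounds ([64 x float] addrspace(3)* @k.tmp, i32 0, i32 5)
    ;   bitcast ([64 x float] addrspace(3)* @k.tmp to i8 addrspace(3)*)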

That would be a simple, reasonable restriction that would stop potentially malicious, horrible test cases from forcing all CPU SPIR clients to write upwards of a hundred lines of conversion code.

Cheers,

James

James,

Thanks for the suggestion on how to make this case easier to handle. I'll bring it up with the entire working group at our next meeting.

Micah

Are you proposing to disallow the use of an IR instruction type to *possibly*
avoid problems from the (slight) misuse of another LLVM IR construct?

IMHO there should be a more robust solution that fixes the misused construct
itself instead of just trying to "put out the fires it creates". E.g., some
sort of "thread local"-style qualifier on the global that disallows certain
optimizations. A new linkage type, perhaps? Someone more familiar with
LLVM IR than me might have a better idea.

I understand that adding the automatic locals as kernel arguments (in the
specs) is too intrusive now, given the existing implementations that do it
otherwise, especially those for which the semantics "happen" to be correct
as-is, that is, the currently popular GPUs with separate local/scratchpad
memories.

Have a nice weekend,