PNaCl Bitcode reference manual

Hello,

Following an earlier email (http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-June/063010.html), we’ve published an initial version of the PNaCl bitcode reference manual online - http://www.chromium.org/nativeclient/pnacl/bitcode-abi. The PNaCl bitcode is a restricted subset of LLVM IR.

The reference manual is quite terse, so for the bigger picture I’ll repost links to the design document:

Any comments would be most welcome. If anything isn’t clear, please speak up - we intend to improve the documentation incrementally. We’re also working on better formatting, so consider this an early preview :)

Eli

Hi Eli,

Thank you for opening the process up for input and comments. One
question stood out to me while reading the document:

The document [1] indicates that only the 'ccc' calling convention will
be supported. The LLVM documentation [2] prominently notes that,
"tail calls can only be optimized when [fastcc], the GHC or the HiPE
convention is used." Further, I notice that the document includes
"call" but not "tail call" in the list of supported instructions.

Do I understand correctly that this means that reliable tail call
optimization will not be possible under PNaCl as currently imagined?

That would be a real shame. Languages such as Scheme, Haskell,
Erlang, and Lua require tail call optimization. Working around the
lack of TCO with trampolines degrades performance, requires a major
reworking of the compiler front-end, and is ugly. Such hacks really
shouldn't be needed in 2013, particularly when LLVM already went to
the trouble of supporting TCO for exactly these good reasons.

The JVM made this mistake in its byte code and many groups such as the
Clojure and Scala communities have been clamoring for years to get TCO
into the JVM. ECMAScript has this error as well, but it's forgivable
as JavaScript wasn't originally intended to be, as it is now, a
compiler target.

Indeed, I believe much of the enthusiasm for (P)NaCl stems from the
hope that we'll finally be able to compile arbitrary languages for the
browser without being unduly hampered or constrained. Without TCO the
excitement here would be diminished. Functional programming languages
would be kneecapped by the bytecode.

If my understanding above is correct, how can we get PNaCl to support
a sufficiently general calling convention to make TCO possible?

The "Stability of the PNaCl bitcode ABI" [3] document notes that the
translator will be restoring the fastcc convention (though issue #2346
notes this may only happen after the first release). Perhaps we could
support the "tail call" instruction and have the translator ensure an
appropriate calling convention is restored when that is seen? Or
perhaps we could revisit our reluctance to support multiple calling
conventions? Or perhaps we could address the issues with fastcc that
are causing us to reject it, or create a new calling convention that
is simultaneously acceptable for our needs and supports TCO?

Certain individuals who I judged might be interested in working out a
solution here were CCed on an otherwise identical version of this
message that did not go out to the list.

Thanks,

-- TC

[1] http://www.chromium.org/nativeclient/pnacl/bitcode-abi

[2] http://llvm.org/releases/3.3/docs/LangRef.html#callingconv

[3] https://sites.google.com/a/chromium.org/dev/nativeclient/pnacl/stability-of-the-pnacl-bitcode-abi

I’m not familiar with PNaCl but, FWIW, I’d say the three main advancements the CLR made over the JVM are:

· Structs (aka value types).

· Reified generics. http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf

· Tail call elimination. http://research.microsoft.com/pubs/69132/babel01.pdf

Structs give you more freedom around memory representation (e.g. the CLR can represent an array of pairs of floats and ints in a single contiguous block of memory whereas the JVM cannot). The main practical benefit is performance but also interoperability (e.g. SOA vs AOS with GPUs). In particular, as structs are (usually) unboxed they do not incur allocations, do not introduce indirections via pointers, and do not stress the garbage collector. In fact, structs can be used to completely avoid the garbage collector’s contribution to latency (Rapid Addition used this technique to eliminate GC latency from their commercial FIX implementation http://www.microsoft.com/en-us/download/details.aspx?id=20918). From an implementation perspective, supporting structs makes it possible to get away with a much less efficient garbage collector because programmers can simply avoid the GC when necessary.

Reified generics combined with structs make it possible to write very efficient generic collections (e.g. .NET Dictionary can be 17x faster than Java’s HashTable because it is stored as a single contiguous block of memory with no pointers http://fsharpnews.blogspot.co.uk/2010/05/java-vs-f.html).

Tail call elimination allows an unbounded number of recursive function calls in tail position to consume finite stack space. This is essential for traditional functional programming but has other practical applications, including representing state machines as mutually tail recursive functions and optimizing async code to avoid trampolines. F# has relied upon the tail calls built into .NET since 2001. Note that tail call elimination applies not only to static function calls but also to dynamic calls, and it is an optimization in space and not time (on .NET general tail calls are ~2x slower than non-tail calls).

Cheers,

Jon.

> we've published an initial version of the PNaCl bitcode reference
> manual online -
> http://www.chromium.org/nativeclient/pnacl/bitcode-abi. The PNaCl
> bitcode is a restricted subset of LLVM IR.
>
> Any comments would be most welcome.

Hi Eli,

Thank you for opening the process up for input and comments. One
question stood out to me while reading the document:

The document [1] indicates that only the 'ccc' calling convention will
be supported. The LLVM documentation [2] prominently notes that,
"tail calls can only be optimized when [fastcc], the GHC or the HiPE
convention is used." Further, I notice that the document includes
"call" but not "tail call" in the list of supported instructions.

Do I understand correctly that this means that reliable tail call
optimization will not be possible under PNaCl as currently imagined?

Not at all. The tail call elimination pass runs during mid-level
optimizations, before we strip calling conventions for PNaCl ABI
stabilization. The "tail" marker on calls is not forbidden in PNaCl.
However, it would be fair to say that we didn't invest a lot of effort into
making sure TCO works in all cases because TCO support for C and C++ in
LLVM is very rudimentary. For example, in LLVM 3.3 (the version the current
preview of PNaCl is based on), this is the implementation of
AllocaMightEscapeToCalls:

static bool AllocaMightEscapeToCalls(AllocaInst *AI) {
  // FIXME: do simple 'address taken' analysis.
  return true;
}

Michael Gottesman (+CC) improved this a couple of weeks ago - this area is
still work in progress. While we'll be happy to have other languages target
PNaCl, our current focus is on C and C++. Also see below.

That would be a real shame. Languages such as Scheme, Haskell,
Erlang, and Lua require tail call optimization. Working around the
lack of TCO with trampolines degrades performance, requires a major
reworking of the compiler front-end, and is ugly. Such hacks really
shouldn't be needed in 2013, particularly when LLVM already went to
the trouble of supporting TCO for exactly these good reasons.

The JVM made this mistake in its byte code and many groups such as the
Clojure and Scala communities have been clamoring for years to get TCO
into the JVM. ECMAScript has this error as well, but it's forgivable
as JavaScript wasn't originally intended to be, as it is now, a
compiler target.

Indeed, I believe much of the enthusiasm for (P)NaCl stems from the
hope that we'll finally be able to compile arbitrary languages for the
browser without being unduly hampered or constrained. Without TCO the
excitement here would be diminished. Functional programming languages
would be kneecapped by the bytecode.

If my understanding above is correct, how can we get PNaCl to support
a sufficiently general calling convention to make TCO possible?

The "Stability of the PNaCl bitcode ABI" [3] document notes that the
translator will be restoring the fastcc convention (though issue #2346
notes this may only happen after the first release). Perhaps we could
support the "tail call" instruction and have the translator ensure an
appropriate calling convention is restored when that is seen? Or
perhaps we could revisit our reluctance to support multiple calling
conventions? Or perhaps we could address the issues with fastcc that
are causing us to reject it, or create a new calling convention that
is simultaneously acceptable for our needs and supports TCO?

This is exactly the kind of feedback we're interested in, and the PNaCl ABI
is not set in stone. I'm glad that you went over the "stability" document;
it means you noticed that we don't preclude expanding the scope of the
bitcode in the future. We do plan to ensure backwards compatibility, but
features can be added. If you, or any other interested party, wants to
investigate compiling Haskell or some other functional language to PNaCl -
please go ahead. Feel free to contact us personally or on the NaCl mailing
lists (or even in llvmdev@ for issues that are strictly LLVM IR related)
w.r.t. any problems you encounter. If there are additional
modes/instructions/attributes that need to be added to the ABI, we'll
consider it taking into account all the other constraints.

Eli

Hi Eli,

Recently, I proposed some changes to LLVM to do more lowering of illegal types (like i128 or i17) and other things within the LLVM IR layer, and the proposal was roundly rejected by the LLVM community:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/061567.html

PNaCl is essentially doing what my proposal described. How do you expect to reconcile the community’s desire to avoid doing lowering on LLVM IR with PNaCl’s design that is built on doing lowering on LLVM IR?

Dan

I expect we'll have to continue maintaining our own integer lowering IR
pass for PNaCl, unless upstream LLVM has a change of heart and decides to
add an implementation of such an IR pass.

The pass isn't excessively complex so I don't think this will be a big
burden, although it does only handle increasing the integer size (e.g. i17
-> i32), and not splitting integers up (i128 -> two i64s). Having this
pass certainly seems preferable to allowing arbitrary-sized integer types
in the PNaCl ABI.

Here's the current implementation of the pass:
http://git.chromium.org/gitweb/?p=native_client/pnacl-llvm.git;a=blob;f=lib/Transforms/NaCl/PromoteIntegers.cpp;h=017e4976f27d2557b67c3b563e3685d140517808;hb=b9657234ee8b1951db5977a8ffb55a2e5df6d76c

For comparison, Emscripten does much the same thing. It has its own
integer legalisation pass, but it's implemented in Javascript rather than
in C++.

Cheers,
Mark

I don't think that the problem was lowering in the IR, it was extending the IR to support all of the machine-dependent types that would be needed to lower for arbitrary target architectures. PNaCL only has to lower for a specific abstract architecture that only supports types that can already be represented in the IR.

David

> we've published an initial version of the PNaCl bitcode reference
> manual online -
> http://www.chromium.org/nativeclient/pnacl/bitcode-abi. The PNaCl
> bitcode is a restricted subset of LLVM IR.
>
> Any comments would be most welcome.

Hi Eli,

Thank you for opening the process up for input and comments. One
question stood out to me while reading the document:

The document [1] indicates that only the 'ccc' calling convention will
be supported. The LLVM documentation [2] prominently notes that,
"tail calls can only be optimized when [fastcc], the GHC or the HiPE
convention is used."

That note in the documentation seems to be incorrect, because LLVM will do
tail call optimisations on at least x86 when using the "ccc" calling
convention. For example:

$ cat tail_call1.c
void foo(int arg);
void bar(int arg) {
  foo(arg);
}

$ clang tail_call1.c -S -o - -O2
...
bar: # @bar
...
    jmp foo # TAILCALL
...

However, LLVM doesn't emit a tail call at -O0.

Maybe what the documentation means to say is that tail call optimisations
are only guaranteed to be done when using fastcc etc.?

Further, I notice that the document includes

"call" but not "tail call" in the list of supported instructions.

That's an omission in the document. We do actually allow the "tail call"
instruction in PNaCl.

However, we haven't specified any guarantees that a tail call optimisation
will happen. I suppose we could specify some guarantees if people want
this. Maybe we could say that a "tail call" instruction is guaranteed to
be turned into a real tail call if the callee function has the same number
of arguments as the caller function or fewer? I think that would work on
all the architectures PNaCl targets.

Then we would have to change -O0 translation so that the guarantee is
provided consistently at all optimisation levels.

Cheers,
Mark

Hi Mark,

That note in the documentation seems to be incorrect, because LLVM
will do tail call optimisations on at least x86 when using the "ccc"
calling convention. For example:

$ cat tail_call1.c
void foo(int arg);
void bar(int arg) {
  foo(arg);
}

$ clang tail_call1.c -S -o - -O2
...
bar: # @bar
...
    jmp foo # TAILCALL
...

That's actually a sibling call, which is a much less general form.

See:

  http://llvm.org/releases/3.3/docs/CodeGenerator.html#sibling-call-optimization

The above C roughly translates into:

  declare i32 @callee(i32)
  define i32 @caller(i32 %a1) {
    %1 = tail call i32 @callee(i32 %a1)
    ret i32 %1
  }

Which, as you note, produces a jmp. However, simply extending the
argument list of the callee breaks this on x86-32:

  declare i32 @callee(i32, i32)
  define i32 @caller(i32 %a1) {
    %1 = tail call i32 @callee(i32 %a1, i32 0)
    ret i32 %1
  }

That results in a call rather than a jmp. You won't get a jmp back
unless you write:

  declare fastcc i32 @callee(i32, i32)
  define fastcc i32 @caller(i32 %a1) {
    %1 = tail call fastcc i32 @callee(i32 %a1, i32 0)
    ret i32 %1
  }

(Or use cc10 rather than fastcc.)

You can even break the tail call elimination by simply passing a value
not in the caller's incoming argument stack (or one in the wrong
position), again on x86-32:

  declare i32 @callee(i32)
  define i32 @caller(i32 %a1) {
    %a = add i32 %a1, %a1
    %1 = tail call i32 @callee(i32 %a)
    ret i32 %1
  }

x86-64 will treat more things as sibling calls because the first six
integer arguments are passed in registers, so nothing reaches the stack
until the seventh. But at that point the same thing happens on x86-64.
This results in a call rather than a jmp:

declare i32 @callee(i32, i32, i32, i32, i32, i32, i32)
define i32 @caller(i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7) {
  %1 = tail call i32 @callee(i32 %a7, i32 %a6, i32 %a5, i32 %a4, i32 %a3, i32 %a2, i32 %a1)
  ret i32 %1
}

However, LLVM doesn't emit a tail call at -O0.

Maybe what the documentation means to say is that tail call
optimisations are only guaranteed to be done when using fastcc etc.?

It's a bit ambiguous because the author presumably intended to
distinguish tail calls from sibling calls as is the convention in the
C world.

    Further, I notice that the document includes
    "call" but not "tail call" in the list of supported instructions.

That's an omission in the document. We do actually allow the "tail
call" instruction in PNaCl.

Great.

However, we haven't specified any guarantees that a tail call
optimisation will happen. I suppose we could specify some
guarantees if people want this.

I found a thread where Evan Cheng proposed removing -tailcallopt from
LLVM. It resulted in a number of language implementers who require
guaranteed tail calls coming out of the woodwork to express their
displeasure at the idea. It also resulted in some good discussion
about why this is necessary and desirable.

  http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-February/029191.html

The ability to guarantee tail call elimination is required for
implementing functional languages. The Scheme standard requires all
conforming implementations to guarantee that tail calls are
eliminated.

Microsoft would probably have never tried building F# on the CLR if
the CLR could not promise this.

Functional languages see tail call elimination as a matter of language
semantics rather than as an optimization.

Maybe we could say that a "tail call" instruction is guaranteed to
be turned into a real tail call if the callee function has the same
number of arguments as the caller function or fewer? I think that
would work on all the architectures PNaCl targets.

Well, it would certainly be better than not doing it, and perhaps it
could be a step toward broader support, but it doesn't really solve
the issue as functional languages rely on the semantics of tail calls
being eliminated in all possible cases.

What you're describing here is more general than sibling calls, as
sibling calls have additional constraints such as requiring all
arguments passed to the callee on the stack come from the caller's
incoming argument stack.

We really should find a plan for achieving support for full tail call
elimination. Otherwise we'll end up with a rather embarrassing
regression as compared to the CLR (and why should Microsoft have all
the fun?).

Then we would have to change -O0 translation so that the guarantee
is provided consistently at all optimisation levels.

That'd be ideal, yes. When the client runs llc, is it going to run at
a fixed optimization level, a client-determined one, or one specified
by metadata that accompanies the code?

Best,

-- TC

That’d be ideal, yes. When the client runs llc, is it going to run at
a fixed optimization level, a client-determined one, or one specified
by metadata that accompanies the code?

You mean for PNaCl? The developer can run LLVM’s optimizations on their machine and generate a pexe. The pexe and manifest are then served to users, and this manifest specifies which optimization level to use to translate the pexe to a nexe on users’ devices.

You can therefore choose between SelectionDAG and FastISel (fast code or fast translation) on the user’s device without giving up all performance because the non-target-specific optimizations have been executed already.

This is something that we’ve benchmarked fairly closely and plan to improve.