Tail-calling

You can skip the first two paragraphs if you just want to get to the
description; the 'real' question I want answered is in the bottom-most
paragraph and may make enough sense without reading everything else,
but there are a few other hidden questions and opinions scattered
through the rest of it.

This little scripting language I am making is basically going to be
C with Actors (much as C++ started out essentially as C with
Classes). Just as C++ is not a 'real' OO style (little tid-bit: the
person who originally coined 'object-oriented programming' was
actually describing something closer to Actor-oriented programming,
but I am using the common terms here) yet is 'enough' to be useful, I
am going to put in just enough Actor-type machinery to make it useful,
while trying to retain as much speed as possible. For the sake of the
Actor model I need to strip out a lot of standard C constructs (no
more mutable globals, for example), so I am basically just keeping the
C syntax (I may still allow mutable globals and simply trust the coder
to 'do the right thing', while giving plenty of warnings to scare away
the newbies). The purpose of this little scripting language is to do
a lot of math processing (plus other light-weight event handling and
such), but it needs to scale, not just to multiple cores but to
multiple computers. I already have a back-end written in C++ that I
use for C++ apps (with a ton of scaffolding; the point of this little
language is to get rid of the ugly scaffolding and make it into
something pleasant to write, while allowing other users of my app to
code for it as well), as well as a design for how the system will
work, so the 'hard' part is pretty much done; now it is time for the
tedious part.

At the level of the language, not the implementation, I plan on having
two types of functions. The first will behave like a normal C
function: return value, arguments, etc., nothing special. The second
type will basically have no return value; when it returns, the tasklet
'ends' (this will be cleared up shortly; tasklet calls further down
the stack can return values, but the 'main' tasklet function, the one
whose return kills the tasklet, cannot have a return value). There
will be a couple of built-in functions in the language (most likely
only 3; the rest will be built from those inside the language itself;
these 3 will hereafter be referred to as switching functions) that are
capable of 'suspending' a tasklet (using Stackless Python terminology
here) and switching to something else. A tasklet will be created by
an explicit call, passing in the function that will run the tasklet.

Internally, normal functions will be just that, nothing special. The
tasklet-style functions are unique: a function is tasklet-capable if
one of the switching functions appears anywhere down its call chain.
Normal functions cannot call tasklet functions, but tasklet functions
can call normal functions.

A switch occurs when an explicit switching call is made, or when the
bottom-most tasklet function on the tasklet stack returns. When a
switch occurs, one of two things will happen, depending on what the
scheduler the tasklet is assigned to wants to do: at the LLVM level it
will either return, or it will tail-call another tasklet function,
which means calling the function of a tasklet that is defined to be
the continuation function.

As you can probably now see from the design, this relies on (proper)
tail-calls working... all the time.

When the code is compiled, normal functions come out as normal, and a
tasklet function that calls a normal function calls it as normal:
registers, stack, everything. However, if a tasklet function calls
another tasklet function, then at the LLVM level the caller is split
into a couple of functions (with further splits happening for any
later tasklet function calls) so that the call becomes a tail-call.

At this point, refer to the link
http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt
which I am using as a basis, albeit with some minor changes in the
data passed around.

First difference: I am not using a garbage collector (not at the
moment, anyway, and the design of the language may not require one at
all; most things will live on the fake tasklet stack, which is managed
by the scheduler, and the rest will mostly be messages, managed
internally by a different system I originally wrote for C++). The
rest is pretty similar, except that I have two styles in mind, one of
which is much different than what he described.

Since tasklet functions cannot be called through a function pointer,
only compiled in directly, and since normal functions can still be
called any which way (they will always return before a switch happens;
otherwise they would have been tasklet functions), I will know exactly
how much memory a whole tasklet call stack needs down to its deepest
point. I am also not relying on recursion for looping, it will still
be C++ style for that, so tail-call optimization inside the language
itself is not necessary.

Because of this, though, I have two 'obvious' ways to manage the
memory for a tasklet. First, I can follow sabre's method and allocate
(from the scheduler) the memory for each continuation individually as
it occurs. That seems overly costly and inefficient, especially since
I have all the necessary information up front, but it is nice in that
not all the memory is allocated at once; in other words, only the
memory actually in use is allocated, no 'dead weight'. The second
method is to allocate all the memory a tasklet will need down to its
deepest tasklet call on its call stack (which in the generic case is
still not much, probably a few hundred bytes max considering the
system I will be using this on). This should be faster due to not
needing to get memory for each tasklet function invocation, but it
carries 'dead weight' in the unused parts: a tasklet stack could be
very light except for one deep call that needs a few KB, or a few
hundred KB, in a function that may only be called once on that call
stack, leaving a tremendous amount of unused memory for the rest of
the life of that tasklet.

However, I could combine the two methods, either in the compiler
(heuristically guessing?), or through explicit language syntax such as
a qualifier on the function declaration (I would expect the full
allocation to be the norm, but a deep function that needs lots of
memory and still switches could be declared to allocate that memory
only upon its creation), or perhaps a slightly different way of
calling the function (so instead of being defined by the function, it
would be defined by the caller, which gives more control to suit the
circumstance, but is generally more work for the coder; that is
language design and not really relevant here, but I am curious about
thoughts regardless).

The broken (non-first, i.e. second and onward) tasklet function
signature would probably just be something along the lines of:

    schedulerRelevantEnum aTaskletFunctionName(topOfStackOrContinuationPtr *ptr,
                                               someType ReturnValueFromFunctionBreak)

Basically it follows sabre's document pretty closely in this regard.
The top-most call of a tasklet function (before splitting) would not
carry the return-value parameter (since something else is calling it),
nor would tasklet function calls that return nothing. As stated,
though, it is pretty much identical to sabre's method in the above
link; use it, and if the nameserver is down again use the Google cache at
http://google.com/search?q=cache:http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt&hl=en&rlz=1C1GGLS_enUS291&sa=G&strip=1

So, the main question: is this style now feasible with LLVM on modern
x86 (32- and 64-bit) systems (both *nix and Windows based)? In other
words, do tail-calls always work now when the call happens at the end
of a function, even when the function is called through a function
pointer (do tail-calls on function pointers work correctly right now,
or is this whole project for naught until later)?

Tail calls are disabled by default; to enable them, you have to pass
-tailcallopt to llc. It looks like there's a bug preventing indirect
tail calls from working... I don't think there's anything
fundamentally preventing them from working, though.

-Eli

Eli Friedman wrote:

Tail calls are disabled by default; to enable them, you have to pass
-tailcallopt to llc. It looks like there's a bug preventing indirect
tail calls from working... I don't think there's anything
fundamentally preventing them from working, though.

I will not be using llc; it will all be through the C++ API. I looked
through the code a bit but could not see any direct way to turn on
tail-calls; the only thing I could really find was a single global
(yes, a global... ugh...) that many things referenced, and if I set
that then things should use the tail-call versions.

What about this indirect bug, is anyone working on it? I would try,
but I am quite certain I do not yet have the LLVM internals knowledge
to do so (or even know where to look).

Arnold implemented tail call. We should all urge him to continue his work to make it even better. :slight_smile:

Evan

Tail calls through function pointers should work. If not, please send a testcase.

I just added the two examples from the bug (1392) that calls for true
tail call support. They work on my machine (-tailcallopt needs to be
enabled) :wink:

That would be commit 56127.

regards

Quick question: how do you enable tail calls for the JIT when using
the API, not an external app? In my program it is all internal,
nothing external is called, and the only thing I saw was that one
global on a quick look through.

I'm trying to implement *MUL_LOHI for my processor.

My processor has mulxss (e.g.) that gives the 32 high bits of a 64 bit multiply.

I tried this in Nios2ISelDAGToDAG.cpp:
    /// Mul/Div with two results
     case ISD::SMUL_LOHI:
     case ISD::UMUL_LOHI: {
       SDValue Op1 = Node->getOperand(0);
       SDValue Op2 = Node->getOperand(1);
       AddToISelQueue(Op1);
       AddToISelQueue(Op2);

       unsigned Op;
       Op = (Opcode == ISD::UMUL_LOHI ? Nios2::MULxu : Nios2::MULx);
       SDNode *Hi = CurDAG->getTargetNode(Op, MVT::Flag, Op1, Op2);
       SDNode *Lo = CurDAG->getTargetNode(Nios2::MUL, MVT::Flag, Op1, Op2);
       if (!N.getValue(0).use_empty())
         ReplaceUses(N.getValue(0), SDValue(Lo,0));
       if (!N.getValue(1).use_empty())
         ReplaceUses(N.getValue(1), SDValue(Hi,0));
       return NULL;
     }

The code generator complains:
nios2-elf-ecc: /home/rich/llvm-trunk-new/lib/CodeGen/SelectionDAG/ScheduleDAG.cpp:141: void llvm::ScheduleDAG::BuildSchedUnits(): Assertion `N->getNodeId() == -1 && "Node already inserted!"' failed

I'm guessing that's because I'm reusing Op1 and Op2.
What is the right way to reuse DAG operands?

-Rich

I haven't looked at the rest of the email carefully, but why aren't
you just implementing MULHU and MULHS? There's no point to
implementing the *MUL_LOHI variants if the processor doesn't have
them.

-Eli

For marking a call site as a tail call you have to call void
setTailCall(bool isTC = true) on the CallInst. The calling convention
of the function needs to be CallingConv::Fast (currently only fastcc
calls are optimized) and the call has to be in tail position of
course.

To enable tail call optimization in an embedded VM, I guess you would
include llvm/Target/TargetOptions.h and set the static variable
PerformTailCallOpt to true.

Take this with a grain of salt or two since I have never done it :wink:
but I guess that setting those globals is what
cl::ParseCommandLineOptions does in lli/llc.cpp.

regards arnold

The function performing the tail call also needs to be CallingConv::Fast.

Some info about tail call optimization is located at
<http://llvm.org/docs/CodeGenerator.html#tailcallopt>.
Some examples are in llvm/test/CodeGen/X86/tailcall*.
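Pulled together, a sketch of the whole recipe against the LLVM
2.3-era C++ API might look like this (untested; `Caller`, `Callee`,
`Args`, and `BB` are assumed to already exist, and the header paths
follow the 2.x tree layout):

```cpp
#include <vector>
#include "llvm/CallingConv.h"
#include "llvm/Instructions.h"          // CallInst, ReturnInst
#include "llvm/Target/TargetOptions.h"  // llvm::PerformTailCallOpt

void emitTailCall(llvm::Function *Caller, llvm::Function *Callee,
                  std::vector<llvm::Value*> &Args, llvm::BasicBlock *BB) {
  using namespace llvm;

  // The global switch that -tailcallopt flips; set once before codegen.
  PerformTailCallOpt = true;

  // Both the calling function and the callee must use fastcc.
  Caller->setCallingConv(CallingConv::Fast);
  Callee->setCallingConv(CallingConv::Fast);

  // Emit the call, mark it as a tail call (with matching calling
  // convention), and return its result immediately so the call really
  // sits in tail position.
  CallInst *CI = CallInst::Create(Callee, Args.begin(), Args.end(), "", BB);
  CI->setCallingConv(CallingConv::Fast);
  CI->setTailCall();
  ReturnInst::Create(CI, BB);
}
```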

regards arnold

This isn't due to node reuse. Rather, it's the scheduler complaining about a malformed DAG. You should dump the DAG before scheduling and look for multiple uses of a FLAG value.

Evan

Arnold Schwaighofer wrote:

Take this with a grain of salt or two since i have never done it :wink:
but i guess that setting those globals is what
cl::ParseCommandLineOptions does in lli/llc.cpp.

I can confirm that this is actually how it works (fastcc on the function + setTailCall on the call instruction + llvm::PerformTailCallOpt = true). I might add that of course you also have to make sure that the tail call is immediately before a ret instruction. :slight_smile:

Note that it also depends on the JIT for the target architecture whether this is actually supported. I verified that this does work on x86(-64). Not sure about the status on other archs, though; it seems that at least the LLVM 2.2 JIT didn't do any proper tail calls on ppc yet.

Albert

Hi Albert,

nice to see someone is using the tail call stuff. i am responsible for
the mess (feature) ;).

Tail call optimization currently (llvm 2.3) works (to varying degrees,
see http://llvm.org/docs/CodeGenerator.html#tailcallopt)
for x86(-64) and ppc32/64. x86 being more mature (testing) than ppc.
ppc64 has received the least testing.

regards

Eli Friedman wrote:

I haven't looked at the rest of the email carefully, but why aren't
you just implementing MULHU and MULHS? There's no point to
implementing the *MUL_LOHI variants if the processor doesn't have
them.

I have implemented MULHU and MULHS. But if I take out my *MUL_LOHI stuff, the error I get is

[~/ellcc/ellcc] main% ./nios2-elf-ecc -S test.c
Cannot yet select: 0xaf93a34: i32,i32 = umul_lohi 0xaf9345c, 0xaf93924

What could I be doing to make the code generator think that umul_lohi is legal?

-Rich

Richard Pennington wrote:

Eli Friedman wrote:

I haven't looked at the rest of the email carefully, but why aren't
you just implementing MULHU and MULHS? There's no point to
implementing the *MUL_LOHI variants if the processor doesn't have
them.

I have implemented MULHU and MULHS. But if I take out my *MUL_LOHI stuff, the error I get is

[~/ellcc/ellcc] main% ./nios2-elf-ecc -S test.c
Cannot yet select: 0xaf93a34: i32,i32 = umul_lohi 0xaf9345c, 0xaf93924

What could I be doing to make the code generator think that umul_lohi is legal?

-Rich

In your target lowering you need to set the operation action for ISD::*MUL_LOHI to Expand; otherwise it will be assumed to be legal.
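In this case that would be a couple of lines in the Nios2TargetLowering constructor, something along these lines (untested sketch, mirroring what the in-tree targets do):

```cpp
// Declare the two-result multiply nodes illegal so that LegalizeDAG
// expands umul_lohi/smul_lohi into the MULHU/MULHS (plus MUL) that
// the target does implement.
setOperationAction(ISD::SMUL_LOHI, MVT::i32, Expand);
setOperationAction(ISD::UMUL_LOHI, MVT::i32, Expand);
```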

Richard

Richard Osborne wrote:

In your target lowering you need to set the operation action for ISD::*MUL_LOHI to expand otherwise it will be assumed to be legal.

Richard

Ah! Thanks! That's the bit I was missing.

-Rich

Arnold Schwaighofer wrote:

nice to see someone is using the tail call stuff. i am responsible for
the mess (feature) ;).

Thanks a bunch. Without it, my FPL (http://pure-lang.sf.net) would be much less useful because it doesn't have a built-in loop construct. :slight_smile:

Tail call optimization currently (llvm 2.3) works (to varying degrees,
see http://llvm.org/docs/CodeGenerator.html#tailcallopt)
for x86(-64) and ppc32/64. x86 being more mature (testing) than ppc.
ppc64 has received the least testing.

That's good news, thanks for the info!

Cheers,
Albert

Thanks for the confirmation on use. I need it to create a C-syntax
actor-based scripting language for my program (I may include an
optional assembler and linker for turning the 'cache' files into real
loadable libraries, but that will be further down the road; I will be
adding pathways for such a thing, however). I will not use it for
loops and such, but I will be slicing functions up at the call
boundaries of other functions that are capable of 'pausing execution'
(with a return statement immediately following, which in some builds
may never actually be reached, hence tail-call optimization is pretty
important).

The only platform I need it for personally is x86(-64), and
considering the large number of languages already out there I doubt
anyone would really be interested in mine, but if anyone ever was,
I would add support for compiling down to a native executable, which
could mean needing multiple-architecture support.

I am still split on whether I should offer things like a native
128-bit integer on 64-bit platforms, when such things would not be
available on a 32-bit platform, or whether I should just go with the
lowest common denominator (plain x86) so the code 'just works'...

AFAIK, 128-bit integers should work just fine on x86 with LLVM.

-Eli

Bad example, then; what I meant is that I am wondering how my language
should handle things that are platform-specific: whether I should
expose them and let the user make 'best-judgment calls', which could
break on the distributed system when there are multiple programmers,
or whether I should keep to a limited instruction set that works
mostly everywhere. Thoughts?

Does the x86 (32-bit) backend really support 128-bit integers, or does
LLVM just play nice by splitting them up among registers or something?
And if it does, what is the maximum 'reasonable' integer size on
32-bit x86?