Named Register Implementation

Folks,

So, I think we broadly agree that some named registers could be
implemented, and that the mechanism should be an intrinsic that passes
the name of the register down.

This C code:

register unsigned long current_stack_pointer asm("sp");
unsigned long get_stack_pointer_addr() {
  return current_stack_pointer;
}
void set_stack_pointer_addr(unsigned long addr) {
  current_stack_pointer = addr;
}

Would become something like:

define i32 @get_stack_pointer_addr() nounwind {
entry:
  %0 = call i32 @llvm.read_register("sp")
  ret i32 %0
}

define void @set_stack_pointer_addr(i32 %addr) nounwind {
entry:
  call void @llvm.write_register("sp", %addr)
  ret void
}

Note particularly:
- There are no globals defined. Since you can't take the address of
that variable and can only read and write it directly, it should be
safe to translate all reads and writes into intrinsics without any
reference to a global variable.
- I'm letting the name be an argument (as opposed to metadata) for
simplicity. It might be better to use metadata or some other mechanism.

Is that a reasonable expectation of the implementation?

Now, onto specifics...

1. RegisterByName

I couldn't find a way to get a register by name. I could teach the
TableGen backend to print an additional table with a StringSwitch in
<Target>GenRegisterInfo.inc and add a getRegisterByName(char *)
method, but that would expose a huge number of unwanted register
classes, which could open a can of worms.

My idea was to be very specific and keep the implementation local, as
<Target>RegisterInfo::getNamedRegister(char *), which will be *just*
for named registers and will only map a few cases,
erring/warning/asserting otherwise.

With this, we can control which registers we accept, and only expand
the list when we actually implement support for them. Currently, the
only registers we will support are the non-allocatable ones, mainly
stack and program counters.

Since this is target specific, we could make the default behaviour be
to emit a "Register not available for named register globals" error.
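
For ARM, I imagine something as dumb as this (just a sketch; the class
name, the StringRef signature and the enum values are purely
illustrative):

// Sketch only: a target-local hook that maps the handful of
// non-allocatable registers we want to support and nothing else.
// Returning 0 ("no register") lets the caller emit the error above.
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"

unsigned ARMRegisterInfo::getNamedRegister(llvm::StringRef Name) const {
  return llvm::StringSwitch<unsigned>(Name)
      .Case("sp", ARM::SP)   // stack pointer
      .Case("pc", ARM::PC)   // program counter
      .Default(0);           // anything else: not supported (yet)
}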

Is there a better way of doing this?

2. Name in SDNode

If I leave the name as a string literal ("sp"), would that become an
i8* SDNode? I'm a bit rusty on SDNodes, but if I call
Node->getValue() on an i8*, I'll get its address. How do I get the
value as a string?

If I would use metadata, how do I pass it to the intrinsic?

@llvm.read_register(!0)
!0 = metadata !{metadata !"sp"}

Is this valid IR?

cheers,
--renato

Couldn't you avoid using a string literal in the SDNode, by having the
SelectionDAGBuilder look up the register id with the function you
proposed above and then add that id as the operand to the intrinsic's
SDNode?

-Tom

How are you going to represent the side-effects of such builtins?

-K

That's a good idea: change the argument to a register id in
SelectionDAGBuilder::visitIntrinsicCall() and use the TargetLowering
to retrieve the ids.

Thanks!
--renato

Do you mean the possible adverse effects they can have on the program
if you write to the wrong register at the wrong time? I don't think we
should represent that at all.

That brings me to a related question: is there an attribute that makes
sure those intrinsic calls don't get moved around, or hoisted out of
loops, etc? A function or parameter attribute that means the same as
"load volatile"?

cheers,
--renato

I mean how do you make sure that the "write" builtin does not look like dead code, and at the same time it's not treated as something that "changes everything". Do you expect a read of "sp" to have a dependency with a write of "ax"? If not, how is that going to be communicated to the optimizer?

-K

I mean how do you make sure that the "write" builtin does not look like dead
code, and at the same time it's not treated as something that "changes
everything".

On the IR level, I expect a call to an intrinsic to never be pruned.
But I also need stronger guarantees regarding code movement, and I'm
not sure there are any. Function calls that are not const can't be
moved around, so I expected that an intrinsic (being a function call
at the IR level) without any annotation asserting its safety would
guarantee that such movement doesn't happen.

That could be very optimistic, I know, thus my new thread asking for
those hard questions. :wink:

Do you expect a read of "sp" to have a dependency with a write
of "ax"?

No. On the IR level you don't have to worry about registers yet, and
at the DAG level, that intrinsic will be converted to a
CopyToReg/CopyFromReg, so the dependency is clearly on that register
only from there onwards.
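
Just to make that concrete, the DAG side should boil down to roughly
this once the name has been resolved to a physical register
(free-standing sketch; the real code would live in SelectionDAGBuilder
and the helper names are made up):

// Sketch of the nodes the intrinsics should turn into once the
// register name has already been mapped to a physical register id.
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// read: a plain CopyFromReg hanging off the current chain.
SDValue lowerReadRegister(SelectionDAG &DAG, SDLoc DL, SDValue Chain,
                          unsigned Reg, EVT VT) {
  return DAG.getCopyFromReg(Chain, DL, Reg, VT);
}

// write: a CopyToReg whose chain result becomes the new root, keeping
// the write ordered with everything else that uses the chain.
SDValue lowerWriteRegister(SelectionDAG &DAG, SDLoc DL, SDValue Chain,
                           unsigned Reg, SDValue Val) {
  return DAG.getCopyToReg(Chain, DL, Reg, Val);
}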

If not, how is that going to be communicated to the optimizer?

The high-level IR optimisation passes should treat the intrinsics as
opaque function calls, and the low-level pattern-matching
optimisations will already have the register in hand.

If you're not concerned about those, can you be more specific?

thanks,
--renato

If foo doesn't read memory, then it's legal to interchange these two:
   store x, 1
   call foo()

It means that you can, in fact, move a call around as long as some restrictions are met. Even if you restrict the intrinsics to only apply to non-allocatable registers, can you guarantee that such apparently safe code motion won't alter these registers? A perhaps hypothetical example could be reading "sp" on x86, while the code that has been moved around caused a spill and "push/pop" instructions to be generated. Even if this example is only hypothetical, couldn't something of that nature happen in practice?
On the other hand, if you treat the intrinsics as accessing memory, it would be strictly worse than inline asm.

-K

If foo doesn't read memory, then it's legal to interchange these two:
  store x, 1
  call foo()

I know that, but the question is, actually: how do you represent the
call to foo in IR so that it does *not* move around? Isn't that the
default behaviour for function calls without any attributes?

It means that you can, in fact, move a call around as long as some
restrictions are met.

Then let's make this call introduce as many obstacles as possible, so
that those restrictions are never met.

Even if you restrict the intrinsics to only apply to
non-allocatable registers, can you guarantee that such apparently safe code
motion won't alter these registers?

It'll never be guaranteed. More specifically, the semantics are
normally only valid in code order. The compiler cannot infer the
safety of these writes. Ever.

A perhaps hypothetical example could be reading "sp" on x86, while
the code that has been moved around caused a spill and "push/pop"
instructions to be generated. Even if this example is only
hypothetical, couldn't something of that nature happen in practice?

Absolutely, and we don't really care. It's up to the user to make sure
that his/her code is minimal enough so that these things don't happen.

*ALL* uses of it I could find in the kernel's unwind code only read
from the stack pointer, never write to it. I think whoever uses this
kind of trick should be well aware that "here be dragons" and that
there are no guarantees of safety against compiler allocation,
spilling, etc.

On the other hand, if you treat the intrinsics as accessing memory, it would
be strictly worse than inline asm.

Yup. That's the idea. :wink:

I'll use the same argument I've used for __builtin___clear_cache:

Users of those extensions know *precisely* what they're getting into,
and they *only* do it because there is no better way of doing this.
Performance may be a problem in this case (it wasn't in the cache
case), but the order in which the instructions are scheduled won't
matter *that* much here, and some scheduling inefficiencies are
accepted due to the quirky nature of the extension.

Other arguments I heard:

Uses of those extensions only exist because the system can't cope with
their needs, and by extension, the compiler shouldn't try to judge
what's best either. So the best course of action is to do exactly what
the user asks and let them figure out the best way to write their own
non-portable, non-standard code.

We're bound to hit odd problems like these when compiling the kernel;
it was just a matter of time...

cheers,
--renato

Should that be something like:

declare void @llvm.write_register.sp(i32 val)
declare i32 @llvm.read_register.sp()

define i32 @get_stack_pointer_addr() nounwind {
entry:
  %0 = call i32 @llvm.read_register.sp()
  ret i32 %0
}

define void @set_stack_pointer_addr(i32 %addr) nounwind {
entry:
  call void @llvm.write_register.sp(i32 %addr)
  ret void
}

Then you don't need to clutter the IR with metadata or a constant i8* to
identify the register.
There's precedent for overloading intrinsics.

I'd like to separate this into two different functionalities:

(1) Reserve registers, so that normal allocation won't use them.
This can be done on a global or function level.

(2) Provide intrinsics to read/write a given register.

The former is also required to implement -ffixed-XXX; the latter would
be used to implement named registers in combination with (1).

Joerg

There are two problems with that:

1. Front-ends have to know all registers that are supported by all
LLVM back-ends for that functionality. If the overloading is
target-independent but language-dependent, it makes sense. But
especially in this case, where we'll implement the feature in multiple
steps, it'll be very complicated to coordinate all parts.

2. I'd have to create one multiclass for the intrinsic TableGen
pattern for each supported architecture. Given that Intrinsics.td is
largely (if not completely) target independent, I'd hate to start a
pattern that shouldn't be there in the first place.

Your argument against adding a global string variable or a metadata
node is valid; this is why my first proposal is to use the string
directly from the asm() construct.

However, if your argument is to use register.sp to mean "the stack
pointer", that would require front-ends to know which register is the
stack pointer on each backend, which is something I don't think they
should do.

cheers,
--renato

I'd like to separate this into two different functionalities:

We will. But in reverse order.

(1) Reserve registers, so that normal allocation won't use them.
This can be done on a global or function level.

This is the most controversial part of the proposal and the least
important for making all known cases work. I believe that having
access to registers in that way can only be explained by the
compiler's inefficiency at noticing that a register variable should
not spill, and it could easily be made a hard rule by a given flag, so
that register pressure still wouldn't spill the variable.

This is similar to named registers, but with the additional benefit
that the compiler is free to choose the register, while in the latter
case the user does. It would also work with all standard C code
already out there using the `register` keyword. But that's for another
discussion...
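
For reference, targets already keep registers away from the allocator
by overriding getReservedRegs(), so I'd expect a user-driven
reservation to hook in somewhere around there (sketch below; the flag
and the ARM names are purely illustrative):

// Sketch only: reservation would presumably build on the existing
// getReservedRegs() hook that targets already use to keep registers
// such as SP/PC away from the allocator.
#include "llvm/ADT/BitVector.h"
#include "llvm/CodeGen/MachineFunction.h"

llvm::BitVector ARMRegisterInfo::getReservedRegs(
    const llvm::MachineFunction &MF) const {
  llvm::BitVector Reserved = ARMBaseRegisterInfo::getReservedRegs(MF);
  if (ReserveR9)               // hypothetical -ffixed-r9 style option
    Reserved.set(ARM::R9);     // hands-off for the register allocator
  return Reserved;
}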

(2) Provide intrinsics to read/write a given register.

That's the idea right now. Just that. No guarantees. Aperture Labs
kind of science (not Black Mesa).

cheers,
--renato

I'd prefer to use

declare void @llvm.write_register(i32 regno, i32 val)
declare i32 @llvm.read_register(i32 regno)

where regno is the DWARF register number or a special reservation,
e.g. for IP or SP.

Joerg

Do front-ends have that info at hand? AFAICR, front-ends only emit
metadata related to variables and let the lowering process deal with
locations. I want to add as little as possible to front-ends for this.

If they do, it should be trivial to translate from DWARF registers to
TLI ones.
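
For instance, MCRegisterInfo already carries the DWARF mapping, so the
translation itself would be something like (sketch; the wrapper is
only for illustration):

// Sketch: translating a DWARF register number back into an LLVM
// register using the table MCRegisterInfo already has.
#include "llvm/MC/MCRegisterInfo.h"

int dwarfToLLVMReg(const llvm::MCRegisterInfo &MRI, unsigned DwarfRegNo) {
  // isEH = false: use the plain DWARF numbering, not the EH variant.
  return MRI.getLLVMRegNum(DwarfRegNo, /*isEH=*/false);
}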

cheers,
--renato

I'm not sure if we have a generic mapping of textual name to DWARF
register number right now, but that would be easy to provide. In terms
of lowering, would this be another mode, like TLS already is?

Joerg

I disagree. It is the *easy* part to get many known users to work. This
includes a bunch of kernels, Lisp implementations etc. The rest can be
implemented on top by hand using inline asm, so this is the crucial
part.

Joerg

I disagree. It is the *easy* part to get many known users to work. This
includes a bunch of kernels, Lisp implementations etc. The rest can be
implemented on top by hand using inline asm, so this is the crucial
part.

Let me re-phrase my opinion...

From all the discussions on the LLVM list, the issue that has most
consistently shown up as a problem is the reservation of allocatable
registers.

Some background...

My original intent was to be able to support the kernel unwinding
code. There was a discussion in which we agreed that a builtin
stack_pointer would do for that. But the GCC community voiced two
concerns: there is already a widely used feature for this, and that
feature is more generic.

Back on the LLVM list, people were concerned with the implementation
and semantics of reserving allocatable registers (yes, we do reserve
some of them specially here and there, but this is a generic,
user-driven, module-local reservation mechanism, which can put an
unknown strain on the register allocator). Folks then agreed that a
two-step implementation would make sense, given the original
objectives: first for non-allocatable registers, then a register
reservation mechanism could be introduced.

I agree that, to make it work like GCC, reserving registers would be
the first thing to implement, but we're not going to support the whole
shebang from the start, so we can begin small. For the non-allocatable
registers specifically, a named register is as good as the builtin,
and that would mean no changes to code that already uses it, which is
always a win.

The plan...

0. Map all uses of named registers in the kernel: they seem to be
non-allocatable registers only.
1. Implement named registers for non-allocatable registers
2. Map uses of allocatable registers (libs, dynamic languages,
kernels) and make sure there is no other way to do that.
2.a. If there is, do that.
2.b. If not, implement register reservation and expand named
registers to those classes (this is contentious, which is why I'm
avoiding it for now).

cheers,
--renato

1. Implement named registers for non-allocatable registers

I'm not sure if this would work even for non-allocatable registers. I fear this may codegen to a copy from the register, with subsequent reads all coming from the GPR.

Evan

Hi Evan,

I'm not sure I follow the point about GPRs.

The plan is to transform all reads from a variable marked "register
long foo asm("reg")" into an intrinsic "@llvm.read_register("reg")".
If we also want the builtin, that's as easy as mapping
"__builtin_stack_pointer()" to "@llvm.read_register("SP")", so the
underlying implementation is the same.

So:

register unsigned long foo asm("SP"); // does nothing
long a = foo; // %a = call i32 @llvm.read_register("SP") -> DAG.CopyFromReg("SP")
long b = a;   // %0 = load %a, store %0, %b (in GPRs, since both a and b are C variables)
long c = foo; // %c = call i32 @llvm.read_register("SP") -> DAG.CopyFromReg("SP")

In any case, "foo" doesn't exist as a variable in the generated IR
(no alloca, etc.), and since we can't take its address (due to C
restrictions), that's fine. If you transfer the values to other
variables, in GPRs or on the stack/memory, then all bets are off, and
I believe this is the same in GCC; but *all* reads/writes of named
register variables should be represented by the read/write
intrinsics.

cheers,
--renato