RFC: Harvard architectures and default address spaces

Hello all, I’m looking into solving an AVR-specific issue and would love to hear peoples thoughts on how to best fix it.

Background

As you may or may not know, I maintain the in-tree AVR backend, which also happens to be (to the best of my knowledge) the first in-tree backend for a Harvard architecture.

In this architecture, code lives inside the ‘program memory’ space (numbered 1), whereas data lives inside RAM “data space”, which corresponds to the default address space 0. This is important because loads/stores use different instruction/pointer formats depending on the address space used, and so we need correct address space information available to the backend itself.

Due to the fact that address spaces in LLVM default to 0, this means that all global or constant variables default to living inside data space. This causes a few issues, including the fact that the SimplifyCFG pass creates switch lookup tables, which default to data space, causing us to emit broken table lookups and also wasting precious RAM.

The problem - emitting pointers as operands

NOTE: Feel free to skip to tl;dr of this section if you don’t care too much about the details

There are different instructions which require different fixups to be applied depending on whether pointers are located in data space or program space.

Take the ICALL instruction - it performs an indirect call to the pointer stored in the Z register.

We must first load the pointer into Z via the ‘ldi’ instruction. If the pointer is actually a pointer to a symbol, we need to emit a AVR_LO8_LDI_GS relocation, otherwise we emit a AVR_LO8_LDI relocation. There are a few other cases, but they’re irrelevant for this discussion.

We can quite easily look at the GlobalValue* that corresponds to the pointer if it is a symbol and select the fixup based on that, but that assumes that the address spaces are always correct.

Now, imagine that the pointer is actually a function pointer. LLVM does not expose any way to set address space in the IR for functions, but because it derived from GlobalValue, it does have an address space, and that address space defaults to zero. Because of this, every single function pointer in the AVR backend that gets loaded by the ldi will be associated with data space, and not program space, which it actually belongs to.

tl;dr functions default to address space zero, even though they are in a different space on Harvard architectures, which causes silent codegen bugs when we rely on the address space of a global value

Proposed solution

It would be impossible to set the address space correctly on creation of llvm::Function objects because at that point in the pipeline, we do not know the target architecture.

Because of this, I’d like to extend TargetTransformInfo with hooks that like getSwitchTableAddressSpace(), getFunctionAddressSpace(). I have already got a WIP patch for this here.

Once we have that information available to TargetTransformInfo, I propose we add a pass (very early in the codegen pipeline) that sets the address space of all functions to whatever value is specified in the hooks.

This works well because we don’t let frontends specify address space on functions, nor do we even mention that functions have address spaces in the language reference.

The downside of it it is that you wouldn’t normally expect something like an address space to change midway through the compilation process. To counter that however, I doubt the pre-codegen code cares much about the value of function address spaces, if at all.

On top of this, at the current point in time, Pointer<Function>::getAddressSpace is downright incorrect on any Harvard architecture, and for other architectures, the address space for functions will still stay the default of zero and will not change at all.

Does anybody know anything I haven’t thought of? Any reasons why this solution is suboptimal?

I don't have a strong opinion on any of this, but I just thought I'd pop by
and mention that, as a front-end author, I do have to know my target
architecture in order to generate correct IR, so the above approach doesn't
necessarily seem like a problem.

Hi, Dylan, Being able to specify the address space of functions, etc. is a good idea. Given the current design, you can’t put this into TargetTransformInfo, however, because nothing in TTI may be required for correctness (because your target’s implementation might not be available). Information required for correctness must go in DataLayout (because it must always be available). You should propose patches to add this information to DataLayout and to use that information in relevant places. -Hal

Hi,

We have an out-of-tree target that is also Harvard architecture, so we're interested in this as well.

So far, we've "solved" the issue by extending the datalayout so we can specify our "function pointer address space" there and then add/use it when necessary. Our current patch for this is small but hacky (too hacky to be upstreamed), but it has done the job for us for quite some time.

Very nice if something that solves this issue for real could make it into tree.

Regards,
Mikael

funptr.diff (72.6 KB)

Hello Hal,

Add this information to DataLayout and to use that information in relevant places.

This sounds like a much better/cleaner idea, thanks!

Very nice if something that solves this issue for real could make it into tree.

I’ll keep you in the loop, thanks for the attached patch by the way.

Cheers,
Dylan

Hi,

I've been maintaining an out-of-tree backend for several years now which
requires something similar.

I think our use case is slightly more complex (functions may live in one
of two different address spaces which require different width pointers
to live in the same module simulatenously). Our approach has been mostly
to enable Functions to have address spaces (as you note, they stem from
GlobalValue so that's easily extendable), and have Clang add hidden
attributes automatically to move things around to the user's current
default AS ahead of CodeGen.

For the core of LLVM though, I like the idea of extending the DataLayout
to hold the default AS information. That sounds like the correct place
where it would be easily accessible for everywhere that might need it
(I'm thinking emulation functions in particular).

- Simon

I’d suggest taking a look at the alloca address space changes, which were recently added based on a cleaned-up version of our code. We have a similar issue (function and data pointers have the same representation for us, but casting requires different handling[1]) and have considered adding address spaces to functions.

David

[1] Probably not relevant for this discussion, but if anyone cares: in our world we have 128-bit fat pointers contain base, bounds and permissions, and that 64-bit pointers that are implicitly relative to one of two special fat pointer registers, one for code and one for data. We must therefore handle 64-bit to 128-bit pointer casts differently depending on whether we’re casting code or data pointers. We currently do this with some fairly ugly hacks, but being able to put all functions in a different AS would make this much easier for us.

My experience of having the address space for functions (or function pointers) in the DataLayout i that when the .ll file is parsed we need to parse the DataLayout before any function declarations. That is needed because we want to attribute the functions with correct address space (according to DataLayout) when inserting them in the symbol table. An alternative would be to update address space info for functions after having parsed the DataLayout.

Is the DataLayout normally used when parsing the .ll file (or .bc)? Or would this be the first case of doing that?

Is it guaranteed that DataLayout is specified/parsed before function declaration, or that the DataLayout specification is context sensitive and only is valid for the following declarations?

What if there are several address spaces for functions? Or is that a silly thing that no one ever will use? Having the address space specified in the DataLayout would be insufficient, since we would need to attribute the functions separately, right?

I do not say that having the info in the DataLayout is a totally bad idea (since our out-of-tree target is using that trick), but I think it might impose some problems as well. And perhaps it isn't the most general solution.

/Björn

My experience of having the address space for functions (or function pointers) in the DataLayout i that when the .ll file is parsed we need to parse the DataLayout before any function declarations. That is needed because we want to attribute the functions with correct address space (according to DataLayout) when inserting them in the symbol table. An alternative would be to update address space info for functions after having parsed the DataLayout.

Is the DataLayout normally used when parsing the .ll file (or .bc)? Or would this be the first case of doing that?

Is it guaranteed that DataLayout is specified/parsed before function declaration, or that the DataLayout specification is context sensitive and only is valid for the following declarations?

The DataLayout is a required part of the .ll/.bc file. In the .ll file (*), it's the part of the module that looks like this:

   target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

it is global to the entire module and always available.

(*) It is true that you can write tests without specifying one of these, but in such cases, you just get the builtin default. For all real cases you'll need to have a target-appropriate DataLayout string.

What if there are several address spaces for functions? Or is that a silly thing that no one ever will use? Having the address space specified in the DataLayout would be insufficient, since we would need to attribute the functions separately, right?

I do not say that having the info in the DataLayout is a totally bad idea (since our out-of-tree target is using that trick), but I think it might impose some problems as well. And perhaps it isn't the most general solution.

If different functions might be in different address spaces, you'll need some other mechanism to set the address space (as a single default won't suffice). You might use source-level function attributes, for example.

  -Hal

From: Hal Finkel [mailto:hfinkel@anl.gov]
Sent: den 13 juli 2017 16:01
To: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>; David Chisnall
<David.Chisnall@cl.cam.ac.uk>; Dylan McKay <me@dylanmckay.io>
Cc: llvm-dev@lists.llvm.org; Carl Peto <carl.peto@me.com>
Subject: Re: [llvm-dev] RFC: Harvard architectures and default address
spaces

> My experience of having the address space for functions (or function
pointers) in the DataLayout i that when the .ll file is parsed we need to parse
the DataLayout before any function declarations. That is needed because we
want to attribute the functions with correct address space (according to
DataLayout) when inserting them in the symbol table. An alternative would
be to update address space info for functions after having parsed the
DataLayout.
>
> Is the DataLayout normally used when parsing the .ll file (or .bc)? Or would
this be the first case of doing that?
>
> Is it guaranteed that DataLayout is specified/parsed before function
declaration, or that the DataLayout specification is context sensitive and only
is valid for the following declarations?

The DataLayout is a required part of the .ll/.bc file. In the .ll file
(*), it's the part of the module that looks like this:

   target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

it is global to the entire module and always available.

My point was that the DataLayout isn't available inside the LLParser and the BitcodeReader, up until the point when it has been parsed. So I would not say "always available".

Both LLParser and BitcodeReader is for example using getAddressSpace(), both directly and maybe also indirectly through different interfaces (perhaps not on function pointers though). My concern is that maybe the incorrect address space will be used while parsing, and then it might be hard to find all places to fixup at a later stage. when having parsed the DataLayout and finding out what the default address space really is.
Or if there is some undocumented(?) rule that the DataLayout always comes before function declarations in the ll/bc file, then all functions can get the default address space attribute directly (as indicated by the DataLayout) when being parsed.

I think the whole idea from Dylan was to do this as a fixup after LLParser/BitcodeReader. I.e not trying to lookup a functions address space already when parsing the function declaration. So then the rule would be - do not use Pointer<Function>::getAddressSpace(), or PointerType::get() etc. during ll/bc parsing because it might give the wrong result. Maybe it is possible to assert on that?

We could of course give functions some kind of undefined address space value when parsing ll/bc and adding functions to the symbol table. That might help us catch situations when someone tries to fetch the address space for a function pointer before the set-default-address-space-as-given-by-datalayout-on-all-functions pass has executed.

From: Hal Finkel [mailto:hfinkel@anl.gov]
Sent: den 13 juli 2017 16:01
To: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>; David Chisnall
<David.Chisnall@cl.cam.ac.uk>; Dylan McKay <me@dylanmckay.io>
Cc: llvm-dev@lists.llvm.org; Carl Peto <carl.peto@me.com>
Subject: Re: [llvm-dev] RFC: Harvard architectures and default address
spaces

My experience of having the address space for functions (or function

pointers) in the DataLayout i that when the .ll file is parsed we need to parse
the DataLayout before any function declarations. That is needed because we
want to attribute the functions with correct address space (according to
DataLayout) when inserting them in the symbol table. An alternative would
be to update address space info for functions after having parsed the
DataLayout.

Is the DataLayout normally used when parsing the .ll file (or .bc)? Or would

this be the first case of doing that?

Is it guaranteed that DataLayout is specified/parsed before function

declaration, or that the DataLayout specification is context sensitive and only
is valid for the following declarations?

The DataLayout is a required part of the .ll/.bc file. In the .ll file
(*), it's the part of the module that looks like this:

    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

it is global to the entire module and always available.

My point was that the DataLayout isn't available inside the LLParser and the BitcodeReader, up until the point when it has been parsed. So I would not say "always available".

Both LLParser and BitcodeReader is for example using getAddressSpace(), both directly and maybe also indirectly through different interfaces (perhaps not on function pointers though). My concern is that maybe the incorrect address space will be used while parsing, and then it might be hard to find all places to fixup at a later stage. when having parsed the DataLayout and finding out what the default address space really is.
Or if there is some undocumented(?) rule that the DataLayout always comes before function declarations in the ll/bc file, then all functions can get the default address space attribute directly (as indicated by the DataLayout) when being parsed.

I think the whole idea from Dylan was to do this as a fixup after LLParser/BitcodeReader. I.e not trying to lookup a functions address space already when parsing the function declaration. So then the rule would be - do not use Pointer<Function>::getAddressSpace(), or PointerType::get() etc. during ll/bc parsing because it might give the wrong result. Maybe it is possible to assert on that?

We could of course give functions some kind of undefined address space value when parsing ll/bc and adding functions to the symbol table. That might help us catch situations when someone tries to fetch the address space for a function pointer before the set-default-address-space-as-given-by-datalayout-on-all-functions pass has executed.

I understand your point. We need to extend the syntax to enabling tagging functions with addressspaces directly (i.e. as a function attribute). DataLayout can't, as you noted, and shouldn't, be used to adjust defaults during parsing. The point of the defaults in DataLayout is to provide some way to supply correct defaults should optimizations need them.

  -Hal

My point was that the DataLayout isn’t available inside the LLParser and the BitcodeReader, up until the point when it has been parsed. So I would not say “always available”.

I have noticed this.

Normally, the frontend compilation adds functions, and them emits some form of code. There are no issues with normal .o or .s compilation, but if we emit IR, we lose the address space information because it is not actually a part of the textual IR form.

The standard LLParser process has no data layout available until the end of the process (because we create an empty temporary module for parsing).

Because of both of these problems, the best way to implement this is modify the textual IR form to support address space attributes on function declarations.

I think the whole idea from Dylan was to do this as a fixup after LLParser/BitcodeReader

Originally, that was the only workable solution I could come up with, but I think the solution Hal suggested is better.

As I understand it:

  • New functions created by the frontend will default to the address space specified in the DataLayout
  • When these functions get lowered to LL, we emit the address space on the function like ‘addrspace 1’ if it is not the default address space 0
  • When parsing, all functions will either have no addrspace attribute and default to ‘0’, or take the addrspace specified

This will let frontends specify explicit address spaces manually (which would support Mikael and others usecases), and also keep track of the address space throughout the pipeline so that we don’t make any incorrect assumptions of the space.