GHC, aliases, and LLVM HEAD

Peter: Please feel free to correct me if there are any inaccuracies
       below.

For a while now LLVM has started rejecting aliases referring to things
other than definitions[1]. This unfortunately breaks GHC's LLVM code
generator, which refers to most symbols through aliases. This is done in
two situations,

  1. As place-holders for external symbols. As the code generator does
     not know the type of these symbols until the point of usage (nor
     does it need to), i8* aliases are defined at the end of the
     compilation unit,

         @newCAF = external global i8
         @newCAF$alias = alias private i8* @newCAF

     and functions in the current compilation unit calling `newCAF` invoke
     it through `@newCAF$alias`,

         ...
         %lnYi = bitcast i8* @newCAF$alias to i8* (i8*, i8*)*
         ...

  2. As place-holders for local symbols. All symbol references in
     emitted functions are replaced with references to aliases. This is
     done so that the compiler can emit LLVM IR definitions for
     functions without waiting for symbols they reference to become
     available (as our internal representation, Core, allows references
     in any order without forward declarations). This theoretically
     offers a performance improvement and somewhat simplifies the code
     generator. Here we emit aliases like,

         @SWn_srt$alias = alias private i8* bitcast (%SWn_srt_struct* @SWn_srt to i8*)

     again, using the `$alias` in all references,

Unfortunately, recent LLVMs reject both of these uses. The first is
rejected as aliases can no longer reference items other than
definitions, e.g.

    opt: hi.ll:414:36: error: Alias must point to function or variable
    @SWn_srt$alias = alias private i8* bitcast (%SWn_srt_struct* @SWn_srt to i8*)

The second is rejected as aliasees must[2] be global objects, which
bitcasts are not,

    /home/ben/trees/root-llvm-head/bin/opt: utils/hpc/dist-install/build/HpcParser.ll:44714:37: error: Alias must point to function or variable
    @c3rB_str$alias = alias private i8* bitcast (%c3rB_str_struct* @c3rB_str to i8*)
                                        ^

Is our (ab)use of aliases reasonable? If so, what options do we have to
fix this before LLVM 3.5? If not, what other mechanisms are there for
addressing the use-cases above in GHC?

Thanks,

- Ben

[1] Reject alias to undefined symbols in the verifier. · llvm-mirror/llvm@38048cd · GitHub
[2] llvm/LLParser.cpp at 68b0d1d2b47f1be8eec2ce57c8119906c354ccd8 · llvm-mirror/llvm · GitHub

Peter: Please feel free to correct me if there are any inaccuracies
       below.

For a while now LLVM has started rejecting aliases referring to things
other than definitions[1].

We started checking for it. Aliases are just another label in an
object file. The linker itself doesn't know they exist and therefore
there is no way to represent an alias from foo to bar if bar is
undefined.

This unfortunately breaks GHC's LLVM code
generator, which refers to most symbols through aliases. This is done in
two situations,

  1. As place-holders for external symbols. As the code generator does
     not know the type of these symbols until the point of usage (nor
     does it need to), i8* aliases are defined at the end of the
     compilation unit,

         @newCAF = external global i8
         @newCAF$alias = alias private i8* @newCAF

     and functions in the current compilation unit calling `newCAF` invoke
     it through `@newCAF$alias`,

         ...
         %lnYi = bitcast i8* @newCAF$alias to i8* (i8*, i8*)*
         ...

Sorry, I don't see what this buys you. The types of newCAF and
newCAF$alias are the same.

  2. As place-holders for local symbols. All symbol references in
     emitted functions are replaced with references to aliases. This is
     done so that the compiler can emit LLVM IR definitions for
     functions without waiting for symbols they reference to become
     available (as our internal representation, Core, allows references
     in any order without forward declarations). This theoretically
     offers a performance improvement and somewhat simplifies the code
     generator. Here we emit aliases like,

         @SWn_srt$alias = alias private i8* bitcast (%SWn_srt_struct* @SWn_srt to i8*)

     again, using the `$alias` in all references,

That should also work in llvm IR. You can create a function without a
body or a GlobalVariable without an initializer and add it afterwards.
Check for example what llvm-as does when a variable or a function is
used before it is defined. Doesn't that work for you?

Unfortunately, recent LLVMs reject both of these uses. The first is
rejected as aliases can no longer reference items other than
definitions, e.g.

    opt: hi.ll:414:36: error: Alias must point to function or variable
    @SWn_srt$alias = alias private i8* bitcast (%SWn_srt_struct* @SWn_srt to i8*)

The second is rejected as aliasees must[2] be global objects, which
bitcasts are not,

    /home/ben/trees/root-llvm-head/bin/opt: utils/hpc/dist-install/build/HpcParser.ll:44714:37: error: Alias must point to function or variable
    @c3rB_str$alias = alias private i8* bitcast (%c3rB_str_struct* @c3rB_str to i8*)
                                        ^

Is our (ab)use of aliases reasonable? If so, what options do we have to
fix this before LLVM 3.5? If not, what other mechanisms are there for
addressing the use-cases above in GHC?

It looks fairly likely llvm will accept arbitrary expressions as
aliasees again (see thread on llvmdev), but the restrictions inherent
in what aliases are at the object level will remain, just reworded
a bit. For example, we will have something along the lines of "the
aliasee expression cannot contain an undefined GlobalValue".

Cheers,
Rafael

Rafael Espíndola <rafael.espindola@gmail.com> writes:

For a while now LLVM has started rejecting aliases referring to things
other than definitions[1].

We started checking for it. Aliases are just another label in an
object file. The linker itself doesn't know they exist and therefore
there is no way to represent an alias from foo to bar if bar is
undefined.

Sure. I think the only reason our use of aliases worked previously was
that the optimizer elided them long before they could make it into an
object file.

  1. As place-holders for external symbols. As the code generator does
     not know the type of these symbols until the point of usage (nor
     does it need to), i8* aliases are defined at the end of the
     compilation unit,

As it turns out this wasn't quite right; there are some cases where we
don't know the type of the reference even at the point of usage (namely
when we refer to a function's entrypoint label without calling it, as
only C--'s call node contains the signature).

         @newCAF = external global i8
         @newCAF$alias = alias private i8* @newCAF

     and functions in the current compilation unit calling `newCAF` invoke
     it through `@newCAF$alias`,

         ...
         %lnYi = bitcast i8* @newCAF$alias to i8* (i8*, i8*)*
         ...

Sorry, I don't see what this buys you. The types of newCAF and
newCAF$alias are the same.

It seems you are right, you could just define
the external symbol,

    @newCAF$alias = external global i8

Unfortunately this still leaves the problem of local symbols.

  2. As place-holders for local symbols. All symbol references in
     emitted functions are replaced with references to aliases. This is
     done so that the compiler can emit LLVM IR definitions for
     functions without waiting for symbols they reference to become
     available (as our internal representation, Core, allows references
     in any order without forward declarations). This theoretically
     offers a performance improvement and somewhat simplifies the code
     generator. Here we emit aliases like,

         @SWn_srt$alias = alias private i8* bitcast (%SWn_srt_struct* @SWn_srt to i8*)

     again, using the `$alias` in all references,

That should also work in llvm IR. You can create a function without a
body or a GlobalVariable without an initializer and add it afterwards.

I'm not sure I follow. If I attempt to compile,

    declare i32 @main()
    define i32 @main() {
        ret i32 0
    }

It fails with,

    llc: test.ll:3:12: error: invalid redefinition of function 'main'
    define i32 @main() {
               ^

Check for example what llvm-as does when a variable or a function is
used before it is defined. Doesn't that work for you?

The problem here is that we don't know the type of the symbol at the
point of use so I need to assume it is something (e.g. i8*). Take for
instance the following example,

    define i32 @main() {
        ; We don't know the type of f a priori, thus we assume
        ; it is i8*
        %f = bitcast i8* @f$alias to i32 ()*
        call i32 %f()
        ret i32 0
    }

Say then later in GHC's Core representation, we get a definition for
`f`. We have two ways of dealing with this,

    1. Declare it as @f and create an alias as we currently do,

           define i32 @f() {
               ret i32 0
           }
           @f$alias = alias private i8* @f

       but then we fail with recent LLVMs

    2. Declare it as @f$alias directly as I think you might be suggesting

           define i32 @f$alias() {
               ret i32 0
           }

       but then we get a type mismatch at the point of usage as we claim
       that @f$alias is of type i8*.

Is our (ab)use of aliases reasonable? If so, what options do we have to
fix this before LLVM 3.5? If not, what other mechanisms are there for
addressing the use-cases above in GHC?

It looks fairly likely llvm will accept arbitrary expressions as
aliasees again (see thread on llvmdev), but the restrictions inherent
in what aliases are at the object level will remain, just reworded
a bit. For example, we will have something along the lines of "the
aliasee expression cannot contain an undefined GlobalValue".

Alright. I'll put my GHC work aside until this is resolved in that
case. My current goal is to implement tables-next-to-code using the
recently merged prefix data syntax and symbol offset support that I sent
to the list yesterday. It would be great if you could ping me when this
is resolved so I know when this work can be continued.

Thanks,

- Ben

For a while now LLVM has started rejecting aliases referring to things
other than definitions[1].

We started checking for it. Aliases are just another label in an
object file. The linker itself doesn't know they exist and therefore
there is no way to represent an alias from foo to bar if bar is
undefined.

Sure. I think the only reason our use of aliases worked previously was
that the optimizer elided them long before they could make it into an
object file.

If that is the case, you should be able to just directly replace alias
with aliasee, no? In general you should not depend on an optimization
being run to produce correct code.

  2. As place-holders for local symbols. All symbol references in
     emitted functions are replaced with references to aliases. This is
     done so that the compiler can emit LLVM IR definitions for
     functions without waiting for symbols they reference to become
     available (as our internal representation, Core, allows references
     in any order without forward declarations). This theoretically
     offers a performance improvement and somewhat simplifies the code
     generator. Here we emit aliases like,

         @SWn_srt$alias = alias private i8* bitcast (%SWn_srt_struct* @SWn_srt to i8*)

     again, using the `$alias` in all references,

That should also work in llvm IR. You can create a function without a
body or a GlobalVariable without an initializer and add it afterwards.

I'm not sure I follow. If I attempt to compile,

    declare i32 @main()
    define i32 @main() {
        ret i32 0
    }

It fails with,

    llc: test.ll:3:12: error: invalid redefinition of function 'main'
    define i32 @main() {

There are no redeclarations in LLVM IR. You can just put top level
entities in any order:

define void @f() {
  call void @g()
  call void @h()
  ret void
}
declare void @g()
define void @h() {
  ret void
}

The problem here is that we don't know the type of the symbol at the
point of use so I need to assume it is something (e.g. i8*). Take for
instance the following example,

    define i32 @main() {
        ; We don't know the type of f a priori, thus we assume
        ; it is i8*
        %f = bitcast i8* @f$alias to i32 ()*
        call i32 %f()
        ret i32 0
    }

Instead of having an f$alias, you could just have produced a

declare i32 @f()

since you know the type it is being called with.

Say then later in GHC's Core representation, we get a definition for
`f`. We have two ways of dealing with this,

    1. Declare it as @f and create an alias as we currently do,

           define i32 @f() {
               ret i32 0
           }
           @f$alias = alias private i8* @f

       but then we fail with recent LLVMs

    2. Declare it as @f$alias directly as I think you might be suggesting

           define i32 @f$alias() {
               ret i32 0
           }

       but then we get a type mismatch at the point of usage as we claim
       that @f$alias is of type i8*.

No, the idea is to not have f$alias at all. Once you find that f has
to be defined, you just set its body (which turns it into a
definition).

I guess a better example might have been clang compiling:

void f(void);
void g(void) {
  f();
}
void f(void) {
}

In here f will be converted from a declaration to a definition.

Cheers,
Rafael

Rafael Espíndola <rafael.espindola@gmail.com> writes:

Sure. I think the only reason our use of aliases worked previously was
that the optimizer elided them long before they could make it into an
object file.

If that is the case, you should be able to just directly replace alias
with aliasee, no? In general you should not depend on an optimization
being run to produce correct code.

I absolutely agree. As far as I understand we ended up with the current
situation as no one could figure out a better way to resolve the typing
issue I clarify below. I'm trying to find a better solution.

That should also work in llvm IR. You can create a function without a
body or a GlobalVariable without an initializer and add it afterwards.

I'm not sure I follow. If I attempt to compile,

    declare i32 @main()
    define i32 @main() {
        ret i32 0
    }

It fails with,

    llc: test.ll:3:12: error: invalid redefinition of function 'main'
    define i32 @main() {

There are no redeclarations in LLVM IR. You can just put top level
entities in any order:

Alright, I misunderstood your point in that case. Thanks for the
clarification!

The problem here is that we don't know the type of the symbol at the
point of use so I need to assume it is something (e.g. i8*). Take for
instance the following example,

    define i32 @main() {
        ; We don't know the type of f a priori, thus we assume
        ; it is i8*
        %f = bitcast i8* @f$alias to i32 ()*
        call i32 %f()
        ret i32 0
    }

Instead of having an f$alias, you could just have produced a

declare i32 @f()

since you know the type it is being called with.

Bah, that was a poor choice of example on my part. A better one might be
the following:

Our C-- representation might refer to `f` without calling it (e.g. when
building a thunk). In this case we don't have access to the function's
signature, which we can only infer from a `call` node. For this reason,
we currently demote all pointers to some common type (i8* currently) so
we don't run into LLVM's type system in cases where we can't infer a
value's type.
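
As a rough sketch of such a reference (all names here are hypothetical,
and whether the alias line is accepted depends on the LLVM version, as
discussed in this thread), the address of `f` is stored without any
call node in sight:

    define void @build_thunk(i8** %payload) {
        ; Only the address of f is needed; nothing here tells us its
        ; signature, so the reference goes through an i8* placeholder.
        store i8* @f$alias, i8** %payload
        ret void
    }

    ; Emitted once the definition, and hence the real type, is known.
    define i32 @f() {
        ret i32 0
    }
    @f$alias = alias private i8* bitcast (i32 ()* @f to i8*)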

No, the idea is to not have f$alias at all. Once you find that f has
to be defined, you just set its body (which turns it into a
definition).

See above for why I believe we need the alias.

I guess a better example might have been clang compiling:

void f(void);
void g(void) {
  f();
}
void f(void) {
}

In here f will be converted from a declaration to a definition.

That is to say that nothing is emitted until the entire compilation unit
is parsed (so we know which items are definitions and which are declarations)?

Cheers,

- Ben

To maybe clarify a bit - this is about a pass in GHC that translates an
intermediate program representation called "Cmm" into LLVM code. The
trouble stems from the fact that

1) Cmm is untyped, so whenever we see a label it might refer to data or
   functions, of pretty much arbitrary type

2) The Cmm code is generated iteratively, and as a good consumer we
   would like to "stream" the LLVM code as well. These intermediate
   representations can easily go up to millions of lines, after all.

This all works out fairly well, with the only stumbling block being the
types. After all, at the point where we emit a reference to a label we
don't know whether it is going to be defined later on in the output
file. We especially have no idea what its LLVM type is going to be.

To get around that we use aliases to essentially "strip" type
information from label references: If we refer to "label$alias" instead
of "label", we are still free to define "label" later on in whatever way
we see fit. Then we just set "label$alias" to a suitable cast, and let
the LLVM infrastructure handle the resolution.
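
A minimal sketch of the order in which things get streamed out, using a
made-up data label (`@label`, `%label_struct` and `@get_label` are
purely illustrative, and the alias line is the part that depends on
which LLVM version is in use):

    ; Emitted first: a use of the label, before its definition is seen.
    define i8* @get_label() {
        ret i8* @label$alias
    }

    ; Emitted later: the definition, with whatever type turns out to fit.
    %label_struct = type <{ i64, i64 }>
    @label = internal constant %label_struct <{ i64 1, i64 2 }>

    ; Emitted last: the alias, set to a suitable cast.
    @label$alias = alias private i8* bitcast (%label_struct* @label to i8*)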

That being said - there are actually a number of possible solutions
here, and we are currently trying to settle on what the "right thing to
do" is. In case we are being too tricky here, we might try to instead do
two passes over the output file, or scrap streaming altogether. All
depends on whether or not it is likely that this kind of usage remains
possible in future LLVM versions.

Greetings,
  Peter Wortmann

Couldn't you then conjure up an i8* global (or whatever other placeholder
you want) and when you discover that it is in fact a function, replace all
the uses with a bitcast from the actual function to an i8*?

I believe GHC streams textual LLVM assembly to a file and invokes llc
separately.

I could imagine a system that streams out a function definition at a time,
where each definition is preceded by declarations of all symbols that it
needs, with types inferred from the call nodes.

It looks fairly likely llvm will accept arbitrary expressions as
aliasees again (see thread on llvmdev), but the restrictions inherent
in what aliases are at the object level will remain, just reworded
a bit. For example, we will have something along the lines of "the
aliasee expression cannot contain an undefined GlobalValue".

And this is in: r210062.

Let us know how it looks from GHC's point of view.

Cheers,
Rafael

I think we might be able to relax our restrictions against aliases
referring to declarations if the alias is private. If the alias is
private, then the label never appears in the object file. The alias is
merely a Constant with an internal name. What do you think?

I think we might be able to relax our restrictions against aliases referring
to declarations if the alias is private. If the alias is private, then the
label never appears in the object file. The alias is merely a Constant with
an internal name. What do you think?

It feels a bit too fuzzy. Some reasons:

* A private symbol does show up in the symbol table if we end up
needing a relocation to it.
* In general, what would it mean to use an alias to an undefined symbol
on ELF and COFF? It would be bad to be in a situation where the program
works if the use is optimized out but codegen asserts if it is not.
* It is not clear what it would buy you. With what we have on trunk
there is almost no consistency checking until we get to the verifier, so
you can even eagerly create aliases and delete the ones you don't need
before running the verifier.

Cheers,
Rafael

Reid Kleckner <rnk@google.com> writes:

Any word on this? As far as I can tell we are stuck on the GHC side
until we can alias declarations due to the existence of untyped external
references in the C-- representation.

I still have the same objections to having aliases to declarations when
they are not supported in the object file. Since they are not
supported in the object file, they must be removed somewhere along the
way and it is not clear the burden should be in the LLVM to Obj
transition instead of in the transition from c-- to llvm.

It is also still not clear why you need aliases at all. LLVM has
declarations and bitcasts, so like c-- it is not type safe, but unlike
c-- (from your description) casts are needed.

Cheers,
Rafael

Rafael Espíndola <rafael.espindola@gmail.com> writes:

Any word on this? As far as I can tell we are stuck on the GHC side
until we can alias declarations due to the existence of untyped external
references in the C-- representation.

I still have the same objections to having aliases to declarations when
they are not supported in the object file. Since they are not
supported in the object file, they must be removed somewhere along the
way and it is not clear the burden should be in the LLVM to Obj
transition instead of in the transition from c-- to llvm.

It is also still not clear why you need aliases at all. LLVM has
declarations and bitcasts, so like c-- it is not type safe, but unlike
c-- (from your description) casts are needed.

The aliases are necessary because LLVM makes us provide the type of the
value we are trying to cast: we can't construct a `bitcast` expression
for a symbol until we know its type. For this reason, we need to create
aliases hiding the true types of our definitions, allowing us to assume
some common type (e.g. `i8*`). As you said, these aliases are just
bitcasts; they are needed, however, because they can be created at the
point when we know the type of the symbol.

Over the weekend I reworked [1] the handling of aliases in the LLVM
backend, moving them from the point of usage to the compilation unit the
symbol is defined within. As all aliases now point to definitions this
avoids upsetting LLVM. The code needs some cleaning up but the approach
seems sound, so it is fair to say that we won't be needing aliases to
declarations after all.
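
To illustrate the general shape only (this is not a copy of what is on
the branch in [1]; the names and the choice of a non-private alias are
assumptions made for the sake of the example), the defining unit could
carry the alias next to the definition it points to:

    ; Defining unit: the alias refers to a definition in the same module.
    define i32 @f() {
        ret i32 0
    }
    @f$alias = alias i8* bitcast (i32 ()* @f to i8*)

while a unit that merely uses the symbol needs no alias at all, only an
untyped external declaration:

    ; Using unit (a separate module): the symbol is declared as an i8.
    @f$alias = external global i8

    define void @use(i8** %slot) {
        store i8* @f$alias, i8** %slot
        ret void
    }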

At this point the last(?) LLVM feature that we lack is symbol
offsets/prefix data. As has been discussed earlier [2], the current
realization of these concepts is quite suboptimal for our (and arguably
most other) use-cases. Reid's proposal on this matter sounded quite
promising.

- Ben

[1] https://github.com/bgamari/ghc/compare/llvm-3.5-new
[2] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-May/073260.html