MachineOperand: GlobalAddress vs. ExternalSymbol

Hi,
here I am again with a "why is this so" kind of question. Among the different
types of MachineOperand there are MO_ExternalSymbol and MO_GlobalAddress.

For MO_GlobalAddress, we can get useful information from the getGlobal()
method, which returns GlobalValue*. Wouldn't it be better if
MO_GlobalAddress were called MO_GlobalValue, for consistency?

Second, MO_ExternalSymbol is used for storing the name of an external
variable/function, right? Why is it not possible to use MO_GlobalAddress,
where the returned GlobalValue* has isExternal set to true?
GlobalValue::getName would return the name of the symbol.

- Volodya

> Hi,
> here I am again with a "why is this so" kind of question. Among the different
> types of MachineOperand there are MO_ExternalSymbol and MO_GlobalAddress.
>
> For MO_GlobalAddress, we can get useful information from the getGlobal()
> method, which returns GlobalValue*. Wouldn't it be better if
> MO_GlobalAddress were called MO_GlobalValue, for consistency?

I think that it could be reasonable to make this change, but it is also
reasonable to keep it named MO_GlobalAddress (the operand is the *address*
of the global, not its *value*).

The MachineInstr/Operand classes are a fertile source of warts like this,
which we are continually refactoring into a simpler interface. In this
particular case I don't think it's super important to make the change.

> Second, MO_ExternalSymbol is used for storing the name of an external
> variable/function, right? Why is it not possible to use MO_GlobalAddress,
> where the returned GlobalValue* has isExternal set to true?
> GlobalValue::getName would return the name of the symbol.

Using the GlobalValue is certainly the preferred way if you have it.
MO_ExternalSymbol should only be used for functions that might not
actually exist in the LLVM module for the function. In particular, this
would include any functions in a code-generator specific runtime library
and malloc/free. The X86 code generator compiles floating point modulus
into fmod calls, and 64-bit integer div/rem into runtime library calls.

If you have a GlobalValue*, please do use it, but if it's one of these
cases where the called function might not exist in the LLVM view of the
world, then use an ExternalSymbol.
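
The rule above can be sketched with a toy model. This is not the LLVM API: `ToyGlobal`, `ToyModule`, `CalleeOperand` and `makeCallee` are hypothetical names invented for illustration. The sketch only shows the decision "use the module's object if there is one, otherwise fall back to a bare symbol name".

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a global (function or variable) known to the module.
struct ToyGlobal {
    std::string name;
};

// Hypothetical stand-in for a module: just a name -> global lookup table.
struct ToyModule {
    std::map<std::string, ToyGlobal> globals;

    // Returns the global with this name, or nullptr if the module has none.
    const ToyGlobal *lookup(const std::string &name) const {
        std::map<std::string, ToyGlobal>::const_iterator it = globals.find(name);
        return it == globals.end() ? nullptr : &it->second;
    }
};

// Toy operand with the two flavours discussed in the thread.
struct CalleeOperand {
    enum Kind { GlobalAddress, ExternalSymbol } kind;
    const ToyGlobal *global;  // set only when kind == GlobalAddress
    std::string symbol;       // set only when kind == ExternalSymbol
};

// Prefer the module's own object; use a bare symbol name only when the callee
// (for example a runtime-library routine) is not declared in the module.
CalleeOperand makeCallee(const ToyModule &m, const std::string &name) {
    if (const ToyGlobal *g = m.lookup(name))
        return CalleeOperand{CalleeOperand::GlobalAddress, g, ""};
    return CalleeOperand{CalleeOperand::ExternalSymbol, nullptr, name};
}
```

With a module that declares only `printf`, `makeCallee(m, "printf")` yields a GlobalAddress operand, while `makeCallee(m, "fmod")` yields an ExternalSymbol operand carrying just the name.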

-Chris

Chris Lattner wrote:

> > Second, MO_ExternalSymbol is used for storing the name of an external
> > variable/function, right? Why is it not possible to use MO_GlobalAddress,
> > where the returned GlobalValue* has isExternal set to true?
> > GlobalValue::getName would return the name of the symbol.
>
> Using the GlobalValue is certainly the preferred way if you have it.
> MO_ExternalSymbol should only be used for functions that might not
> actually exist in the LLVM module for the function. In particular, this
> would include any functions in a code-generator specific runtime library
> and malloc/free. The X86 code generator compiles floating point modulus
> into fmod calls, and 64-bit integer div/rem into runtime library calls.

And why isn't it possible to just make those functions known to LLVM? After
all, *I think*, if this function is to be called, it should be declared in
assembler, and so you have to pass some information about those functions to
the code printer. (Of course, it's possible to just directly print the
declarations, but that's scary).

There's another issue I don't understand. The module consists of functions and
constants. I'd expect that external function declarations are also constants,
with appropriate type. However, it seems they are not included in
[Module::gbegin(), Module::gend()]; instead, they are Function objects with
isExternal set to true.

To me this seems a bit confusing -- it would be clearer if there were plain
functions with bodies and everything else were GlobalValue.

Another question is about SymbolTable. Is it true that it's a mapping from
names to objects in Module, and that all objects accessible via SymbolTable
are either in the list of functions or in the list of global values?

> If you have a GlobalValue*, please do use it, but if it's one of these
> cases where the called function might not exist in the LLVM view of the
> world, then use an ExternalSymbol.

OK.

Thanks,
Volodya

> > actually exist in the LLVM module for the function. In particular, this
> > would include any functions in a code-generator specific runtime library
> > and malloc/free. The X86 code generator compiles floating point modulus
> > into fmod calls, and 64-bit integer div/rem into runtime library calls.
>
> And why isn't it possible to just make those functions known to LLVM? After
> all, *I think*, if this function is to be called, it should be declared in
> assembler, and so you have to pass some information about those functions to
> the code printer. (Of course, it's possible to just directly print the
> declarations, but that's scary).

If you wanted to do that, it would be fine. Be aware that the code
generators are set up as function passes though, so you would have to
insert all function prototypes in the doInitialization(...) method of the
function pass: you can't just do it on the fly from runOn*Function.

The real reason that we aren't doing this currently is that we don't want
code generators to be hacking on the LLVM module. This greatly interferes
with JIT-style multi-pass optimization and other things. Unfortunately,
we are a long way from this though, as the lowering passes hack on the
LLVM and other stuff does as well. Unless you have a good reason to do
so, I would suggest trying to use MO_ExternalSymbol just to make future
refactoring easier.

> There's another issue I don't understand. The module consists of functions and
> constants. I'd expect that external function declarations are also constants,
> with appropriate type. However, it seems they are not included in
> [Module::gbegin(), Module::gend()]; instead, they are Function objects with
> isExternal set to true.

Module::gbegin/gend iterate over the global variables, and ::begin/end
iterate over the functions, some of which may be prototypes. Function
prototypes aren't really any more "constant" than other functions are.
Function prototypes do have correct types on them though.

> To me this seems a bit confusing -- it would be clearer if there were plain
> functions with bodies and everything else were GlobalValue.

The reason that we don't want to do this is that it makes it more
difficult to create a function and then fill in its body. Currently when
you create a function, you get a prototype. When you fill in its body,
you now have a defined function. In your scheme, the function prototype
and defined function objects would be different: to go from one to the
other, you would have to delete the object and reallocate it.
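
A toy sketch of that design choice, assuming a hypothetical `ToyFunction` class (not the real LLVM `Function`): the same object serves as a prototype while it has no blocks and as a definition once it has some, so anything already pointing at it stays valid.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a basic block.
struct ToyBlock {
    std::string label;
};

// Hypothetical stand-in for a function: a prototype is simply a function
// that has no blocks yet.
class ToyFunction {
public:
    explicit ToyFunction(std::string name) : name_(std::move(name)) {}

    // True while the function is only a prototype (no body).
    bool isExternal() const { return blocks_.empty(); }

    // Adding the first block turns the prototype into a definition in place.
    void appendBlock(const std::string &label) {
        blocks_.push_back(ToyBlock{label});
    }

    const std::string &name() const { return name_; }

private:
    std::string name_;
    std::vector<ToyBlock> blocks_;
};
```

A pointer taken while the function is still a prototype (say, from a call site) observes the definition after `appendBlock("entry")` without being updated. With separate prototype and definition types, every such pointer would have to be rewritten.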

> Another question is about SymbolTable. Is it true that it's a mapping from
> names to objects in Module, and that all objects accessible via SymbolTable
> are either in the list of functions or in the list of global values?

Yup. There are function-local symbol tables as well.

I wouldn't recommend depending too much on the names, because LLVM has an
unusual mechanism where it allows objects with different types to have
the same name. This means you can have:

int %foo(int %X) { ret int %X }
float %foo(float %X) { ret float %X }

In the context of a code generator, you should use the NameMangler
interface to make everything just work.

If you're doing something else and think you need the symbol table, please
let me know. Clients of the SymbolTable class are extremely rare (by
design). The SymbolTable class is mostly an internal class that is
automagically used by the system to provide naming invariants and allow
efficient lookup for the rare clients that need it.

-Chris

The gbegin/gend versus begin/end naming confused Vladimir, and I remember it
confusing me when I was reviewing LLVM core a few months ago. Would it be
worthwhile to consider naming these globals_begin/globals_end and
functions_begin/functions_end, so their use is completely clear?

Reid

Sure, they can be renamed. For commonly used methods like this I would
like to keep them relatively terse though. How about gvbegin/end and
funcbegin/end? Would anyone like to make a patch? :)

-Chris

Sure, that's reasonable. I'll make the patch, but later this weekend ..
bigger fish to fry :)

Reid.

> Sure, that's reasonable. I'll make the patch, but later this weekend ..
> bigger fish to fry :)

Oh yes definitely. More to the point, do you think that a change like
this would really help?

-Chris

Chris Lattner wrote:

> > And why isn't it possible to just make those functions known to LLVM?
> > After all, *I think*, if this function is to be called, it should be
> > declared in assembler, and so you have to pass some information about
> > those functions to the code printer. (Of course, it's possible to just
> > directly print the declarations, but that's scary).
>
> If you wanted to do that, it would be fine. Be aware that the code
> generators are set up as function passes though, so you would have to
> insert all function prototypes in the doInitialization(...) method of the
> function pass: you can't just do it on the fly from runOn*Function.

Yes, I understand that.

> The real reason that we aren't doing this currently is that we don't want
> code generators to be hacking on the LLVM module. This greatly interferes
> with JIT-style multi-pass optimization and other things. Unfortunately,
> we are a long way from this though, as the lowering passes hack on the
> LLVM and other stuff does as well. Unless you have a good reason to do
> so, I would suggest trying to use MO_ExternalSymbol just to make future
> refactoring easier.

I think I more or less understand this motivation.

> > There's another issue I don't understand. The module consists of
> > functions and constants. I'd expect that external function declarations
> > are also constants, with appropriate type. However, it seems they are not
> > included in [Module::gbegin(), Module::gend()]; instead, they are Function
> > objects with isExternal set to true.
>
> Module::gbegin/gend iterate over the global variables, and ::begin/end
> iterate over the functions, some of which may be prototypes. Function
> prototypes aren't really any more "constant" than other functions are.

I disagree. Say there's a declaration of the external function "printf". Then
it's just a constant global address. In assembler it will be

   extern printf: label;

which is not that different from the assembler for other constants. For
example, for an external data reference I have to produce the same assembler.

BTW, there's an inconsistency in how the X86 backend handles constants and
functions. Consider:

%.str_1 = constant [11 x sbyte] c"'%c' '%c'\0A\00"

implementation ; Functions:

declare int %printf(sbyte*, ...)

int %main() {
entry:
        %tmp.0.i = call int (sbyte*, ...)*
        %printf( sbyte* getelementptr ([11 x sbyte]* %.str_1, long 0, l
        ret int 0
}

The assembler produced by the X86 backend is:

        call printf
........
        .globl _2E_str_1
        .data
        .align 1
        .type _2E_str_1,@object
        .size _2E_str_1,11
_2E_str_1:

That is, the name of ".str_1" is mangled, but the name of the function is not.
I don't see the reason for handling those two kinds of names differently.

> > To me this seems a bit confusing -- it would be clearer if there were plain
> > functions with bodies and everything else were GlobalValue.
>
> The reason that we don't want to do this is that it makes it more
> difficult to create a function and then fill in its body. Currently when
> you create a function, you get a prototype. When you fill in its body,
> you now have a defined function. In your scheme, the function prototype
> and defined function objects would be different: to go from one to the
> other, you would have to delete the object and reallocate it.

Can't you store all functions in the list of global values? That would be
quite clear: all top-level module elements are global values, and are present
in the global list.

The functions list can contain either functions both with and without bodies,
or only those with bodies. In the latter case, when you create a function, it's
added only to the global values list. When you add the first basic block, it's
also added to the list of functions.

> > Another question is about SymbolTable. Is it true that it's a mapping
> > from names to objects in Module, and that all objects accessible via
> > SymbolTable are either in the list of functions or in the list of global
> > values?
>
> Yup. There are function-local symbol tables as well.
>
> I wouldn't recommend depending too much on the names, because LLVM has an
> unusual mechanism where it allows objects with different types to have
> the same name. This means you can have:
>
> int %foo(int %X) { ret int %X }
> float %foo(float %X) { ret float %X }
>
> In the context of a code generator, you should use the NameMangler
> interface to make everything just work.
>
> If you're doing something else and think you need the symbol table, please
> let me know. Clients of the SymbolTable class are extremely rare (by
> design). The SymbolTable class is mostly an internal class that is
> automagically used by the system to provide naming invariants and allow
> efficient lookup for the rare clients that need it.

Thanks for the explanation. I don't have a use for SymbolTable yet, I was just
wondering if I have to use it for something ;)

> > There's another issue I don't understand. The module consists of
> > functions and constants. I'd expect that external function declarations
> > are also constants, with appropriate type. However, it seems they are not
> > included in [Module::gbegin(), Module::gend()]; instead, they are Function
> > objects with isExternal set to true.
>
> Module::gbegin/gend iterate over the global variables, and ::begin/end
> iterate over the functions, some of which may be prototypes. Function
> prototypes aren't really any more "constant" than other functions are.

> I disagree. Say there's a declaration of the external function "printf". Then
> it's just a constant global address. In assembler it will be
>
>    extern printf: label;
>
> which is not that different from the assembler for other constants. For
> example, for an external data reference I have to produce the same
> assembler.

Of course, global variable addresses are link-time constants. This is the
motivation for the ConstantPointerRef class, and is why GlobalValue will
eventually derive from Constant (see earlier discussion).

My point was that function prototypes are no different than functions with
bodies in this respect. Also, you have to be careful to distinguish
between the fact that the *address* of a global is always a constant,
regardless of whether it is a global variable, function, internal, or
external... but the *contents* of a global variable are only sometimes
constant (indicated by GlobalVariable::isConstant()).

> BTW, there's an inconsistency in how the X86 backend handles constants and
> functions. Consider:
>
> %.str_1 = constant [11 x sbyte] c"'%c' '%c'\0A\00"
>
> implementation ; Functions:
>
> declare int %printf(sbyte*, ...)
>
> int %main() {
> entry:
>         %tmp.0.i = call int (sbyte*, ...)*
>         %printf( sbyte* getelementptr ([11 x sbyte]* %.str_1, long 0, l
>         ret int 0
> }
>
> The assembler produced by the X86 backend is:
>
>         call printf
> ........
>         .globl _2E_str_1
>         .data
>         .align 1
>         .type _2E_str_1,@object
>         .size _2E_str_1,11
> _2E_str_1:
>
> That is, the name of ".str_1" is mangled, but the name of the function is not.
> I don't see the reason for handling those two kinds of names differently.

We don't mangle names unless we have to. The NameMangler interface
encapsulates this behavior. In particular we only mangle a name if it's
an internal symbol, if there are two globals named the same thing, or if
there is an invalid character (like '.') in the name.
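
As a rough illustration of the invalid-character case only, here is a sketch that reproduces the `_2E_str_1` spelling seen in the assembly output quoted earlier. The function names and the exact escaping scheme are assumptions inferred from that one example. This is not the NameMangler implementation, and it does not model the internal-symbol or duplicate-name cases.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Assumption for this sketch: letters, digits and '_' are acceptable in an
// assembler symbol; everything else must be escaped.
bool isPlainSymbolChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Replace each unacceptable character with "_XX_", where XX is its
// two-digit uppercase hexadecimal code. Names without such characters
// are returned unchanged.
std::string escapeInvalidChars(const std::string &name) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < name.size(); ++i) {
        char c = name[i];
        if (isPlainSymbolChar(c)) {
            out += c;
        } else {
            unsigned char u = static_cast<unsigned char>(c);
            out += '_';
            out += hex[u >> 4];
            out += hex[u & 0xF];
            out += '_';
        }
    }
    return out;
}
```

Under these assumptions `escapeInvalidChars(".str_1")` gives `_2E_str_1` ('.' is 0x2E), while `escapeInvalidChars("printf")` leaves the name untouched, which matches the difference observed in the output above.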

> > To me this seems a bit confusing -- it would be clearer if there were
> > plain functions with bodies and everything else were GlobalValue.
>
> The reason that we don't want to do this is that it makes it more
> difficult to create a function and then fill in its body. Currently when
> you create a function, you get a prototype. When you fill in its body,
> you now have a defined function. In your scheme, the function prototype
> and defined function objects would be different: to go from one to the
> other, you would have to delete the object and reallocate it.

> Can't you store all functions in the list of global values? That would

Sure, we could have one unified list I guess. It is very common to want
to iterate over just globals or just functions though. *shrug* I don't
think it makes that much of a difference, do you?

> be quite clear: all top-level module elements are global values, and are
> present in the global list.

All top-level module elements DO derive from GlobalValue, and are all
present in the gbegin/gend and begin/end lists owned by Module.

> The functions list can contain either functions both with and without
> bodies, or only those with bodies. In the latter case, when you create a
> function, it's added only to the global values list. When you add the first
> basic block, it's also added to the list of functions.

I really don't see the advantage of this. You're talking about adding a
third list that contains a union of the two? I don't see what you are
buying here.

> Thanks for the explanation. I don't have a use for SymbolTable yet, I was just
> wondering if I have to use it for something ;)

You probably won't. If you find yourself needing to, ask before you do as
there is probably an easier way to do whatever it is that you want to do.
:)

-Chris