GCC-like attributes and annotations

hi all,

As a matter of fact I am still using LLVM version 1.5, so I don't know
how 1.6 handles this.

When translating a complex C application to LLVM bytecode, some
semantics are lost:

Take for instance the attribute that puts a variable in the
thread-local data section (.tdata); this would be interesting to have in
LLVM.

In GCC you write:

int x __attribute__((section(".tdata")));

However, the LLVM bytecode (llvm-gcc -S) does not show any trace of this
attribute.

There is the Annotable base class, which is used to attach a
MachineFunction to an LLVM Function. Function supports Annotable, but as
of 1.5 GlobalVariable does not.

I would generally be interested in, and could contribute to, extending
LLVM to allow more Annotations than are currently possible.

Why not make things like Instructions Annotable too?
For instance, it would be nice if pointer-creating Instructions like
alloca, malloc, etc. were Annotable, so one could add symbolic
information about the type being used.

For instance:

struct A {
   int x;
};

struct B {
   int y;
};

get mapped to the same type -> { int }

%struct.A = type { int }
%struct.B = type { int }

BTW: How would one generate a type alias like the above through the LLVM
API?

  Is there a Type like TypeAlias? I couldn't find one. Here I
  have no way to set a name on a Type:

  std::vector< const Type * > tvec;
  tvec.push_back( Type::IntTy );
  Type * ty = StructType::get( tvec );

But what I wanted to say is that they are the same type.
This structural type system (as opposed to nominal type systems) is very
interesting and lends itself well to optimizations.

But sometimes it would be interesting to actually get symbol information
about a type being used, without needing full-featured debug
information a la DWARF.

This could also be solved by introducing Annotable.
For instance, the alloca/malloc/... instruction could get an Annotation
describing a symbolic type, which could look like:

{ ("x",int) }

One could then use the def/use and operand information in the bytecode to
know which symbolic field was accessed, for instance through getelementptr.
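
Just to sketch what I mean (entirely hypothetical; the names SymField and
SymbolicTypes are made up here, this is not an existing LLVM interface),
the front end could keep a small side table keyed by the allocating
instruction:

// hypothetical side table for symbolic struct layouts - not an LLVM API
#include <map>
#include <string>
#include <vector>
#include "llvm/Instruction.h"

struct SymField { std::string Name; std::string TypeName; };  // e.g. ("x", "int")
typedef std::vector<SymField> SymStruct;                      // one entry per field

// filled in by the front end when it emits the alloca/malloc for a struct A
std::map<const llvm::Instruction*, SymStruct> SymbolicTypes;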

I don't know how you feel about that, but I think there would be many
circumstances where Annotations could help get more information out
of the bytecode.

thanks in advance
-- Jakob

As a matter of fact I am still using LLVM version 1.5, so I don't know
how 1.6 handles this.

ok.

When translating a complex C application to LLVM bytecode, some
semantics are lost:

Take for instance the attribute that puts a variable in the
thread-local data section (.tdata); this would be interesting to have in
LLVM.

In GCC you write:

int x __attribute__((section(".tdata")));

However, the LLVM bytecode (llvm-gcc -S) does not show any trace of this
attribute.

LLVM 1.6 and the "new front-end" already handle this right. Here's the bugzilla bug corresponding to it:
http://llvm.cs.uiuc.edu/bugs/show_bug.cgi?id=659

There is the Annotable base class, which is used to attach a
MachineFunction to an LLVM Function. Function supports Annotable, but as
of 1.5 GlobalVariable does not.

I would generally be interested in, and could contribute to, extending
LLVM to allow more Annotations than are currently possible.

Why not make things like Instructions Annotable too?
For instance, it would be nice if pointer-creating Instructions like
alloca, malloc, etc. were Annotable, so one could add symbolic
information about the type being used.

At one point in time, Value was annotatable. The problem with this was twofold:

1. This bloated every value in the system by adding an extra pointer.
2. These annotations would get stale and not be updated correctly.

The problem is basically that adding annotations really amounts to extending the LLVM IR, and making it look like something simple doesn't make it easier to deal with. For example, if you add an "I'm special" attribute to an instruction, then the function is cloned by some pass, is that attribute copied or not? What if it is deleted, moved, rearranged, etc? Further, how can annotations be serialized to .ll files and .bc files? In llvm, we always want "opt -pass1 -pass2" to be the same as "opt -pass1 | opt -pass2", which would break if annotations can't be serialized (which they can't currently).

As a historical curiosity, Function still needs to be annotatable due to the LLVM code generator relying on it. This will be fixed in LLVM 1.8 and Function will not be annotable anymore.

If you *really* just want per-pass local data, you should just use an std::map from the Value* to your data.
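
For example (just a sketch; MyInfo is a made-up struct holding whatever
your pass needs):

#include <map>

struct MyInfo { bool Visited; unsigned Count; };  // per-pass data, never serialized

std::map<Value*, MyInfo> LocalData;  // lives only as long as your pass does
// ... later, e.g. in runOnFunction:
//   LocalData[V].Count++;           // record something about V without touching the IR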

%struct.A = type { int }
%struct.B = type { int }

BTW: How would one generate a type alias like the above through the LLVM
API?

Add two entries to the module symbol table for the same Type using Module::addTypeName.
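
Roughly like this (a sketch against the 1.x C++ API, assuming M is your
Module*):

std::vector<const Type*> Fields;
Fields.push_back(Type::IntTy);
StructType *STy = StructType::get(Fields);  // the one-and-only { int } type

M->addTypeName("struct.A", STy);  // two symbol table entries,
M->addTypeName("struct.B", STy);  // both naming the same Type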

But sometimes it would be interesting to actually get symbol information
about a type being used, without needing full-featured debug
information a la DWARF.

This isn't something you can do; it is far more tricky than you make it out to be. :)

This could also be solved by introducing Annotable.
For instance, the alloca/malloc/... instruction could get an Annotation
describing a symbolic type, which could look like:

{ ("x",int) }

One could then use the def/use and operand information in the bytecode to
know which symbolic field was accessed, for instance through getelementptr.

Again, this is effectively extending the LLVM IR. Calling it an 'annotation' doesn't make it simpler. :) Also, the front-end would have to be modified to generate the annotation.

I don't know how you feel about that, but I think there would be many
circumstances where Annotations could help get more information out
of the bytecode.

While I understand the general utility of annotations, the LLVM Annotation facility has several problems (some of which are described above) that make them not work well in practice. Even if they did, they would still have the "updating" class of problems, which I'm not sure how to solve.

If you think that this is something that would be really useful, can come up with solutions for these issues, and are willing to implement it, then this is the right place to talk about the design of the new facility. :)

-Chris

hi Chris!

thanks for your reply.
First of all, I did not know about the history of the Annotation stuff.
Annotable was, for me, just one way these things could be realized. So as I
see it right now, Annotable will completely vanish soon. This is
interesting to me.

Chris Lattner wrote:

When translating a complex C application to LLVM bytecode, some
semantics are lost:

LLVM 1.6 and the "new front-end" already handle this right. Here's the
bugzilla bug corresponding to it:
http://llvm.cs.uiuc.edu/bugs/show_bug.cgi?id=659

Great! The bug information is rather scarce. I would be interested in how
you implemented it. Did you add another bytecode entry for the section
value mapping? Is it possible to add attributes to other elements like
functions as well?

Did you think about a mapping of common attributes across different
platforms? For instance the DllMain entry point under Win32 and
__attribute__((constructor)) under Linux.

I would generally be interested in, and could contribute to, extending
LLVM to allow more Annotations than are currently possible.

Okay, so I have quite the opposite attitude to the LLVM team on
that issue :)

At one point in time, Value was annotatable. The problem with this was
twofold:

1. This bloated every value in the system by adding an extra pointer.
2. These annotations would get stale and not be updated correctly.

The problem is basically that adding annotations really amounts to
extending the LLVM IR, and making it look like something simple doesn't
make it easier to deal with. For example, if you add an "I'm special"
attribute to an instruction, then the function is cloned by some pass,
is that attribute copied or not? What if it is deleted, moved,
rearranged, etc? Further, how can annotations be serialized to .ll
files and .bc files? In llvm, we always want "opt -pass1 -pass2" to be
the same as "opt -pass1 | opt -pass2", which would break if annotations
can't be serialized (which they can't currently).

I get you 100% here. But as you say later in the mail, a lot of this
information is kept in some runtime std::map<Value*,foo> stuff, which is
really handy at runtime, but I *had* serialization in mind when I was
thinking about Annotations. I see annotations as a way to serialize some
extra information with the bytecode without having to extend/change the
core classes. The best way to implement it at runtime is to use some kind
of std::map subscripting, with the additional benefit that you can
serialize it to the bytecode. Perhaps the best of both worlds.

Two things here:
(1) Annotations should not be something which really changes the meaning
of a Value/Type. All the passes should work without the annotation.

(2) I think annotations are a handy way to augment the bytecode without
changing the bytecode format. They give people the freedom to add some
extra information. This is also interesting since changing the
bytecode/adding fields to Value/... is often not a real option when one
wants to work with production core libraries (as I do now).

Perhaps this could be solved by adding policy statements to
annotations. I could imagine that the inventor of an Annotation should
think about how the annotation should behave during optimization/change.
So the annotation should have a policy field which defaults to DontCare;
in that case the user of the Annotation cannot be sure that it will be
retained, or something like that.
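
In code that could be as small as the following (purely hypothetical, no
such interface exists in LLVM today):

// hypothetical retention policy carried by each annotation
enum AnnotationPolicy {
  DontCare,      // default: a pass may drop, copy or reorder the annotation
  DropOnChange,  // delete the annotation whenever the annotated value changes
  MustPreserve   // a pass that cannot maintain it should warn or refuse to run
};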

The discussion that happens around higher-level VMs (like Gilad
Bracha's paper on pluggable type systems) gives some hints about the
difficulties of changing instruction sets over time. I think core system
functionality is invariant, but meta information that is not essential
for the application to work should be pluggable too.

As a historical curiosity, Function still needs to be annotatable due to
the LLVM code generator relying on it. This will be fixed in LLVM 1.8
and Function will not be annotable anymore.

If you *really* just want per-pass local data, you should just use an
std::map from the Value* to your data.

Why not see Annotations as the means to serialize these maps? Maybe we
could add an Annotations table that maps Value types to ConstantPool
entries or something like that. This would also make things easier for
LLVM libraries in other languages.

%struct.A = type { int }
%struct.B = type { int }

BTW: How would one generate a type alias like the above through the LLVM
API?

Add two entries to the module symbol table for the same Type using
Module::addTypeName.

Very interesting. Do I then have to fetch the type by calling
Module::getTypeByName to get a second Type pointer, or?

since I saw that llvm-gcc generates code like:

%pa = alloca %struct.A
%pb = alloca %struct.B

this means that the AllocaInst must have knowledge of two types, which
can only be the case if there are two different Type pointers, right?

But sometimes it would be interesting to actually get symbol information
about a type being used, without needing full-featured debug
information a la DWARF.

This isn't something you can do; it is far more tricky than you make
it out to be. :)

Hehe, I know; certainly if someone like you says so. But *if* the
front end is aware of the annotation, which would be doable, and the
annotations are serializable in the bytecode, then one would have this
information during LLVM bytecode processing as well. One could also emit
the symbolic information, like relocation information, into a .section, or,
as I am currently working with the JIT, use the JIT's information about
the annotations to get the symbolic information.

This could also be solved by introducing Annotable.
For instance, the alloca/malloc/... instruction could get an Annotation
describing a symbolic type, which could look like:

{ ("x",int) }

One could then use the def/use and operand information in the bytecode to
know which symbolic field was accessed, for instance through
getelementptr.

Again, this is effectively extending the LLVM IR. Calling it an
'annotation' doesn't make it simpler. :) Also, the front-end would have
to be modified to generate the annotation.

See above. Every piece of information must be understood in order to be
usable. But I would do it as an annotation, since it is just additional
meta information and the program would run perfectly well without it.

I don't know how you feel about that, but I think there would be many
circumstances where Annotations could help get more information out
of the bytecode.

While I understand the general utility of annotations, the LLVM
Annotation facility has several problems (some of which are described
above) that make them not work well in practice. Even if they did, they
would still have the "updating" class of problems, which I'm not sure
how to solve.

If you think that this is something that would be really useful, can
come up with solutions for these issues, and are willing to implement
it, then this is the right place to talk about the design of the new
facility. :)

Hehe, yes. I am just getting comfortable with the framework and I think
it is very nice. If I come up with more points (which I hope I will) I will
definitely talk to you.

-- Jakob

Hi Jakob,

I have some thoughts on this too ..

I get you 100% here. But as you say later in the mail, a lot of this
information is kept in some runtime std::map<Value*,foo> stuff, which is
really handy at runtime, but I *had* serialization in mind when I was
thinking about Annotations. I see annotations as a way to serialize some
extra information with the bytecode without having to extend/change the
core classes. The best way to implement it at runtime is to use some kind
of std::map subscripting, with the additional benefit that you can
serialize it to the bytecode. Perhaps the best of both worlds.

Two things here:
(1) Annotations should not be something which really changes the meaning
of a Value/Type. All the passes should work without the annotation.

(2) I think annotations are a handy way to augment the bytecode without
changing the bytecode format. They give people the freedom to add some
extra information. This is also interesting since changing the
bytecode/adding fields to Value/... is often not a real option when one
wants to work with production core libraries (as I do now).

Perhaps this could be solved by adding policy statements to
annotations. I could imagine that the inventor of an Annotation should
think about how the annotation should behave during optimization/change.
So the annotation should have a policy field which defaults to DontCare;
in that case the user of the Annotation cannot be sure that it will be
retained, or something like that.

The discussion that happens around higher-level VMs (like Gilad
Bracha's paper on pluggable type systems) gives some hints about the
difficulties of changing instruction sets over time. I think core system
functionality is invariant, but meta information that is not essential
for the application to work should be pluggable too.

As Chris mentioned, I would prefer that we keep annotations out of the
core IR altogether as they are fraught with problems that are not easy
to resolve. However, I understand where you're coming from in wanting to
keep additional information with the bytecode. I have wanted the same
thing for use by front ends or specialized tools, for example an IDE that
could keep track of source information, or a language that needs special
passes that can only be run at link time.

In thinking about the "right" way to do this, I came up with the idea of
a single "blob" of data that could be appended to a Module. This single
"annotation" would always be ignored by LLVM, would not require
significant additional space to construct, and there is already a
mechanism for constructing the information via the bytecode reader's
handler interface (might need some extension).

This is simply a way of making that std::map of information embeddable
in the bytecode. It means the information is stored in one additional
bytecode block (at the end) where it doesn't have any impact on LLVM
(JIT/storage/etc). The only question is: how do multiple tools avoid
collision in this approach. Some kind of registry or partitioning of the
data could likely solve that.
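
To make that a bit more concrete, here is the kind of thing I have in
mind, as a pure sketch (the "ANNO" tag, the layout and appendBlob are all
invented here, not an existing LLVM facility): a tagged chunk written
after the bytecode, with a per-tool identifier to partition the data.

#include <fstream>
#include <string>

// Append <"ANNO"><id-len><tool-id><blob-len><blob> after the bytecode.
// The idea is that LLVM stops reading at the end of the Module and never
// looks at this trailing chunk; the tool id keeps different tools apart.
void appendBlob(const std::string &BCFile, const std::string &ToolID,
                const std::string &Blob) {
  std::ofstream Out(BCFile.c_str(), std::ios::binary | std::ios::app);
  unsigned IdLen = ToolID.size(), BlobLen = Blob.size();
  Out.write("ANNO", 4);
  Out.write(reinterpret_cast<const char*>(&IdLen), sizeof(IdLen));
  Out.write(ToolID.data(), IdLen);
  Out.write(reinterpret_cast<const char*>(&BlobLen), sizeof(BlobLen));
  Out.write(Blob.data(), BlobLen);
}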

> As a historical curiosity, Function still needs to be annotatable due to
> the LLVM code generator relying on it. This will be fixed in LLVM 1.8
> and Function will not be annotable anymore.
>
> If you *really* just want per-pass local data, you should just use an
> std::map from the Value* to your data.

Why not see Annotations as the means to serialize these maps? Maybe we
could add an Annotations table that maps Value types to ConstantPool
entries or something like that. This would also make things easier for
LLVM libraries in other languages.

This is similar to my idea above, but I wouldn't want to restrict it to
any particular data structure. The application can construct the data
however it wishes and simply pass a pointer to a block of memory to the
bytecode writer.

Reid

Hi Reid,

Reid Spencer wrote:

I have some thoughts on this too ..

Great!

I get you 100% here. But as you say later in the mail, a lot of this
information is kept in some runtime std::map<Value*,foo> stuff, which is
really handy at runtime, but I *had* serialization in mind when I was
thinking about Annotations. I see annotations as a way to serialize some
extra information with the bytecode without having to extend/change the
core classes. The best way to implement it at runtime is to use some kind
of std::map subscripting, with the additional benefit that you can
serialize it to the bytecode. Perhaps the best of both worlds.

...

As Chris mentioned, I would prefer that we keep annotations out of the
core IR altogether as they are fraught with problems that are not easy
to resolve. However, I understand where you're coming from in wanting to
keep additional information with the bytecode. I have wanted the same
thing for use by front ends or specialized tools, for example an IDE that
could keep track of source information, or a language that needs special
passes that can only be run at link time.

Yes.

In thinking about the "right" way to do this, I came up with the idea of
a single "blob" of data that could be appended to a Module. This single
"annotation" would always be ignored by LLVM, would not require
significant additional space to construct, and there is already a
mechanism for constructing the information via the bytecode reader's
handler interface (might need some extension).

As far as locality is concerned, perhaps it would make sense to have
such a blob on every primary object (module, function), so that
annotations that only apply to a certain function can be stored directly
in the function. That would make certain collisions easier to resolve.

This is simply a way of making that std::map of information embeddable
in the bytecode. It means the information is stored in one additional
bytecode block (at the end) where it doesn't have any impact on LLVM
(JIT/storage/etc). The only question is: how do multiple tools avoid
collision in this approach. Some kind of registry or partitioning of the
data could likely solve that.

Yes, that sounds like a doable approach. But I would not write raw binary
data into the blob; I would rather use an LLVM type encoding/table
approach. Many annotations are simple types or can be composites of simple
types, and people should be encouraged to store data in a way that makes
it possible to read it without library code. If you just serialize C++
structs, you end up relying heavily on the code that wrote them, which
makes it harder for tools to introspect annotations. Java's annotations
rely on simple types for the same reason, and I think it is the right way
for most things. There could be an opaque type for more complex
information, but its use should be discouraged.

This would also make it possible to have a triple of
(Value, AnnotationType, Name) to match the Annotation, which helps to
solve the collision problem too.

The lookup mechanism could look up by any part of the triple:
- Target Value
- AnnotationType
- Name

NULL values are wildcards.

So you could say:

Give me all annotations for a Value*

/// Function local annotations
Value* v = ...
vector< const Annotation *> &ans = curFunction->lookupAnnotation( v,
NULL, NULL);

Or based on a specific type:

/// Module wide annotations
AnnotationType *type = ...
vector< const Annotation *> &ans = module->lookupAnnotation( v, type, NULL );

These are just random thoughts though.

As a historical curiosity, Function still needs to be annotatable due to
the LLVM code generator relying on it. This will be fixed in LLVM 1.8
and Function will not be annotable anymore.

If you *really* just want per-pass local data, you should just use an
std::map from the Value* to your data.

Why not see Annotations as the means to serialize these maps? Maybe we
could add an Annotations table that maps Value types to ConstantPool
entries or something like that. This would also make things easier for
LLVM libraries in other languages.

This is similar to my idea above, but I wouldn't want to restrict it to
any particular data structure. The application can construct the data
however it wishes and simply pass a pointer to a block of memory to the
bytecode writer.

Great that we have a similar view. I would use a public, simple type
encoding for the annotations, so that annotations are introspectable
without knowing much about the details of the annotation data. This also
helps keep the bytecode free from language-specific data encodings.

-- Jakob

This is an interesting thread.

I think this would also help with compiling scripting languages such
as JavaScript/Python etc. We could keep the high-level metadata and
runtime binding info as language-specific bytecode in the file and
just have the parts that are easy to represent as compilable code in the
main object sections. There is no intrinsic reason for all the runtime
type information to get compiled into the core object module. I could
also bypass code that is difficult to compile and just stuff its
bytecode into this section. So I think this really helps with partial
compilation and with supporting languages that have complex runtimes.
The LLVM bytecode section would just get a stub runtime upcall for code
that is not compiled.

For Java, for example, this would probably be the compiled parts with
stubs and a regular class file for the runtime data, with compiled
functions converted to native code.

In the short term I think I'll simply use the class file format in my
natively compiled classes and wait and see how this turns out. I've been
stuck thinking about this for two months.

Thanks for the ideas.

Mike

Hi Mike,

I hope you are doing well with the LLVM gcjx backend. I am currently
writing an LLVM backend for a C-like language for tracing (like D in
dtrace), and I am very interested in this area. Do you currently put your
work in a repository? (Maybe, as Tom suggested, gcjx.sf.net would be an
easy start, since it would not require GCC committer status.) I am keen
on getting LLVM support for gcj. Maybe we could also wrap the LLVM
infrastructure in CNI so that the ecj compiler could target
LLVM more easily.

Mike Emmel wrote:

This is an interesting thread.

Thank you.

I think this would also help with compiling scripting languages such
as JavaScript/Python etc. We could keep the high-level metadata and
runtime binding info as language-specific bytecode in the file and
just have the parts that are easy to represent as compilable code in the
main object sections. There is no intrinsic reason for all the runtime
type information to get compiled into the core object module. I could
also bypass code that is difficult to compile and just stuff its
bytecode into this section. So I think this really helps with partial
compilation and with supporting languages that have complex runtimes.
The LLVM bytecode section would just get a stub runtime upcall for code
that is not compiled.

Hmm, I'm not sure I understand you 100% here. I think the most interesting
use for annotations is when you want to augment information at some point
in the bytecode, for instance when you want to label exactly this one
Value.

If, on the other hand, you are developing higher-level constructs like a
symbolic dispatch facility for dynamic languages, you could just as well put
the information in C-like data structures. Even *if* you want to add raw
bytecode into the modules, which I think is better associated externally
the way gcj-dbtool does it, you just need some kind of
blob, not really annotations.

Btw: I am very interested in the dynamic languages project you are
mentioning. Do you have any dynamic language frontends in use? The PyPy
project, I think, is targeting LLVM too. I could imagine that this meta
information is just stored in plain C structures.

For Java, for example, this would probably be the compiled parts with
stubs and a regular class file for the runtime data, with compiled
functions converted to native code.

Hmm, maybe we should follow the gcj approach, or at least use an
interchangeable metadata spec. At last year's FOSDEM there was a short
discussion about a more general metadata format, which would make
gcj-generated object code self-contained and not in need of the class
files when compiling. Currently, AFAIK, gcj uses Class structures to
represent the runtime meta information (for getClass() and reflection
stuff as well as the indirect dispatch). You could model this approach.
I think class-level metadata in special ELF sections, for instance, could
provide a good way to make the gcj-generated code more abstract, in a way
that external tools like linkers could understand. Since LLVM supports
special sections now, we could use a similar approach here.

In the short term I think I'll simply use the class file format in my
natively compiled classes and wait and see how this turns out. I've been
stuck thinking about this for two months.

So you are currently compiling class files to LLVM modules and also
placing the Java .class file information in the LLVM bytecode? Or am I
missing something here?

--Jakob

Hi Mike,

I hope you are doing well with the LLVM gcjx backend. I am currently
writing an LLVM backend for a C-like language for tracing (like D in
dtrace), and I am very interested in this area. Do you currently put your
work in a repository? (Maybe, as Tom suggested, gcjx.sf.net would be an
easy start, since it would not require GCC committer status.) I am keen
on getting LLVM support for gcj. Maybe we could also wrap the LLVM
infrastructure in CNI so that the ecj compiler could target
LLVM more easily.

Hmm, I'll have to look at the data structures more; I've not really
thought about a mapping to Java. There is an LLVM backend as part of PyPy,
the Python compiler; I'm looking right now at what they do.

Mike Emmel wrote:
> This is an interesting thread.
Thank you.

>
> I think this would also help with compiling scripting languages such
> as JavaScript/Python etc. We could keep the high-level metadata and
> runtime binding info as language-specific bytecode in the file and
> just have the parts that are easy to represent as compilable code in the
> main object sections. There is no intrinsic reason for all the runtime
> type information to get compiled into the core object module. I could
> also bypass code that is difficult to compile and just stuff its
> bytecode into this section. So I think this really helps with partial
> compilation and with supporting languages that have complex runtimes.
> The LLVM bytecode section would just get a stub runtime upcall for code
> that is not compiled.
>

Hmm, I'm not sure I understand you 100% here. I think the most interesting
use for annotations is when you want to augment information at some point
in the bytecode, for instance when you want to label exactly this one
Value.

If, on the other hand, you are developing higher-level constructs like a
symbolic dispatch facility for dynamic languages, you could just as well put
the information in C-like data structures. Even *if* you want to add raw
bytecode into the modules, which I think is better associated externally
the way gcj-dbtool does it, you just need some kind of
blob, not really annotations.

Btw: I am very interested in the dynamic languages project you are
mentioning. Do you have any dynamic language frontends in use? The PyPy
project, I think, is targeting LLVM too. I could imagine that this meta
information is just stored in plain C structures.

Yep, I just found it.
That was the route I was taking initially; then the more I thought about it,
the more I wondered why we should hide metadata in C structs. It's certainly
the common practice, but it's not clear it's the best. My first thought was
the Java class file format; now I've even moved beyond that: why not XML?
If at some point size is a problem, you can gzip it.
It's a format that is easy to use from a huge variety of tools, and it
actually parses fairly nicely.
If performance becomes a problem, it's a good starting point for translation
to a faster format, plus you can consider global optimizations, for example
a unified string table for a package.

I can't really come up with a good reason not to do it in XML; there
are plenty of traditional compilers. It's worth exploring this approach.

> For Java, for example, this would probably be the compiled parts with
> stubs and a regular class file for the runtime data, with compiled
> functions converted to native code.

Hmm, maybe we should follow the gcj approach, or at least use an
interchangeable metadata spec. At last year's FOSDEM there was a short
discussion about a more general metadata format, which would make
gcj-generated object code self-contained and not in need of the class
files when compiling. Currently, AFAIK, gcj uses Class structures to
represent the runtime meta information (for getClass() and reflection
stuff as well as the indirect dispatch). You could model this approach.
I think class-level metadata in special ELF sections, for instance, could
provide a good way to make the gcj-generated code more abstract, in a way
that external tools like linkers could understand. Since LLVM supports
special sections now, we could use a similar approach here.

Yep, this is true, but again, an initial XML format in my opinion makes
sense to facilitate these types of translations. I think there may be
several that are useful in different circumstances.

1.) XML format for development, debugging, and wrapper generation.
2.) Classfile format, which may work better with "traditional" JVMs.
3.) ELF-based format for ELF systems.
4.) Weird formats for embedded systems, especially ones where the total
amount of code is fixed or well understood; this includes stripping out
unused code etc.
5.) An LLVM/JIT-friendly format?

My point is that there are a lot of formats that may be optimal, but by
starting with XML you can easily convert to the best format when the code
is deployed.

>
> In the short term I think I'll simply use the class file format in my
> natively compiled classes and wait and see how this turns out. I've been
> stuck thinking about this for two months.
So you are currently compiling class files to LLVM modules and also
placing the Java .class file information in the LLVM bytecode? Or am I
missing something here?
>

I've not got that far; I'm walking the tree converting methods, and I just
started working on the class definition.

Also, I've got another task of upgrading the WebKit GTK port that I
need to do right now; I also just got the DirectFB backend into the
mainline GTK CVS. But as soon as the browser upgrade is done I'll get
back to work; I was also stalled conceptually till now.
Oh, and I moved from Boston to Chicago to LA in less than four months :(

Also, I've been thinking that, instead of using the traditional approach of
calling the compiled methods with a pointer to the object, I want to
do it differently.

Generally a method is converted to native code like this:

class foo {
  void bar(){}
}

becomes in C

void bar( objptr *foo );

But why not this...

void bar( clazz *foocls, instanceptr *fooinstance, void *vtable );

So instead of defining a fixed native struct for a class such as

struct foo {
   clazz *cls;
   vtable *my_vtable;
   int instancedata1;
};

Or something like that. We break out the three pieces of info needed in
native code: the class pointer for class static variables, the instance
variable struct pointer, and the vtable for virtual methods.

The cool thing is this works for any language that has the concepts of
class objects, instance objects and methods.
Even if it does not have class objects, it can still call the native
method with two objects of the right type, since they are just plain C
structs.

The XML metadata fits in nicely since it would, say, allow both Java
and Python to use the same native library. The price is that you are filling
a lot more registers with args, but generally you have either machines
with a few registers or a bunch, so I'm not sure that's a huge deal.
ARM is the only thing that comes to mind where this may cause problems.

I'm sure there is a chance to optimize out any or all of the args
depending on whether they are used in the call. This could easily be
reflected in the metadata.

So the idea now is to break out the class object and vtable pointers for
the native methods and put a ton of info in an XML metadata section.

Finally, I'm wondering if introducing an interpreter-style stack may make
sense, with so many args and the fact that we may want to eliminate some
based on use.

But even here I'm thinking of two entry points: a pure stack one and a
register one.

So for interpreters calling in, we would have:

void bar( StackPtr *stack) {

   public barInner :
       ( clazz * stack[0], objptr * stack[1], vtable * stack[2] ..... ){

    }

}
So I can publish two entry points: one that takes args from memory and
converts to the native calling convention, and a second that takes the
native convention directly.

I'm trying to write nested functions in a C-like language, with the
inner function also being public.

Again, these entry points can be published in the metadata and, if
wanted, the outer one could for example be stripped off.

Mike

[snipped older conversation that I've now drifted far away from]

thanks for your reply.

Sorry for the delay, I've been buried in email lately.

When translating a complex C application to LLVM bytecode, some
semantics are lost:

LLVM 1.6 and the "new front-end" already handle this right. Here's the
bugzilla bug corresponding to it:
http://llvm.cs.uiuc.edu/bugs/show_bug.cgi?id=659

Great! The bug information is rather scarce. I would be interested in how
you implemented it. Did you add another bytecode entry for the section
value mapping? Is it possible to add attributes to other elements like
functions as well?

Yes, it was added to the .ll/.bc formats:
http://llvm.cs.uiuc.edu/docs/LangRef.html#globalvars
http://llvm.cs.uiuc.edu/docs/BytecodeFormat.html#globalinfo

Did you think about a mapping of common attributes across different
platforms? For instance the DllMain entry point under Win32 and
__attribute__((constructor)) under Linux.

__attribute__((constructor)) is handled with the llvm.global_ctors global variable (even with llvm 1.6); try it out.
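
E.g. (sketch):

/* ctor.c: with llvm-gcc this needs no section attribute; the function
   ends up referenced from the llvm.global_ctors appending global */
__attribute__((constructor))
static void init_me(void) {
  /* runs before main() */
}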

I would generally be interested in, and could contribute to, extending
LLVM to allow more Annotations than are currently possible.

Okay, so I have quite the opposite attitude to the LLVM team on
that issue :)

I don't follow.

At one point in time, Value was annotatable. The problem with this was
twofold:

1. This bloated every value in the system by adding an extra pointer.
2. These annotations would get stale and not be updated correctly.

The problem is basically that adding annotations really amounts to
extending the LLVM IR, and making it look like something simple doesn't
make it easier to deal with. For example, if you add an "I'm special"
attribute to an instruction, then the function is cloned by some pass,
is that attribute copied or not? What if it is deleted, moved,
rearranged, etc? Further, how can annotations be serialized to .ll
files and .bc files? In llvm, we always want "opt -pass1 -pass2" to be
the same as "opt -pass1 | opt -pass2", which would break if annotations
can't be serialized (which they can't currently).

I get you 100% here. But as you say later in the mail, a lot of this
information is kept in some runtime std::map<Value*,foo> stuff, which is
really handy at runtime, but I *had* serialization in mind when I was
thinking about Annotations.

Okay, if you want to serialize/deserialize, they become much more palatable, the implementation just gets stickier.

I see annotations as a way to serialize some extra
information with the bytecode without having to extend/change the core
classes. The best way to implement it at runtime is to use some kind of
std::map subscripting, with the additional benefit that you can
serialize it to the bytecode. Perhaps the best of both worlds.

That's fine, but don't think that makes them solve all of the problems. Again, there is still the updating issue.

Two things here:
(1) Annotations should not be something which really changes the meaning
of a Value/Type. All the passes should work without the annotation.

Okay, what use are they then?

Note that source language types are not unique in LLVM, and they shouldn't be even with annotations. For example:

struct X { int A; };
struct Y { int B; };

Both X and Y map to the same LLVM Type. This cannot change.
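
You can see that uniquing directly through the C++ API (a sketch with
1.x-era type names, usual LLVM headers assumed):

std::vector<const Type*> Elts;
Elts.push_back(Type::IntTy);
StructType *XTy = StructType::get(Elts);  // "struct X"
StructType *YTy = StructType::get(Elts);  // "struct Y"
assert(XTy == YTy && "isomorphic struct types are the same LLVM Type");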

(2) I think annotations are a handy way to augment the bytecode without
changing the bytecode format. They give people the freedom to add some
extra information. This is also interesting since changing the
bytecode/adding fields to Value/... is often not a real option when one
wants to work with production core libraries (as I do now).

Okay.

Perhaps this could be solved by adding policy statements to
annotations. I could imagine that the inventor of an Annotation should
think about how the annotation should behave during optimization/change.
So the annotation should have a policy field which defaults to DontCare;
in that case the user of the Annotation cannot be sure that it will be
retained, or something like that.

Personally, I see annotations as a convenient way to do experiments and allow rapid development. If we decide that a feature makes sense in the LLVM IR long term, it should be added as a first class feature of it.

%struct.A = type { int }
%struct.B = type { int }

BTW: How would one generate a type alias like the above through the LLVM
API?

Add two entries to the module symbol table for the same Type using
Module::addTypeName.

Very interesting. Do I then have to fetch the type by calling
Module::getTypeByName to get a second Type pointer, or?

Again, see above, there is no way to distinguish between two source level types that have the same structure.

since I saw that llvm-gcc generates code like:

%pa = alloca %struct.A
%pb = alloca %struct.B

this means that the AllocaInst must have knowledge of two types, which
can only be the case if there are two different Type pointers, right?

This is an implementation detail of the old llvm-gcc that breaks with the new one. Do not depend on it.

But sometimes it would be interesting to actually get symbol information
about a type being used, without needing full-featured debug
information a la DWARF.

This isn't something you can do; it is far more tricky than you make
it out to be. :)

Hehe, I know; certainly if someone like you says so. But *if* the
front end is aware of the annotation, which would be doable, and the
annotations are serializable in the bytecode, then one would have this
information during LLVM bytecode processing as well.

Yes. However, there would be no way to keep isomorphic LLVM types separate. This dramatically limits the usefulness of what you're trying to do.

One could also emit the symbolic information, like relocation information, into a .section, or, as I am currently working with the JIT, use the JIT's information about the annotations to get the symbolic information.

I don't understand.

This could also be solved by introducing Annotable.
For instance, the alloca/malloc/... instruction could get an Annotation
describing a symbolic type, which could look like:

{ ("x",int) }

One could then use the def/use and operand information in the bytecode to
know which symbolic field was accessed, for instance through
getelementptr.

Again, this is effectively extending the LLVM IR. Calling it an
'annotation' doesn't make it simpler. :) Also, the front-end would have
to be modified to generate the annotation.

See above. Every piece of information must be understood in order to be
usable. But I would do it as an annotation, since it is just additional
meta information and the program would run perfectly well without it.

Hopefully I made the issue more clear above.

-Chris

In thinking about the "right" way to do this, I came up with the idea of
a single "blob" of data that could be appended to a Module. This single
"annotation" would always be ignored by LLVM, would not require
significant additional space to construct,

If you're talking about a blob of binary data, which does not have "pointers" into the LLVM code, this is fine. However, there is no reason it needs to be stored "in" the BC format. It can live "next" to it in the same file without impacting the bc format.

and there is already a
mechanism for constructing the information via the bytecode reader's
handler interface (might need some extension).

Not clear. If you're having "pointers" from the blob into the LLVM code (which I believe you would need for it to be useful), then the compiler code that manipulates the LLVM IR has to be aware of these pointers, must be able to update them when it makes changes, etc. This is extremely nontrivial: exactly the same problem as full fledged annotations.

Additionally, the LLVM linker would have to know how to merge these blobs of data (simple concatenation would be fine, more complex things probably not).

(JIT/storage/etc). The only question is: how do multiple tools avoid
collision in this approach. Some kind of registry or partitioning of the
data could likely solve that.

Also an issue.

-Chris

I think this would also help with compiling scripting languages such
as JavaScript/Python etc. We could keep the high-level metadata and
runtime binding info as language-specific bytecode in the file and
just have the parts that are easy to represent as compilable code in the
main object sections.

Attaching (e.g.) Java bytecode or JavaScript code to an LLVM module shouldn't be a problem. The LLVM bcreader can read LLVM bytecode from a subset of a file.

There is no intrinsic reason for all the runtime
type information to get compiled into the core object module. I could
also bypass code that is difficult to compile and just stuff its
bytecode into this section. So I think this really helps with partial
compilation and with supporting languages that have complex runtimes.
The LLVM bytecode section would just get a stub runtime upcall for code
that is not compiled.

Sure, the JIT does something similar to this, without annotations.

For Java, for example, this would probably be the compiled parts with
stubs and a regular class file for the runtime data, with compiled
functions converted to native code.

Makes sense. Again, you can do this today, without annotations.

-Chris

> In thinking about the "right" way to do this, I came up with the idea of
> a single "blob" of data that could be appended to a Module. This single
> "annotation" would always be ignored by LLVM, would not require
> significant additional space to construct,

If you're talking about a blob of binary data, which does not have
"pointers" into the LLVM code, this is fine. However, there is no reason
it needs to be stored "in" the BC format. It can live "next" to it in the
same file without impacting the bc format.

That's what I meant .. just a chunk after the BC content (BC=Module).

As for pointers .. I was thinking about exposing slot numbers or something
so you can get a handle on values.

> and there is already a
> mechanism for constructing the information via the bytecode reader's
> handler interface (might need some extension).

Not clear. If you're having "pointers" from the blob into the LLVM code
(which I believe you would need for it to be useful), then the compiler
code that manipulates the LLVM IR has to be aware of these pointers, must
be able to update them when it makes changes, etc. This is extremely
nontrivial: exactly the same problem as full fledged annotations.

Yeah, I'm not sure either. I was just thinking that the Handler
interface could be used to expose the constructs being parsed ..
however, on deeper reflection, that is only called on *reading* and not
at all on *writing* .. so there goes that idea.

Additionally, the LLVM linker would have to know how to merge these blobs
of data (simple concatenation would be fine, more complex things probably
not).

Yeah, concatenation is fine.

> (JIT/storage/etc). The only question is: how do multiple tools avoid
> collision in this approach. Some kind of registry or partitioning of the
> data could likely solve that.

Also an issue.

Yup.

hi,

Chris Lattner wrote:

thanks for your reply.

Yes, it was added to the .ll/.bc formats:
http://llvm.cs.uiuc.edu/docs/LangRef.html#globalvars
http://llvm.cs.uiuc.edu/docs/BytecodeFormat.html#globalinfo

Interesting. I will check it out.

Did you think about a mapping of common attributes across different
platforms? For instance the DllMain entry point under Win32 and
__attribute__((constructor)) under Linux.

__attribute__((constructor)) is handled with the llvm.global_ctors
global variable (even with llvm 1.6); try it out.

Great!

Okay, so I have quite the opposite attitude to the LLVM team on
that issue :)

I don't follow.

All I wanted to say here is that, while I thought Function was the first
Value to be annotatable, it turned out to be the last :)

At one point in time, Value was annotatable. The problem with this was
twofold:

1. This bloated every value in the system by adding an extra pointer.
2. These annotations would get stale and not be updated correctly.

The problem is basically that adding annotations really amounts to
extending the LLVM IR, and making it look like something simple doesn't
make it easier to deal with. For example, if you add an "I'm special"
attribute to an instruction, then the function is cloned by some pass,
is that attribute copied or not? What if it is deleted, moved,
rearranged, etc? Further, how can annotations be serialized to .ll
files and .bc files? In llvm, we always want "opt -pass1 -pass2" to be
the same as "opt -pass1 | opt -pass2", which would break if annotations
can't be serialized (which they can't currently).

I get you 100% here. But as you say later in the mail, a lot of this
information is kept in some runtime std::map<Value*,foo> stuff, which is
really handy at runtime, but I *had* serialization in mind when I was
thinking about Annotations.

Okay, if you want to serialize/deserialize, they become much more
palatable, the implementation just gets stickier.

Hmm, not sure I understand you here. What I don't want is for the spec
and the implementation to intermix. I think if there is a serialization in
use it should have a well-known format that is easy to operate on
with small tools. It should be made of primitive types that
can be composed (like the Struct Type, for instance). For instance, an
annotation value could refer to a type slot, which means Annotations use
the LLVM types.

I see annotations as a way to serialize some extra
information with the bytecode without having to extend/change the core
classes. The best way to implement it at runtime is to use some kind of
std::map subscripting, with the additional benefit that you can
serialize it to the bytecode. Perhaps the best of both worlds.

That's fine, but don't think that makes them solve all of the problems.
Again, there is still the updating issue.

Hmm, I don't have too much experience here, so I think you are right in
this regard. See below. Perhaps the update issue is really in the domain
of the annotation writer. If the author does not think about it, then
the combination of two annotations should be left out. I think one could
implement them using a callback. But see below.

Two things here:
(1) Annotations should not be something which really changes the meaning
of a Value/Type. All the passes should work without the annotation.

Okay, what use are they then?

I think the difference is mission-critical versus merely useful. I think
metadata of this kind is very useful, but it should not stop non-aware
passes from running. So it is optional. I put it the wrong way above:
Values should get augmented with more metadata, but that should not change
the way the other passes work.

Note that source language types are not unique in LLVM, and they
shouldn't be even with annotations. For example:

struct X { int A; };
struct Y { int B; };

Both X and Y map to the same LLVM Type. This cannot change.

This is alright, and I am aware of that. LLVM has a structurally
equivalent type system, and I think it is the right thing in terms of the LIR.
But you can, for instance, tag the alloca instruction with an annotation
which adds symbolic type information. Since the alloca binds the type to
a stack location, this would be an option (and other
allocation/getelementptr instructions as well).

Perhaps this could be solved by adding policy statements to
annotations. I could imagine that the inventor of an Annotation should
think about how the annotation should behave during optimization/change.
So the annotation should have a policy field which defaults to DontCare;
in that case the user of the Annotation cannot be sure that it will be
retained, or something like that.

Personally, I see annotations as a convenient way to do experiments and
allow rapid development. If we decide that a feature makes sense in the
LLVM IR long term, it should be added as a first class feature of it.

Hmm, in my view most of the annotations should be just metadata, like
the example above. But if it turns out that an annotation is more than
that, and valuable, this would be the right way to do it.

since I saw that llvm-gcc generates code like:

%pa = alloca %struct.A
%pb = alloca %struct.B

this means that the AllocaInst must have knowledge of two types, which
can only be the case if there are two different Type pointers, right?

This is an implementation detail of the old llvm-gcc that breaks with
the new one. Do not depend on it.

Okay. Thanks for the info.

Hehe, I know; certainly if someone like you says so. But *if* the
front end is aware of the annotation, which would be doable, and the
annotations are serializable in the bytecode, then one would have this
information during LLVM bytecode processing as well.

Yes. However, there would be no way to keep isomorphic LLVM types
separate. This dramatically limits the usefulness of what you're trying
to do.

Maybe I was not clear.
I want to attach extra information at the point where variables are used,
for instance. This information is optional, so I don't divide isomorphic
LLVM types. Every type is bound to a variable through a special
instruction. If you annotate these instructions you can always find the
symbolic type of such a variable. This is just one use of annotations.
It clearly does not change the meaning of instructions, but only attaches
meta information.

For later use, you can then exploit the information, for instance:
-> in the JIT (easy)
-> during code generation, where you can add symbol information to memory
addresses much like the relocation entries in ELF, so you can use the
address of the variable to get its symbolic type.

See above. Every piece of information must be understood in order to be
usable. But I would do it as an annotation, since it is just additional
meta information and the program would run perfectly well without it.

Hopefully I made the issue more clear above.

Hope so too. Please tell me what you think. But I could see many uses for
simple meta annotations.