Packages

I’m wondering if LLVM has or should have support for a grouping of modules (which I’ll call a package). That is, a package is a partial program that contains many (probably related) modules. One might roughly compare it to a shared library. The reason this is important is that we want to (a) distribute packages, not individual modules, and (b) optimize the entire package (e.g. IPO) as a unit before distribution.

In other words, I’d like to take a set of bytecode files, optimize them together even though they don’t form a complete program, and then write out the new (optimized) bytecode files. It would be preferable to write them out to a single archive rather than to individual bytecode files again.

Can this be done in LLVM today? If not, what would it take to implement?

Reid.

> In other words, I'd like to take a set of bytecode files, optimize them
> together even though they don't form a complete program, and then write
> out the new (optimized) bytecode files. It would be preferable to write
> them out to a single archive rather than to individual bytecode files
> again. Can this be done in LLVM today? If not, what would it take to
> implement?

I'm not sure exactly what you want, but we do already have a form of this.
If you use the low-level llvm-link program (or gccld --link-as-library),
it will take all of the .bc files specified, link them together into a
_single_ bytecode file, and write that out (gccld performs IPO on the
result).
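
Roughly, that looks like this on the command line (file names invented; the exact flags may vary by release):

    llvm-link a.bc b.bc c.bc -o package.bc          # just links, no IPO
    gccld --link-as-library a.bc b.bc c.bc -o pkg   # links and runs IPO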

Is this what you mean? Is there any reason to keep the .bc files distinct
like a ".a" file, or is it ok for your purposes to link them together into
a single unit, like a ".so" file?

-Chris

> > In other words, I'd like to take a set of bytecode files, optimize them
> > together even though they don't form a complete program, and then write
> > out the new (optimized) bytecode files. It would be preferable to write
> > them out to a single archive rather than to individual bytecode files
> > again. Can this be done in LLVM today? If not, what would it take to
> > implement?
>
> I'm not sure exactly what you want, but we do already have a form of this.
> If you use the low-level llvm-link program (or gccld --link-as-library),
> it will take all of the .bc files specified, link them together into a
> _single_ bytecode file, and write that out (gccld performs IPO on the
> result).

Okay.

> Is this what you mean? Is there any reason to keep the .bc files distinct
> like a ".a" file, or is it ok for your purposes to link them together into
> a single unit, like a ".so" file?

Well, the answer depends on whether the individual modules are separately loadable or not. Suppose the
resulting bytecode file gets JIT loaded into a running program. Is it all or nothing? What if the calling program
knew it needed a specific module from the larger optimized “package”? Can the original modules, now
combined into the one .bc file, get loaded individually?

For example, using Java, suppose I wanted to use java.util.zip.CRC32 to compute a checksum in a program.
That class (module) is delivered to me as a java.util.zip “package”. That is, it is the result of running llvm-link
to produce a single bytecode file that includes all the classes with java.util.zip prefix. But (for some mysterious
reason) the java.util.zip.CRC32 class was not compiled/linked/optimized into my program; it’s just referenced
from it. At runtime, the JIT needs to locate this module and compile/link it into my program. Now, if the
CRC32 class (module) is part of a java.util.zip package (i.e. the output of llvm-link), what happens?
Is the entire package loaded providing more than my program needs? Is the runtime linker smart enough to
locate the CRC32 module in the optimized package output? Is it smart enough to load only the one module
needed? Does the notion of module inside a module even exist?

P.S. Bug 114 (-Wold-style-cast) has a patch ready.

Reid

> > Is this what you mean? Is there any reason to keep the .bc files distinct
> > like a ".a" file, or is it ok for your purposes to link them together into
> > a single unit, like a ".so" file?
>
> Well, the answer depends on whether the individual modules are
> separately loadable or not. Suppose the resulting bytecode file gets JIT
> loaded into a running program. Is it all or nothing? What if the calling
> program knew it needed a specific module from the larger optimized
> "package"? Can the original modules, now combined into the one .bc file,
> get loaded individually?

No, it's all or nothing. Once linked, they cannot be separated (easily).
However, especially when using the JIT, there is little overhead for
running a gigantic program that only has 1% of the functions in it ever
executed...

> For example, using Java, suppose I wanted to use java.util.zip.CRC32 to
> compute a checksum in a program. That class (module) is delivered to me
> as a java.util.zip "package". That is, it is the result of running
> llvm-link to produce a single bytecode file that includes all the
> classes with java.util.zip prefix. But (for some mysterious reason) the
> java.util.zip.CRC32 class was not compiled/linked/optimized into my
> program; it's just referenced from it. At runtime, the JIT needs to
> locate this module and compile/link it into my program. Now, if the
> CRC32 class (module) is part of a java.util.zip package (i.e. the output
> of llvm-link), what happens? Is the entire package loaded providing more
> than my program needs? Is the runtime linker smart enough to locate the
> CRC32 module in the optimized package output? Is it smart enough to load
> only the one module needed? Does the notion of module inside a module
> even exist?

There are multiple different ways to approach these questions depending on
what we want to do and what the priorities are. There are several good
solutions, but for now, everything needs to be statically linked. I
expect this to change over the next month or so.

> P.S. Bug 114 (-Wold-style-cast) has a patch ready.

I realize that. I will get to it this afternoon, I'm still catching up
from PLDI.

-Chris

> No, it's all or nothing. Once linked, they cannot be separated (easily).
> However, especially when using the JIT, there is little overhead for
> running a gigantic program that only has 1% of the functions in it ever
> executed...

Perhaps in the general case, but what if it’s running on an embedded system and the “gigantic program”
causes an out-of-memory condition?

> There are multiple different ways to approach these questions depending on
> what we want to do and what the priorities are. There are several good
> solutions, but for now, everything needs to be statically linked. I
> expect this to change over the next month or so.

When you have time, I’d like to hear what you’re planning in this area as it will directly affect how I
build my compiler and VM.

Thanks,

Reid.

> > No, it's all or nothing. Once linked, they cannot be separated (easily).
> > However, especially when using the JIT, there is little overhead for
> > running a gigantic program that only has 1% of the functions in it ever
> > executed...
>
> Perhaps in the general case, but what if it's running on an embedded
> system and the "gigantic program" causes an out-of-memory condition?

The JIT doesn't even load unreferenced functions from the disk, so this
shouldn't be the case... (thanks to Misha for implementing this :)

Also, the globaldce pass deletes functions which can never be called by
the program, so large hunks of libraries get summarily removed from the
program after static linking.
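
For example (a sketch; opt reads and writes bytecode on stdin/stdout):

    opt -globaldce < linked.bc > pruned.bc   # strip functions that can never be called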

> > There are multiple different ways to approach these questions depending on
> > what we want to do and what the priorities are. There are several good
> > solutions, but for now, everything needs to be statically linked. I
> > expect this to change over the next month or so.
>
> When you have time, I'd like to hear what you're planning in this area
> as it will directly affect how I build my compiler and VM.

What do you need, and what would you like? At this point there are
several solutions that make sense, but they have to be balanced against
practical issues. For example, say we do IPO across the package, and then
one of the members gets updated. How do we know to invalidate the results?

As I think that I have mentioned before, one long-term way of implementing
this is to attach analysis results to bytecode files as well as the code.
Thus, you could compile libc, say, with LLVM to a "shared object" bytecode
file. While doing this, the optimizer could notice that "strlen" has no
side-effects, for example, and attach that information to the bytecode
file.

When linking a program that uses libc, the linker wouldn't pull in any
function bodies from "shared objects", but would read the analysis results
and attach them to the function prototypes in the program. This would
allow the LICM optimizer to hoist strlen calls out of loops when it makes
sense, for example.
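
To make that concrete, roughly (in 1.x-style assembly; everything except the placement of the call is elided):

    ; before: strlen is re-evaluated on every trip around the loop
    loop:
            %len  = call uint %strlen( sbyte* %str )
            %cont = setlt uint %i, %len
            ...

    ; after: knowing strlen has no side-effects, LICM can hoist the call
    entry:
            %len = call uint %strlen( sbyte* %str )
            ...
    loop:
            %cont = setlt uint %i, %len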

Of course there are situations when it is better to actually link the
function bodies into the program too. In the strlen example, it might be
the case that the program will go faster if strlen is inlined into a
particular call site.

I'm inclined to start simple and work our way up to these cases, but if
you have certain usage patterns in mind, I would love to hear them, and we
can hash out what will really get implemented...

-Chris

Chris Lattner wrote:

> As I think that I have mentioned before, one long-term way of implementing
> this is to attach analysis results to bytecode files as well as the code.
> Thus, you could compile libc, say, with LLVM to a "shared object" bytecode
> file. While doing this, the optimizer could notice that "strlen" has no
> side-effects, for example, and attach that information to the bytecode
> file.

While on the subject of annotating bytecode with analysis info, could I entice someone to also think about carrying other types of source-level annotations through into bytecode? This is particularly useful when one wants to use the LLVM infrastructure for its whole-program optimization capabilities but wouldn't want to give up the ability to debug the final product binary. At the moment, my understanding is that source code annotations like file names, line numbers, etc. aren't carried through. When one gets around to linking the whole program, you end up with a single .s file of native machine code (which by now is a giant collection of bits picked up from a multitude of source files) with no ability to do symbolic debugging on the resulting binary...

> While on the subject of annotating bytecode with analysis info, could I
> entice someone to also think about carrying other types of source-level
> annotations through into bytecode? This is particularly useful when
> one wants to use the LLVM infrastructure for its whole-program
> optimization capabilities but wouldn't want to give up the ability to
> debug the final product binary. At the moment, my understanding is that
> source code annotations like file names, line numbers, etc. aren't
> carried through. When one gets around to linking the

Yes, this is very true. This is on my medium-term todo list. LLVM will
definitely support this, it's just that we want to do it right and we are
focusing on other issues at the moment (like performance).

At the moment, the best way to debug LLVM compiled code is to use the C
backend, compile with -g, and suffer through the experience. :(
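
That is, something like (flag spellings from memory):

    llc -march=c program.bc -o program.c
    gcc -g program.c -o program     # then debug the CBE's output with gdb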

Luckily, when writing LLVM optimizations and such, bugpoint makes things
much much nicer. :)

-Chris

> The JIT doesn't even load unreferenced functions from the disk, so this
> shouldn't be the case... (thanks to Misha for implementing this :)
>
> Also, the globaldce pass deletes functions which can never be called by
> the program, so large hunks of libraries get summarily removed from the
> program after static linking.

Oh! Good news!

> > There are multiple different ways to approach these questions depending on
> > what we want to do and what the priorities are. There are several good
> > solutions, but for now, everything needs to be statically linked. I
> > expect this to change over the next month or so.

> > When you have time, I'd like to hear what you're planning in this area
> > as it will directly affect how I build my compiler and VM.

> What do you need, and what would you like? At this point there are
> several solutions that make sense, but they have to be balanced against
> practical issues.

You'll find that I'm a pragmatic idealist. :)

So, ultimately, what I would >like< is a fully indexed bzip2 compressed
archive of byte code files that could be loaded or partially loaded into
a running program. Individual components of the package (i.e. the
modules it is composed of) would retain their identity through the
optimization and linking process. Such an archive would serve as the
"package" or "shared object" we referenced earlier in this discussion.
The JIT would still avoid loading unreferenced modules, functions, or
variables from the archive.

The construction of such an archive would first optimize the bytecode
together as a package (like running opt on the whole thing first) but
would ensure that external linkage functions and global variables were
not eliminated. Because the package is to be loaded as a plug-in,
certain functions and global variables must not be eliminated during
optimization. To support this kind of partial optimization, it might be
useful to add an "interface" keyword to LLVM assembly that indicates a
function or global variable that is not to be eliminated under any
circumstances.
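
Something like this hypothetical assembly (the "interface" keyword does not exist today; this is just the idea):

    ; hypothetical marker: never internalized or removed by IPO, even if
    ; no caller is visible at package-optimization time
    interface uint %crc32(sbyte* %buf, uint %len) { ... }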

Does this capability exist now under a different name?

What I >need< is something that fits the requirements for my VM (see
below).

> For example, say we do IPO across the package, and then
> one of the members gets updated. How do we know to invalidate the results?

Use the time stamp on the package's file?

If it changes, we'd essentially have to do the optimizations again?

> As I think that I have mentioned before, one long-term way of implementing
> this is to attach analysis results to bytecode files as well as the code.
> Thus, you could compile libc, say, with LLVM to a "shared object" bytecode
> file. While doing this, the optimizer could notice that "strlen" has no
> side-effects, for example, and attach that information to the bytecode
> file.

Yes, this is an excellent idea as it will avoid re-computation of
analysis results and thereby speed up optimized linking. See my
(upcoming) comments on this topic in response to Vipin's posting on the
same.

> When linking a program that uses libc, the linker wouldn't pull in any
> function bodies from "shared objects", but would read the analysis results
> and attach them to the function prototypes in the program. This would
> allow the LICM optimizer to hoist strlen calls out of loops when it makes
> sense, for example.

Sounds good.

> Of course there are situations when it is better to actually link the
> function bodies into the program too. In the strlen example, it might be
> the case that the program will go faster if strlen is inlined into a
> particular call site.

Yes, in fact, one might want to take a program running JIT and tell it
to just completely optimize it and write a native executable for the
next time it runs.

> I'm inclined to start simple and work our way up to these cases, but if
> you have certain usage patterns in mind, I would love to hear them, and we
> can hash out what will really get implemented...
>
> -Chris

Okay, at the risk of being verbose, here goes. I'm going to relate the
requested "usage patterns" to the virtual machine I'm constructing named
XVM.

In XVM, there are three essential parts to a program: the user's code,
the XVM code, and the XVM plug-ins. The user's code is compiled with
full knowledge of the services available from the XVM code, but not with
any knowledge of the plug-ins. The XVM completely hides (on purpose)
the details of the plug-ins. They are "implementation details".
Similarly, the plug-ins implement a standard API and are completely
unaware of the end-user's program (more on this below).

While I had originally intended to compile XVM itself to fast native
code using GCC, I'm beginning to believe that it might be more effective
to compile it to bytecode so that the XVM itself could be optimized for
the way the user's code uses it. Or, perhaps do it both ways and leave
the determination of how a program gets optimized (or not!) until it is
executed.

User's programs executed by the XVM do not have a "main". The XVM itself
implements "main", handles configuration, manages memory and access to
system resources very strictly, etc. What >is< executed from the user's
code are tasks. A task is simply a non-blocking function that has an
input queue and zero or more output queues. The queues are managed by
the XVM, not the task. The tasks run asynchronously to each other. Each
task registers for certain kinds of events and produces other events. To
get it started, there is an "initial event" that some task must register
for. Think of an event as the set of arguments to a function. When the
XVM executes a user's program, it simply generates the initial event and
provides processing time to these tasks. If the program generates a
terminate event, the XVM shuts down after allowing the program to
completely process the terminate event. As the program executes, there
are XVM operations that the user's program can invoke to load or unload
other tasks. In this way the nature of the "program" can change over
time. The reason for this is that a program might have multiple
operating modes (e.g. online vs. batch) during which completely
different sets of tasks execute independently or in conjunction with one
another (i.e. the online tasks might behave differently while batch is
executing but that is simply a fact of loading the batch tasks, not
something inherent or coded into the online tasks). There are many
program design choices that this facility enables. If you're familiar
with "active objects", a task is somewhat akin to that.

I imagine this scenario will be rather difficult for LLVM to optimize at
runtime. This is one of the reasons that I was asking about creating
optimized packages. Presumably the modules necessary to implement some
set of related tasks (e.g. the batch tasks) would all be combined and
optimized into a single package. Not everything in the package may get
executed because it depends on the events being generated by the other
tasks already executing. This is important because the XVM doesn't
necessarily know the full graph of tasks it will execute when it starts.
In general, it won't even know the full set before it finishes! Tasks
(and consequently the packages they live in) can be loaded very
dynamically by the user's program, even deciding to use variants of
similar task packages at different times or basing the decision on some
configuration file that gets read at runtime or any other criteria.
Indeed, it will, in general, not be possible for LLVM to know which
packages are going to eventually get loaded. How would LLVM address
optimizing such a program? My solution so far has been to just optimize
the hell out of the individual packages and leave it at that. But, given
that LLVM's thing is "continuous whole program optimization", can it
address an environment where it doesn't even know what the "whole
program" is?

Back to plug-ins. XVM supports plug-ins for two reasons: extensibility,
and alternative implementations of standard APIs. The XVM knows about
security, transactions, database access, memory management, naming
services, etc. Each of these capabilities has a standardized API or set
of APIs. Multiple implementations of an API can co-exist in the XVM. For
example, there might be a database access plug-in for each of Oracle,
MySql, SqlServer, etc. Similarly for security providers, naming
services, etc. So, what I need is the ability to load multiple
packages, all implementing the same API (presumably through a function
pointer table). I would like these packages to be distributed as either
real shared objects (native code) or compressed archives of bytecode. In
either case the package is a single file.

For extensibility, XVM allows new APIs to be added to the machine such
that programs that know about the API can invoke them. That is, it is
possible to create in XPL both an extension (that plugs in to the back
end of the XVM) and a program that uses the extension right up to the
source code level. To make this concrete, suppose we had the need to
extend the base XPL programming language to provide a complete package
for dealing with complex and rational numbers rather than just ints and
floats as is supported in the base language. This would be achieved
with a plug-in to both the XPL compiler (allowing interpretation of the
new fundamental types and their operators) and the XVM backend. The
plug-in would implement the operators on the complex and rational
numbers. The compiler would invoke those operators through the plug-in's
interface.

If statically compiled native code is provided for these plug-ins, the
loader just loads the shared object and probably doesn't attempt any
optimizations. If the analysis results you spoke of could be passed
through to native shared objects then I suppose some optimizations could
be done by the loader but I'm not particularly concerned about that
because presumably the optimizations in the loaded package are
sufficient (i.e. each interface entry point represents a large enough
amount of processing that the cost of optimizing its binding into the
larger program is negligible).

If bytecode is provided, the loader has the opportunity to do "whole
program" optimization and should (optionally) do that.

In both the above cases (native code and bytecode) it is likely that the
entire set of API calls will be invoked by the XVM because, presumably,
the XVM has calls to each of them somewhere (even though the user's
program may not invoke the XVM functionality that would result in some
of the plug-in calls being executed). This means that there is little
to be gained in eliminating external functions in the packages because
at some point they >might< get called. To fully optimize this, what is
needed is a "whole program" optimization that incorporates the end
user's program, with the XVM code, and the plug-in code. And, it needs
to work regardless of whether we're speaking of static executables (e.g.
statically linked XVM), dynamically generated native code (e.g. JIT), or
bytecode. And, it's complicated by the fact that we don't know ahead of
time what plug-ins or packages will get loaded.

The XVM also needs to be able to run in one of four modes, that I've
described before:
     1. Script: goal is to begin executing quickly, do little
        optimization, and get the program execution over with quickly.
        This is for running "quickie" programs where you don't want to
        wait a long time for compilation, optimization, linking or
        loading because the task at hand is very short.
     2. Filter: goal is to be moderately efficient but still begin
        execution quickly. I would expect filters to be optimized in
        bytecode and then executed directly from that byte code.
     3. Client: optimize and compile everything as it is loaded and
        used. The nature of "Client" programs is that they are
        unpredictable in what they will use because they are directed by
        a human.
     4. Server: compile, optimize and link the entire program statically
        using as many optimizations as possible. The program is intended
        to be a long running service.

I think all these cases are covered by existing LLVM functionality but
thought I'd mention it for completeness.

Sorry for being so long winded .. not sure how else to describe the
usage scenario other than just putting it all out there.

Reid.

I wholeheartedly second that motion.

My purposes are a little different, however. The language for which I'm
compiling (XPL) is fairly high level. For example, data structures such
as hash tables and red black trees are simply referenced as "maps" which
map one type to another. What exact data structure is used underneath is
up to the compiler and runtime optimizer, even allowing transformation
of the underlying type at runtime. For example, a map that initially
contains 3 elements would probably just be a vector of pairs because it's
pretty straightforward to linearly scan a small table and it is space
efficient. But, as the map grows in size, it might transform itself into
a sorted vector so binary search can be used and then into a hash table
to reduce the overhead of searching further and then again later on into
a full red-black tree. Of course, all of this depends on whether
insertions and deletions are more frequent than look ups, etc.

The point here is that XPL needs to keep track of what a given variable
represents at the source level. If the compiler sees a map that is
initially small it might represent it in LLVM assembly as a vector of
pairs. Later on, it gets optimized into being a hash table. In order to
do that and keep track of things, I need to know that the vector of
pairs is >intended< to be a map, not simply a vector of pairs.

Another reason to do this is to speed up compilation time. XPL works
similarly to Java in that you define a module and "import" other modules
into it. I do not want to recompile a module each time it is imported.
I'd rather just save the static portion of the syntax tree (i.e. the
declarations) somewhere and load it en masse when it's referenced in
another compilation. Currently, I have a partially implemented solution
for this based on my persistent memory module (like an object database
for C++ that allows you to save graphs of objects onto disk via virtual
memory management tricks). When a module is referenced in an import
statement, its disk segment is located and mapped into memory in one
shot .. no parsing, no linking together, just instantly available. For
large software projects with 1000s of modules, this is a HUGE
compilation time win.

Since finding LLVM, I'm wondering if it wouldn't be better to store all
the AST information in the bytecode file so that I don't have
compilation information in one place and the code for it in another. To
do this, I'd need support from LLVM to put "compile time information"
into a bytecode or assembly file. This information would never be used
at runtime and never "optimized out". It just sits in the bytecode file
taking up space until some compiler (or other tool) asks for it.

I've given some thought to this and here's how I think it should go:

     1. Compile time information is placed in separate section of the
        bytecode file (presumably at the end to reduce runtime I/O)
     2. Nothing in the compile time information is used at runtime. It
        is neither the subject of optimization nor execution.
     3. Compile time information sections are completely optional. A
        given language compiler need not utilize them and they have no
        bearing on correct execution of the program.
     4. Compile time information is loaded only on explicit request
        (presumably by a compiler based on LLVM), though possibly also
        by an optimization pass that would like to understand the
        higher-order semantics better (this would require the pass to
        be language specific, presumably).
     5. Compile time information is defined as a set of global variables
        just the same as for the runtime definitions. The full use of
        LLVM Types (especially derived types like structures and
        pointers) can be used to define the global variables.
     6. There are never any naming conflicts between compile time
        information variables in different modules. Each compile time
        global variable is, effectively, scoped in its module. This
        allows compiler writers to use the same name for various pieces
        of data in every module emitted without clashing.
     7. The exact same facility for dealing with module scoped types and
        variables is used to deal with the compile time information.
        When asked for it, the VMCore would produce a SymbolTable that
        references all the global types and variables in the compile
        time information.
     8. LLVM assembler and bytecode reader will assure the syntactic
        integrity of the compile time information as it would for any
        other bytecode. It checks types, pointer references, etc. and
        emits warnings (errors?) if the compiler information is not
        syntactically valid.
     9. LLVM makes no assertions about the semantics or content of the
        compile time information. It can be anything the compiler writer
        wishes to express to retain compilation information. Correctness
        of the information content (beyond syntactics) is left to the
        compiler writer. Exceptions to this rule may be warranted where
        there is general applicability to multiple source languages.
        Debug (file & line number) info would seem to be a natural
        exception.
    10. Compile time information sections are marked with a name that
        relates to the high-level compiler that produced them. This
        avoids confusion when one language attempts to read the compile
        time information of another language.

This is somewhat like an open ended, generalized ELF section for keeping
track of compiler and/or debug information. Because it's based on
existing capabilities of LLVM, I don't think it would be particularly
difficult to implement either.

Reid.

> The point here is that XPL needs to keep track of what a given variable
> represents at the source level. If the compiler sees a map that is
> initially small it might represent it in LLVM assembly as a vector of
> pairs. Later on, it gets optimized into being a hash table. In order to
> do that and keep track of things, I need to know that the vector of
> pairs is >intended< to be a map, not simply a vector of pairs.

Absolutely. No matter what source language you're interested in, you want
to know about _source_ variables/types/etc, not about LLVM variables,
types, etc.

> Another reason to do this is to speed up compilation time. XPL works
> similarly to Java in that you define a module and "import" other modules
> into it. I do not want to recompile a module each time it is imported.

Makes sense. On the LLVM side of the fence, we are planning on making the
JIT cache native translations, so you only need to pay the translation
cost the first time a function is executed. This also plays into the
'offline compilation' idea as well.

> Since finding LLVM, I'm wondering if it wouldn't be better to store all
> the AST information in the bytecode file so that I don't have
> compilation information in one place and the code for it in another.
> To do this, I'd need support from LLVM to put "compile time information"
> into a bytecode or assembly file. This information would never be used
> at runtime and never "optimized out". It just sits in the bytecode file
> taking up space until some compiler (or other tool) asks for it.

Makes sense. The LLVM bytecode file is packetized to specifically
support these kinds of applications. The bytecode reader can skip over
sections it doesn't understand. The unimplemented part is figuring out a
format to put this into the .ll file (probably just a hex dump or
something), and having the compiler preserve it through optimization.

> 5. Compile time information is defined as a set of global variables
>    just the same as for the runtime definitions. The full use of
>    LLVM Types (especially derived types like structures and
>    pointers) can be used to define the global variables.

If you just want to do this _today_ you already can. We have an
"appending" linkage type which can make this very simple. Basically
global arrays with appending linkage automatically merge together when
bytecode files are linked (just like sections are merged in a traditional
linker). If you want to implement your extra information using globals,
that is no problem, they will just always be loaded and processed.
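
Concretely, something like this (1.x-style syntax, names invented):

    ; module A
    %modinfo = appending global [1 x int] [ int 1 ]

    ; module B
    %modinfo = appending global [1 x int] [ int 2 ]

    ; after llvm-link, the result is as if you had written:
    %modinfo = appending global [2 x int] [ int 1, int 2 ]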

> 6. There are never any naming conflicts between compile time
>    information variables in different modules. Each compile time
>    global variable is, effectively, scoped in its module. This
>    allows compiler writers to use the same name for various pieces
>    of data in every module emitted without clashing.

If you use the appending linkage mechanism, you _want_ them to have the
same name. :)

> 7. The exact same facility for dealing with module scoped types and
>    variables is used to deal with the compile time information.
>    When asked for it, the VMCore would produce a SymbolTable that
>    references all the global types and variables in the compile
>    time information.

If you use globals directly, you can just use the standard stuff.

> 8. LLVM assembler and bytecode reader will assure the syntactic
>    integrity of the compile time information as it would for any
>    other bytecode. It checks types, pointer references, etc. and
>    emits warnings (errors?) if the compiler information is not
>    syntactically valid.

How does it do this if it doesn't understand it? I thought it would just
pass it through unmodified?

> 9. LLVM makes no assertions about the semantics or content of the
>    compile time information. It can be anything the compiler writer
>    wishes to express to retain compilation information. Correctness
>    of the information content (beyond syntactics) is left to the
>    compiler writer. Exceptions to this rule may be warranted where

This seems to contradict #8.

>    there is general applicability to multiple source languages.
>    Debug (file & line number) info would seem to be a natural
>    exception.

Note that debug information doesn't work with this model. In particular,
when the LLVM optimizer transmogrifies the code, it has to update the
debug information to remain accurate. This requires understanding (at
some level) the debug format.

> 10. Compile time information sections are marked with a name that
>     relates to the high-level compiler that produced them. This
>     avoids confusion when one language attempts to read the compile
>     time information of another language.

> This is somewhat like an open ended, generalized ELF section for keeping
> track of compiler and/or debug information. Because it's based on
> existing capabilities of LLVM, I don't think it would be particularly
> difficult to implement either.

There are two ways to implement this, as described above:
  1. Use global arrays of bytes or something. If you want to, your arrays
     can even have pointers to global variables and functions in them.
  2. Use an untyped blob of data, attached to the .bc file.

#2 is better from the efficiency standpoint (it doesn't need to be loaded
if not used), but #1 is already fully implemented (it is used to implement
global ctor/dtors)...
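
For reference, #1 is how static constructors are handled; from memory it looks roughly like this (the exact struct type may differ):

    %llvm.global_ctors = appending global [1 x { int, void ()* }]
                         [ { int, void ()* } { int 65535, void ()* %init } ]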

-Chris

> > The point here is that XPL needs to keep track of what a given variable
> > represents at the source level. If the compiler sees a map that is
> > initially small it might represent it in LLVM assembly as a vector of
> > pairs. Later on, it gets optimized into being a hash table. In order to
> > do that and keep track of things, I need to know that the vector of
> > pairs is >intended< to be a map, not simply a vector of pairs.

> Absolutely. No matter what source language you're interested in, you want
> to know about _source_ variables/types/etc, not about LLVM variables,
> types, etc.

Right.

> > Another reason to do this is to speed up compilation time. XPL works
> > similarly to Java in that you define a module and "import" other modules
> > into it. I do not want to recompile a module each time it is imported.

> Makes sense. On the LLVM side of the fence, we are planning on making the
> JIT cache native translations, so you only need to pay the translation
> cost the first time a function is executed. This also plays into the
> 'offline compilation' idea as well.

I had assumed as much but I think I'm talking about something different.
When I said "I do not want to recompile a module each time it is
imported", I meant recompile in order to get the _source_ language
descriptions only. I wouldn't recompile to get the byte codes to be
executed because (presumably) those are already available as you noted.
For example, if module A imports module B, I want to be able to just
instantaneously load from B the definitions of types, constants, global
variables and functions, as specified in the _source_ language without
going back to the _source_ and recompiling it to regenerate the
information. If we were in the C/C++ world, this would be more akin to
header file pre-compilation. I want to load the _source_ AST for a
given compiler very quickly, without revisiting the source code itself.

> > Since finding LLVM, I'm wondering if it wouldn't be better to store all
> > the AST information in the bytecode file so that I don't have
> > compilation information in one place and the code for it in another.
> > To do this, I'd need support from LLVM to put "compile time information"
> > into a bytecode or assembly file. This information would never be used
> > at runtime and never "optimized out". It just sits in the bytecode file
> > taking up space until some compiler (or other tool) asks for it.

> Makes sense. The LLVM bytecode file is packetized to specifically
> support these kinds of applications. The bytecode reader can skip over
> sections it doesn't understand. The unimplemented part is figuring out a
> format to put this into the .ll file (probably just a hex dump or
> something), and having the compiler preserve it through optimization.

Sort of. What I'm thinking of is a section that it normally skips over
(or, even better, never reaches because it's at the end). However, the
contents of that section would be interpretable by LLVM if someone asked
for it. That is, the contents of the section contain constant type and
variable definitions that are _not_ part of the executable program but
are the _source_ description for the program. Those source descriptions
are specified using regular LLVM Type and variable definitions but they
don't factor into the program at all. When a bytecode file is loaded,
anything defined in such a section is just skipped over. When a compiler
or debugger asks for that section explicitly (the only way it gets
accessed), LLVM would interpret the bytecodes and give back an instance
of SymbolTable that only references Value and Type objects. These are
the types and values that the compiler writer emitted to describe the
_source_ and their semantics are up to the source compiler writer.

> > 5. Compile time information is defined as a set of global variables
> >    just the same as for the runtime definitions. The full use of
> >    LLVM Types (especially derived types like structures and
> >    pointers) can be used to define the global variables.
>
> If you just want to do this _today_ you already can. We have an
> "appending" linkage type which can make this very simple. Basically
> global arrays with appending linkage automatically merge together when
> bytecode files are linked (just like sections are merged in a traditional
> linker). If you want to implement your extra information using globals,
> that is no problem, they will just always be loaded and processed.

No. These _source_ descriptions are not to be loaded and processed ever
except by explicit instruction from a compiler or debugger. For normal
program execution they are always ignored. Furthermore, they must NOT be
merged unless you just mean concatenated into one big "source
description" segment. I don't see much utility in that myself. If by
merged you mean that commonly named global symbols are reduced to a
single copy (like linkonce), then this defeats the point. What if a
compiler wanted to emit a variable named "ModuleOptions" in each
translation unit that describes the _source_ compiler options used to
compile the module. If those all get merged away, you lose the ability
to distinguish different "ModuleOptions" for different modules. This is
the reason for point #6.

> > 6. There are never any naming conflicts between compile time
> >    information variables in different modules. Each compile time
> >    global variable is, effectively, scoped in its module. This
> >    allows compiler writers to use the same name for various pieces
> >    of data in every module emitted without clashing.
>
> If you use the appending linkage mechanism, you _want_ them to have the
> same name. :)

No, you don't for the reason described above. Is there a way to retain
the unique identity of each of the variables when using appending
linkage?

> > 7. The exact same facility for dealing with module scoped types and
> >    variables is used to deal with the compile time information.
> >    When asked for it, the VMCore would produce a SymbolTable that
> >    references all the global types and variables in the compile
> >    time information.
>
> If you use globals directly, you can just use the standard stuff.

Perhaps, I'm unsure of the details but you'd need to somehow mark these
globals as "not part of the program, never execute, ignore on load,
fetch only if requested".

> > 8. LLVM assembler and bytecode reader will assure the syntactic
> >    integrity of the compile time information as it would for any
> >    other bytecode. It checks types, pointer references, etc. and
> >    emits warnings (errors?) if the compiler information is not
> >    syntactically valid.
>
> How does it do this if it doesn't understand it? I thought it would just
> pass it through unmodified?

Read my statement carefully. I said "syntactic integrity" not semantics.
LLVM would ensure that, within the compile time information (i.e. source
description) there are (a) no references to undefined types, (b) no
pointers to undefined symbols, (c) etc. These are all syntactic
constructs that can be checked by LLVM without ever really understanding
what the information in the compile time information actually _means_.
That interpretation is left to the compiler writer. This just gives the
compiler writer some assurance that the content of the compile time
information at least makes some structural sense. Furthermore, this
information, even though it may represent a very complex data structure,
is treated as a big constant. There can be no variable parts (despite me
referencing this as "global variables" previously). There might, however
be relocatable parts such as a reference to an actual function or global
variable.

> > 9. LLVM makes no assertions about the semantics or content of the
> >    compile time information. It can be anything the compiler writer
> >    wishes to express to retain compilation information. Correctness
> >    of the information content (beyond syntactics) is left to the
> >    compiler writer. Exceptions to this rule may be warranted where
>
> This seems to contradict #8.

Not really. You don't want LLVM to specify to _source_ language compiler
writers what is and isn't valid semantically. In fact, you'd have a
really hard time doing so. You'd end up with (conceptually) something
like the GCC "tree" mess, trying to be all things to everyone. Why
bother? Leave that to the compiler writer. You only want LLVM to check
syntax/structure/referential integrity, etc.

> >    there is general applicability to multiple source languages.
> >    Debug (file & line number) info would seem to be a natural
> >    exception.
>
> Note that debug information doesn't work with this model. In particular,
> when the LLVM optimizer transmogrifies the code, it has to update the
> debug information to remain accurate. This requires understanding (at
> some level) the debug format.

You're right. Debug information needs to be more closely aligned with
the actual code in order for it to survive transformation. In fact, this
raises some suspicions about the viability of my approach in general. If
the source description information contains references to a function
that gets eliminated because it's never called, what happens? Same thing
for types and variables at both global and function scope.

I'm off to do some serious thinking about this proposal :(

> > 10. Compile time information sections are marked with a name that
> >     relates to the high-level compiler that produced them. This
> >     avoids confusion when one language attempts to read the compile
> >     time information of another language.
> >
> > This is somewhat like an open ended, generalized ELF section for keeping
> > track of compiler and/or debug information. Because it's based on
> > existing capabilities of LLVM, I don't think it would be particularly
> > difficult to implement either.

> There are two ways to implement this, as described above:
>   1. Use global arrays of bytes or something. If you want to, your arrays
>      can even have pointers to global variables and functions in them.
>   2. Use an untyped blob of data, attached to the .bc file.
>
> #2 is better from the efficiency standpoint (it doesn't need to be loaded
> if not used), but #1 is already fully implemented (it is used to implement
> global ctor/dtors)...

I don't think #1 works because of the naming clash issue and because it
implies that these global arrays become part of the program. I
explicitly want to forbid that because (at least in the case of XPL), I
can imagine situations where the source description information is more
voluminous than the actual program by an order of magnitude (it's that
way with debug "symbol" information today).

What I want to do is emit the same named global variable (your "arrays
of bytes or something") in each module to capture information about that
module. For example, I want to emit a global array of structures that
describes the types defined in the module. I want to call that global
array "Types". If I do that in every module, what happens? I get a link
time "duplicate symbol definition" error? If I use appending linkage, I
only get one of them? This is a disaster for this type of information.
And, the name must remain constant across modules so that I can say,
"load the compile time information for module X" and then "get variable
"Types" from that compile time information. I can then peruse the type
information for that module. If I have to mangle the name in each
module, that's a little unfriendly and error prone. Furthermore, I do
NOT want this information to be part of the program. It isn't, it
describes the program.

As such, your point #2 must be accommodated. The blob of data is
normally skipped when the program is executed. But, when it is
requested, that blob of data isn't just returned to the compiler as a
blob. Because it represents a constant graph of types and values, LLVM
first checks its integrity, then instantiates the necessary C++ objects
to represent it and places them into a symbol table which is returned to
the compiler. This means the compiler can quickly look up source
descriptions in that module.

If that approach is too cumbersome for LLVM, then I would vote for just
the "blob" thing and leave it to each compiler writer to interpret the
blob correctly.

Make sense?


Reid.

Chris,

I've done a little more thinking about this (putting source definitions
into bytecode/assembler files). Despite my previous assertions to the
contrary, using the LLVM type and constant global declarations in the
source definitions section is a little limiting because (a) it limits
the choices for source compiler writers, (b) it might imply a larger
amount of information than would otherwise be possible, and (c) it
implies a contract between LLVM and source compiler writers that LLVM
shouldn't have to support.

So, here's what I suggest:

     1. LLVM supports a named section that contains a BLOB. LLVM doesn't
        care about the contents of the BLOB but will assist in its
        maintenance (see below). To LLVM its a name and a chunk of data.
     2. The name of the section is to allow information from different
        compilers to be distinguished.
     3. LLVM provides a C++ class that represents the named blob. I'll
        call this class "ExtraInfo".
     4. Source compiler writers can subclass ExtraInfo to their heart's
        content.
     5. The ExtraInfo class supports pure virtual methods that are
        invoked by LLVM to notify its subclass(es) when an optimization
        causes a function, type or global variable to be deleted. No
        other notifications should be necessary.
     6. During compilation, any ExtraInfo subclasses created by the
        source compiler are attached to the Module object and the
        maintenance provided in 5 is invoked automatically as
        optimizations occur.

Does this sound reasonable?
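
To make the shape concrete, a rough C++ sketch; the names follow the list above, and every signature here is a guess, not existing LLVM API:

    #include <string>

    namespace llvm {
      class Function;
      class GlobalVariable;
      class Type;

      // A named blob attached to a Module; LLVM carries it around without
      // interpreting its contents (points 1-3).
      class ExtraInfo {
        std::string SectionName;  // identifies the producing compiler (point 2)
      public:
        ExtraInfo(const std::string &N) : SectionName(N) {}
        virtual ~ExtraInfo() {}

        // Maintenance hooks (point 5): called when an optimization deletes
        // an object, so subclasses can drop records that refer to it.
        virtual void functionDeleted(Function *F) = 0;
        virtual void globalDeleted(GlobalVariable *GV) = 0;
        virtual void typeDeleted(const Type *T) = 0;
      };
    }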

Reid.

> header file pre-compilation. I want to load the _source_ AST for a
> given compiler very quickly, without revisiting the source code itself.

Gotcha.

> Sort of. What I'm thinking of is a section that it normally skips over
> (or, even better, never reaches because it's at the end). However, the
> contents of that section would be interpretable by LLVM if someone asked
> for it. That is, the contents of the section contain constant type and
> variable definitions that are _not_ part of the executable program but
> are the _source_ description for the program. Those source descriptions
> are specified using regular LLVM Type and variable definitions but they
> don't factor into the program at all. When a bytecode file is loaded,
> anything defined in such a section is just skipped over. When a compiler

Ok, this is all cool.

> or debugger asks for that section explicitly (the only way it gets
> accessed), LLVM would interpret the bytecodes and give back an instance
> of SymbolTable that only references Value and Type objects. These are
> the types and values that the compiler writer emitted to describe the
> _source_ and their semantics are up to the source compiler writer.

This isn't. I don't understand exactly what you're talking about here.
What "Value" and "type" objects can there be if LLVM doesn't understand
it? It seems to make more sense to me for the debugger or whatever to ask
for a named section, and get handed an _untyped block_ of binary data...

> > If you just want to do this _today_ you already can. We have an
> > "appending" linkage type which can make this very simple. Basically
> > global arrays with appending linkage automatically merge together when
> > bytecode files are linked (just like sections are merged in a traditional
> > linker). If you want to implement your extra information using globals,
> > that is no problem, they will just always be loaded and processed.

> No. These _source_ descriptions are not to be loaded and processed ever
> except by explicit instruction from a compiler or debugger. For normal

Okay...

> program execution they are always ignored. Furthermore, they must NOT be
> merged unless you just mean concatenated into one big "source
> description" segment. I don't see much utility in that myself. If by

That's what I meant. Assuming LLVM doesn't understand the contents of it,
all it can do is concatenate.

> merged you mean that commonly named global symbols are reduced to a
> single copy (like linkonce), then this defeats the point. What if a

I did mean appended.

> compiler wanted to emit a variable named "ModuleOptions" in each
> translation unit that describes the _source_ compiler options used to
> compile the module. If those all get merged away, you lose the ability
> to distinguish different "ModuleOptions" for different modules. This is
> the reason for point #6.

I understand.

> > 6. There are never any naming conflicts between compile time
> > information variables in different modules. Each compile time
> > global variable is, effectively, scoped in its module. This
> > allows compiler writers to use the same name for various pieces
> > of data in every module emitted without clashing.
>
> > If you use the appending linkage mechanism, you _want_ them to have the
> > same name. :)
>
> No, you don't for the reason described above. Is there a way to retain
> the unique identity of each of the variables when using appending
> linkage?

In the example above, the idea is that you would specify a binary blob of
data put into an LLVM global constant array of bytes. The LLVM linker
would concatenate these arrays of bytes without having any idea how to
interpret the bytes. It would be up to your compiler to be able to
interpret the meaning of the bytes and to be able to determine the
'identity of the variables' given the raw data.
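
One way to keep that identity recoverable (a sketch; the record layout is invented): tag each module's records with an id of its own choosing, so the compiler can still tell entries apart after concatenation:

    ; module A tags its records with id 1, module B with id 2, etc.
    %xplinfo = appending global [1 x { int, int }]
               [ { int, int } { int 1, int 42 } ]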

> > 7. The exact same facility for dealing with module scoped types and
> > variables is used to deal with the compile time information.
> > When asked for it, the VMCore would produce a SymbolTable that
> > references all the global types and variables in the compile
> > time information.
>
> > If you use globals directly, you can just use the standard stuff.

> Perhaps, I'm unsure of the details but you'd need to somehow mark these
> globals as "not part of the program, never execute, ignore on load,
> fetch only if requested".

It would be straight-forward to make the JIT materialize globals only when
they are referenced.

> > 8. LLVM assembler and bytecode reader will assure the syntactic
> > integrity of the compile time information as it would for any
> > other bytecode. It checks types, pointer references, etc. and
> > emits warnings (errors?) if the compiler information is not
> > syntactically valid.
>
> > How does it do this if it doesn't understand it? I thought it would just
> > pass it through unmodified?

> Read my statement carefully. I said "syntactic integrity" not semantics.
> LLVM would ensure that, within the compile time information (i.e. source
> description) there are (a) no references to undefined types, (b) no
> pointers to undefined symbols, (c) etc. These are all syntactic
> constructs that can be checked by LLVM without ever really understanding
> what the information in the compile time information actually _means_.
> That interpretation is left to the compiler writer. This just gives the

So you mean it checks the LLVM types and LLVM variables? I'm so confused,
I thought you were talking about source level stuff! :)

> compiler writer some assurance that the content of the compile time
> information at least makes some structural sense. Furthermore, this
> information, even though it may represent a very complex data structure,
> is treated as a big constant. There can be no variable parts (despite me
> referencing this as "global variables" previously). There might, however,
> be relocatable parts such as a reference to an actual function or global
> variable.

Ok, that is making more sense. Yes, LLVM already supports this.

> > 9. LLVM makes no assertions about the semantics or content of the
> > compile time information. It can be anything the compiler writer
> > wishes to express to retain compilation information. Correctness
> > of the information content (beyond syntactics) is left to the
> > compiler writer. Exceptions to this rule may be warranted where
>
> > This seems to contradict #8.

> Not really. You don't want LLVM to specify to _source_ language compiler
> writers what is and isn't valid semantically. In fact, you'd have a
> really hard time doing so. You'd end up with (conceptually) something
> like the GCC "tree" mess, trying to be all things to everyone. Why
> bother? Leave that to the compiler writer. You only want LLVM to check
> syntax/structure/referential integrity, etc.

Ok, I didn't understand what you meant by LLVM checking the structure but
not understanding the semantics. You don't mean the structure _of the
data itself_, just that the LLVM view of it is ok.

> > there is general applicability to multiple source languages.
> > Debug (file & line number) info would seem to be a natural
> > exception.
>
> Note that debug information doesn't work with this model. In particular,
> when the LLVM optimizer transmogrifies the code, it has to update the
> debug information to remain accurate. This requires understanding (at
> some level) the debug format.

You're right. Debug information needs to be more closely aligned with
the actual code in order for it to survive transformation. In fact, this
raises some suspicions about the viability of my approach in general. If
the source description information contains references to a function
that gets eliminated because it's never called, what happens? Same thing
for types and variables at both global and function scope.

If a global has a pointer to a function, that function will never be
eliminated. Likewise, things like interprocedural constant propagation
(leading to the deletion of arguments) will never happen.
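
For instance (a sketch in LLVM assembly, modern syntax, invented names):
once a function's address is stored in an externally visible global, the
optimizer has to assume unknown external callers.

    ; @helper's address escapes through @keep, so even though @helper is
    ; internal, code outside this module could load @keep and call it.
    ; The optimizer can therefore neither delete @helper nor change its
    ; signature.
    define internal i32 @helper(i32 %x) {
      ret i32 %x
    }

    @keep = constant i32 (i32)* @helper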

> There are two ways to implement this, as described above:
> 1. Use global arrays of bytes or something. If you want to, your arrays
> can even have pointers to globals variables and functions in them.
> 2. Use an untyped blob of data, attached to the .bc file.
>
> #2 is better from the efficiency standpoint (it doesn't need to be loaded
> if not used), but #1 is already fully implemented (it is used to implement
> global ctor/dtors)...

I don't think #1 works because of the naming clash issue and because it
implies that these global arrays become part of the program. I
explicitly want to forbid that because (at least in the case of XPL), I
can imagine situations where the source description information is more
voluminous than the actual program by an order of magnitude (it's that
way with debug "symbol" information today).

I understand exactly what you're saying. Debug information in general has
this problem. It's a very reasonable, and general, performance
optimization for the JIT to never materialize globals it doesn't need, so
this in and of itself isn't hard. The hard part is that if you have
"external" pointers into the LLVM code, that those pointers will be
invalidated very quickly by general transformations. Presumably you don't
want to handcuff the optimizer too much.

What I want to do is emit the same named global variable (your "arrays
of bytes or something") in each module to capture information about that
module. For example, I want to emit a global array of structures that
describes the types defined in the module. I want to call that global
array "Types". If I do that in every module, what happens? I get a link
time "duplicate symbol definition" error?

Yes.

If I use appending linkage, I only get one of them?

No. The elements of the array will be concatenated together, as described
in:
http://llvm.cs.uiuc.edu/docs/LangRef.html#modulestructure
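
Concretely (modern LLVM assembly syntax for illustration):

    ; module1.bc
    @Types = appending global [2 x i8] c"\01\02"

    ; module2.bc
    @Types = appending global [1 x i8] c"\03"

    ; After llvm-link, the linked module contains the concatenation:
    ;   @Types = appending global [3 x i8] c"\01\02\03"
    ; Nothing marks where module1's entries end and module2's begin.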

This is a disaster for this type of information. And, the name must
remain constant across modules so that I can say, "load the compile time
information for module X" and then "get variable 'Types' from that
compile time information". I can then peruse the type information for
that module. If I have to mangle the name in each module, that's a
little unfriendly and error prone. Furthermore, I do NOT want this
information to be part of the program. It isn't, it describes the
program.

I understand. This is exactly what appending linkage is for.

If that approach is too cumbersome for LLVM, then I would vote for just
the "blob" thing and leave it to each compiler writer to interpret the
blob correctly.

This can certainly be done, but the problem is that random blobs on the
side will not be updated, and will be invalidated.

It seems to me that you're trying to address a problem semantically
equivalent to debug information, which I _want to directly address_, but
there are other more important things that need to be done first, as
prerequisites. It is critically important to me to make the LLVM
transformations _implicitly_ update debug information as they do their
thing, without being aware of it. Just like the symbol table is
implicitly always kept up-to-date.

Of course, doing this is not easy. ;)

-Chris

I've done a little more thinking about this (putting source definitions
into bytecode/assembler files). Despite my previous assertions to the
contrary, using the LLVM type and constant global declarations in the
source definitions section is a little limiting because (a) it limits
the choices for source compiler writers, (b) it might imply a larger
amount of information than would otherwise be necessary, and (c) it
implies a contract between LLVM and source compiler writers that LLVM
shouldn't have to support.

Again, reiterating from my more complete response in the previous mail,
I'm still thinking it is "too early" to support this. :)

So, here's what I suggest:

     1. LLVM supports a named section that contains a BLOB. LLVM doesn't
        care about the contents of the BLOB but will assist in its
        maintenance (see below). To LLVM, it's a name and a chunk of data.
     2. The name of the section is to allow information from different
        compilers to be distinguished.
     3. LLVM provides a C++ class that represents the named blob. I'll
        call this class "ExtraInfo".
     4. Source compiler writers can subclass ExtraInfo to their heart's
        content.
     6. During compilation, any ExtraInfo subclasses created by the
        source compiler are attached to the Module object and the
        maintenance provided in 5 is invoked automatically as
        optimizations occur.

All of this is trivially implementable...

     5. The ExtraInfo class supports pure virtual methods that are
        invoked by LLVM to notify its subclass(es) when an optimization
        causes a function, type or global variable to be deleted. No
        other notifications should be necessary.

This sounds _extremely_ limited. What if the optimizer deletes arguments
to functions? Presumably your information will have to be updated, right?
What if it specializes the function because every caller has some
property?

-Chris

Yeah, you're right. I thought about it some more after I posted and it
boils down to ExtraInfo actually needing to be notified about EVERY
change. Not fun.

I've managed to get myself completely confused on this subject, but have
just shed some "old think" from previous compilers. The "old think" is
that the source level symbol stuff must be completely segregated from
the program itself (e.g. ELF debug sections).

But why? Why shouldn't the source level information be coded right into
the bytecode as part of the program? Why shouldn't it undergo
optimization? The only thing that I would need is some kind of
linkage class or flag that says "never, ever delete this". If that was
the case, I could use the full expression of LLVM assembly language to
describe my source level information. No?

This being the case, I could emit a function in each module that returns
the source level information. To look at it, I just JIT load the module
and call the function. Any barriers to doing this?

Reid.

Yeah, you're right. I thought about it some more after I posted and it
boils down to ExtraInfo actually needing to be notified about EVERY
change. Not fun.

Yup.

I've managed to get myself completely confused on this subject, but have
just shed some "old think" from previous compilers. The "old think" is
that the source level symbol stuff must be completely segregated from
the program itself (e.g. ELF debug sections).

Heh, LLVM requires a little bit of that. :)

But why? Why shouldn't the source level information be coded right into
the bytecode as part of the program? Why shouldn't it undergo
optimization? The only thing that I would need is some kind of
linkage class or flag that says "never, ever delete this".

We already have "never, ever delete this" flags. "weak" and external
linkage both guarantee that. If there can be an external caller of some
function, for example, the optimizer CANNOT delete it, nor can it change
its interface.

If that was the case, I could use the full expression of LLVM assembly
language to describe my source level information. No?

I'm not sure exactly what you mean here, but yes, in principle, you should
be able to do exactly that. You are limited to the LLVM type system and
such, but if that is sufficient, yes.

This being the case, I could emit a function in each module that returns
the source level information. To look at it, I just JIT load the module
and call the function. Any barriers to doing this?

None at all!

-Chris

We already have "never, ever delete this" flags. "weak" and external
linkage both guarantee that. If there can be an external caller of some
function, for example, the optimizer CANNOT delete it, nor can it change
its interface.

Interesting, I wouldn't have assumed that given LLVM's support for IPO.
If you find that a function is never used when doing final linking of a
program, why not delete it? Just curious... and off topic...

> If that was the case, I could use the full expression of LLVM assembly
> language to describe my source level information. No?

I'm not sure exactly what you mean here, but yes, in principle, you should
be able to do exactly that. You are limited to the LLVM type system and
such, but if that is sufficient, yes.

I just mean that I could use LLVM types to define the structure of my
source level information. There actually isn't much that's needed. Some
tables, a few bit masks, a couple structures to group things together.
So, LLVM types are fine.
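
For instance, something along these lines (a hypothetical sketch in LLVM
assembly; every name and field here is invented):

    ; One table entry describing a source-level type: a name plus a bit
    ; mask of attribute flags.
    %TypeDesc = type { i8*, i32 }

    ; Per-module source info: a version word, an entry count, and a
    ; variable-length table of type descriptors.
    %ModuleInfo = type { i32, i32, [0 x %TypeDesc] }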

> This being the case, I could emit a function in each module that returns
> the source level information. To look at it, I just JIT load the module
> and call the function. Any barriers to doing this?

None at all!

Not so fast! I have a couple issues I need your clarification on:

      * The source level information could reference a function (in
        fact, it would reference all of them!). Could this reference to
        the function thwart optimization? Same for global variables,
        types, etc.
      * If I wanted to create a function named "GetModuleInfo" in every
        module, wouldn't that cause link time symbol redefinition and
        the resulting errors if it had "externally visible" linkage? If
        I use weak linkage I avoid that problem, but then can I call the
        function directly after JIT loading? Isn't it "internal" to the
        module?

(P.S. I read the entire discussion on "GCC tree linkage types" on the
GCC list and got myself totally confused about what is and isn't
"linkonce", etc. Had to go back to the LLVM Language Reference to
straighten myself out again :)

> We already have "never, ever delete this" flags. "weak" and external
> linkage both guarantee that. If there can be an external caller of some
> function, for example, the optimizer CANNOT delete it, nor can it change
> its interface.

Interesting, I wouldn't have assumed that given LLVM's support for IPO.
If you find that a function is never used when doing final linking of a
program, why not delete it? Just curious... and off topic...

Because it's not safe and could change the semantics of the program. What
if you load a shared object that calls into the main program again: it
would break.

In practice the way we handle this is that we have an "internalize" pass
which marks all functions except main as 'internal'. This lets the IPO
machinery do all of the things you would expect to the program, including
breaking it if it works like the above. If this breaks a program, the
user can specify -disable-internalize on the linker command line.
Alternatively, they can specify a link-map to indicate exactly which
symbols should be exported. FWIW, the LLVM runtime libraries use
link-maps extensively.

> > If that was the case, I could use the full expression of LLVM assembly
> > language to describe my source level information. No?
>
> I'm not sure exactly what you mean here, but yes, in principle, you should
> be able to do exactly that. You are limited to the LLVM type system and
> such, but if that is sufficient, yes.

I just mean that I could use LLVM types to define the structure of my
source level information. There actually isn't much that's needed. Some
tables, a few bit masks, a couple structures to group things together.
So, LLVM types are fine.

Yeah, ok. Makes sense.

> > This being the case, I could emit a function in each module that returns
> > the source level information. To look at it, I just JIT load the module
> > and call the function. Any barriers to doing this?
>
> None at all!

Not so fast! I have a couple issues I need your clarification on:

      * The source level information could reference a function (in
        fact, it would reference all of them!). Could this reference to
        the function thwart optimization? Same for global variables,
        types, etc.

Yes. If you return a pointer to a function, for example, the optimizer
will have to assume that something outside of the module could call it...
If you represent the debug information as function bodies, also expect
that the body will be optimized as well.

      * If I wanted to create a function named "GetModuleInfo" in every
        module, wouldn't that cause link time symbol redefinition and
        the resulting errors if it had "externally visible" linkage? If
        I use weak linkage I avoid that problem, but then can I call the
        function directly after JIT loading? Isn't it "internal" to the
        module?

There is no good way to do this. The best way is to do what we do for
static ctor/dtors: each ctor results in a new internal function being
generated into the .bc file. Additionally, a pointer to this function is
added to an array with appending linkage. The end effect is that you get
"appending" semantics for functions.

(P.S. I read the entire discussion on "GCC tree linkage types" on the
GCC list and got myself totally confused about what is and isn't
"linkonce", etc. Had to go back to the LLVM Language Reference to
straighten myself out again :)

Heh, I'm still trying to get the mapping of GCC types to LLVM types
perfect. :)

-Chris

>
> * The source level information could reference a function (in
> fact, it would reference all of them!). Could this reference to
> the function thwart optimization? Same for global variables,
> types, etc.

Yes. If you return a pointer to a function, for example, the optimizer
will have to assume that something outside of the module could call it...
If you represent the debug information as function bodies, also expect
that the body will be optimized as well.

That's fine. I probably wouldn't return a pointer to function but a
pointer to structure that contains the information. Getting the function
that builds that structure optimized is just a bonus :)

> * If I wanted to create a function named "GetModuleInfo" in every
> module, wouldn't that cause link time symbol redefinition and
> the resulting errors if it had "externally visible" linkage? If
> I use weak linkage I avoid that problem, but then can I call the
> function directly after JIT loading? Isn't it "internal" to the
> module?

There is no good way to do this. The best way is to do what we do for
static ctor/dtors: each ctor results in a new internal function being
generated into the .bc file. Additionally, a pointer to this function is
added to an array with appending linkage. The end effect is that you get
"appending" semantics for functions.

Yeah, I could use the same approach. A single array with appending
linkage would be all I needed to access every module's information. This
would work out well.

Thanks for your help on this. I think I can now see clearly a way to
retain my source level information directly in the bytecode.