LLVM Archive Format Extension Proposal

AMD would like to add new functionality to ranlib (and later ar and nm) and to the bits of LLVM Core that read (and later write) archives.
Herewith a terse summary of the change, with which we want to improve support of OpenCL for multiple GPUs in a single runtime.

Conceptually, a serialized archive is really two pieces: a few header members and a set of normal file members. There are no constraints on the normal members in the ‘pure’ archive format. They could be text files, pictures, or, as we’re all familiar with, object modules. Most object file archives are “libraries”, and they have a special header member that is a global symbol table, associating global-scope names with the defining object module members in the archive body.
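For readers unfamiliar with the layout being described: a classic Unix ar archive is an 8-byte "!<arch>\n" signature followed by members, each with a 60-byte fixed-width ASCII header. The sketch below is illustrative only (not LLVM's actual reader); it builds a tiny archive in memory and walks its member headers:

```python
AR_MAGIC = b"!<arch>\n"

def make_member(name, data):
    # 60-byte header: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) + "`\n"
    hdr = (name.ljust(16) + "0".ljust(12) + "0".ljust(6) + "0".ljust(6)
           + "644".ljust(8) + str(len(data)).ljust(10)).encode("ascii") + b"`\n"
    pad = b"\n" if len(data) % 2 else b""   # members are 2-byte aligned
    return hdr + data + pad

def list_members(buf):
    assert buf[:8] == AR_MAGIC, "not an archive"
    members, pos = [], 8
    while pos < len(buf):
        hdr = buf[pos:pos + 60]
        name = hdr[:16].decode().rstrip()
        size = int(hdr[48:58])
        members.append((name, size))
        pos += 60 + size + (size % 2)       # skip data plus alignment padding
    return members

archive = AR_MAGIC + make_member("hello.o", b"\x01\x02\x03") \
                   + make_member("world.o", b"\x04\x05")
assert list_members(archive) == [("hello.o", 3), ("world.o", 2)]
```

A library's symbol table is simply the first such member with a reserved name ("/" in GNU archives), which is what makes the "special header members" described above possible.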

We have N very large archives, defining essentially the same set of symbols. Many of the normal file members of each are duplicated in other archives, but not all. The goal is to produce a single “super-archive” that contains one copy of each unique object file member, no matter how many archives it is part of, and N symbol table members representing each of the original N archives.

The symbol table for each original archive can properly index to the relevant members in the archive, even if other members in the super-archive (not referenced in this particular symbol table, of course) define the same symbols.

I’ve considered 3 approaches to the problem so far. All involve a new archive member type.

First, a new archive member type “up front” that describes each of the original archives and its symbol table.
Second, a normal/default symbol table member “up front” and a new archive member type that describes alternate symbol tables contained in the archive.
Third, a “hiding” archive member type that is essentially a way to “skip over” additional normal archive file headers to reach the first normal member, which (in all approaches) all archives share.

The third, I think, requires the fewest changes to the existing implementation, so I’m leaning towards it. The “hiding” archive member would have the “file name” of the represented archive immediately following the member header, followed by a completely normal archive representation starting with “!<arch>\n” and optionally including an additional “hiding” archive member covering even more hidden archives.
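A toy model of the third approach may help. Everything here is an illustration of the proposal, not an implemented format: the special member name and the fixed 16-byte name field inside the payload are assumptions made for the sketch (a tentative special name is floated later in this thread).

```python
AR_MAGIC = b"!<arch>\n"
HIDING_NAME = "#_MULTI_SYMTAB_#"            # assumed/tentative special name

def member(name, payload):
    hdr = (name.ljust(16) + "0".ljust(12) + "0".ljust(6) + "0".ljust(6)
           + "644".ljust(8) + str(len(payload)).ljust(10)).encode("ascii") + b"`\n"
    return hdr + payload + (b"\n" if len(payload) % 2 else b"")

def hiding_member(covered_name, inner_archive):
    # payload: the represented archive's name, then a complete normal archive
    return member(HIDING_NAME, covered_name.encode().ljust(16) + inner_archive)

def select(super_archive, wanted):
    """Peel hiding members until the archive named `wanted` is reached."""
    buf = super_archive
    while True:
        assert buf[:8] == AR_MAGIC
        hdr = buf[8:68]
        if hdr[:16].decode().rstrip() != HIDING_NAME:
            return None                     # no more hidden archives
        size = int(hdr[48:58])
        payload = buf[68:68 + size]
        covered, inner = payload[:16].decode().rstrip(), payload[16:]
        if covered == wanted:
            return inner                    # a complete, ordinary archive
        buf = inner                         # descend to the next hidden level

lib_c = AR_MAGIC + member("shared.o", b"\x2a")
lib_b = AR_MAGIC + hiding_member("libC.a", lib_c)
super_archive = AR_MAGIC + hiding_member("libB.a", lib_b)
assert select(super_archive, "libC.a") == lib_c
```

The key property of the scheme is visible here: a reader that does not know about the hiding member still sees a structurally valid archive, while an aware reader can skip inward level by level.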

The plan is to extend the Archive class to provide a way for clients to select a desired archive. I will also enhance ranlib to accept multiple archive names on the command line and produce the “super-archive”.

A further need we have is to serialize the TOCs and the super-archive in a memory image (our archives are embedded in our DLL/SO, not stored separately on disk) and then provide an interface to the relevant LLVM classes (Linker, primarily) for accessing archives in memory rather than on disk, a feature absent from the current implementation.

For our purposes, extending the Archive class to support specification of the archive using a memory object instead of a file, recognizing the “hiding” member type, and extending ranlib to produce the new super archives is all we really need.

Any thoughts or suggestions would be welcome.

Thanks,
Richard


Let me see if I understand your need here. You are dynamically generating code and need to link it. The linking step requires some support routines, which it makes sense to have in a static library. Since this must work on machines not set up with developer tools, you are packaging the static library inside a DLL/DSO. In addition, with all the minor variations of GPUs, having a separate archive for every GPU type would be too large, so you need some way to remove duplicates of support functions.

If the above summary is close, then here are two other ideas that avoid the need for archive/TOC changes:

  1. Have lots of little archives, which removes duplicates. Give each archive a unique name, then have a lookup table which lists which sequence of archives to use for which specific GPU. When linking for a particular GPU, you pass the linker that particular sequence of little archives to search.
  2. Have one giant archive per GPU family and use a name-mangling scheme to filter. For instance, the compiler emits references to support routine “foo” as “foo$gpu1”. Then you construct the support libraries with aliases for each support function. So a particular implementation of foo in an archive may show up in the symbol table through its aliases “foo$gpu3”, “foo$gpu4”, “foo$gpu7”. When the linker is only looking for “foo$gpu1”, it will ignore all other foo implementations and just pick the one aliased to “foo$gpu1”.
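Both ideas reduce to small lookup structures at link time. A toy sketch (all archive names, GPU names, and mangled symbols are invented for illustration):

```python
# Idea 1: a table mapping each GPU to the ordered archive list to search.
SEARCH_ORDER = {
    "gpu1": ["libgpu1.a", "libfamily_r7xx.a", "libcommon.a"],
    "gpu3": ["libgpu3.a", "libcommon.a"],
}

def archives_for(gpu):
    return SEARCH_ORDER[gpu]

# Idea 2: one archive whose symbol table lists every alias of every
# implementation; the linker simply looks up the mangled name.
SYMTAB = {
    "foo$gpu1": "foo_r700.o",
    "foo$gpu3": "foo_generic.o",
    "foo$gpu4": "foo_generic.o",
    "foo$gpu7": "foo_generic.o",
}

def resolve(symbol, gpu):
    return SYMTAB.get("%s$%s" % (symbol, gpu))

assert archives_for("gpu3") == ["libgpu3.a", "libcommon.a"]
assert resolve("foo", "gpu1") == "foo_r700.o"
assert resolve("foo", "gpu4") == "foo_generic.o"
assert resolve("foo", "gpu9") is None       # no implementation for this GPU
```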

BTW, Apple/Darwin has a similar issue with supporting multiple CPUs. Our solution is “fat” archives. The ranlib tool sorts all archive members by cpu, builds “thin” archive libraries for each cpu, then concatenates the thin libraries together with a “fat header” which specifies the file offset and size of each thin archive. The linker, when it comes across a fat static library, looks at the fat header, then seeks along to the contained thin archive relevant to the cpu type being linked. This scheme works well, but it can produce large files because there is no attempt to reduce duplicates.
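The real Mach-O fat header records cputype/cpusubtype integers rather than names; the sketch below uses a simplified (name, offset, size) table purely to show the seek-to-slice idea:

```python
import struct

def make_fat(thin):
    """thin: dict cpu-name -> thin-archive bytes. Simplified layout:
    uint32 count, then (name[8], offset, size) entries, then the slices."""
    names = sorted(thin)
    table_end = 4 + len(names) * 16
    out, blob, offset = struct.pack(">I", len(names)), b"", table_end
    for n in names:
        out += n.encode().ljust(8) + struct.pack(">II", offset, len(thin[n]))
        blob += thin[n]
        offset += len(thin[n])
    return out + blob

def slice_for(fat, cpu):
    (count,) = struct.unpack_from(">I", fat, 0)
    for i in range(count):
        pos = 4 + i * 16
        name = fat[pos:pos + 8].rstrip().decode()
        off, size = struct.unpack_from(">II", fat, pos + 8)
        if name == cpu:
            return fat[off:off + size]      # seek straight to the thin archive
    return None

fat = make_fat({"armv7": b"AAA", "x86_64": b"BBBB"})
assert slice_for(fat, "x86_64") == b"BBBB"
assert slice_for(fat, "ppc") is None
```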

-Nick

Note that I plan to remove llvm/Bitcode/Archive once Object/Archive is
capable of replacing it. The llvm tools that don't write archive
files have already been switched over to it. Object/Archive already
supports MemoryBuffer as a source for the data.

I support Nick's solution over extending the archive format
internally. I would also support adding a true wrapper. What I don't
like is not knowing that it's a special archive by just looking at the
file header.

- Michael Spencer


Let me see if I understand your need here. You are dynamically generating code and need to link it. The linking step requires some support routines which makes sense to have in a static library. Since this must work on machines not set up with developer tools, you are packaging the static library inside a DLL/DSO. In addition, with all the minor variations of GPUs, having a separate archive for every GPU type would be too large, so you need some way to remove duplicates of support functions.

Close. But it isn’t “support routines”. It’s the entire OpenCL “built-in” library… thousands and thousands of functions, some tiny, some large, some merely aliases.

If the above summary is close, then here are two other ideas that avoid the need for archive/TOC changes:

  1. Have lots of little archives, which removes duplicates. Give each archive a unique name, then have a lookup table which lists which sequence of archives to use for which specific GPU. When linking for a particular GPU, you pass the linker that particular sequence of little archives to search.

This is close to what we have now, but we find it unworkable because the Linker doesn’t take a “set” of libraries. So if library A calls something in B, which calls something in A, you end up having to make multiple passes over the libraries, which isn’t efficient and creates other headaches. This is what we are trying to ‘solve’ by going to a single library, but we want to preserve the space advantage of the current approach.
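To make the multi-pass problem concrete, here is a toy model of pass-at-a-time archive searching (library contents are invented): each pass walks the ordered libraries once, and a back-reference from a later library into an earlier one forces another full pass.

```python
def link(roots, libraries):
    """libraries: ordered list of dicts, symbol -> set of symbols it needs."""
    defined, undefined, passes = set(), set(roots), 0
    changed = True
    while undefined and changed:
        changed, passes = False, passes + 1
        for lib in libraries:               # one pass over the ordered list
            for sym in sorted(undefined):
                if sym in lib:
                    undefined.discard(sym)
                    defined.add(sym)
                    undefined |= lib[sym] - defined
                    changed = True
    return defined, undefined, passes

lib_a = {"a": {"b"}, "c": set()}            # A defines a (needs b) and c
lib_b = {"b": {"c"}}                        # B defines b (needs c, back in A)
defined, undefined, passes = link({"a"}, [lib_a, lib_b])
assert undefined == set() and defined == {"a", "b", "c"}
assert passes == 2                          # the back-reference cost a second pass
```

A single merged library collapses this to one pass, which is exactly the motivation stated above.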

  2. Have one giant archive per GPU family and use a name-mangling scheme to filter. For instance, the compiler emits references to support routine “foo” as “foo$gpu1”. Then you construct the support libraries with aliases for each support function. So a particular implementation of foo in an archive may show up in the symbol table through its aliases “foo$gpu3”, “foo$gpu4”, “foo$gpu7”. When the linker is only looking for “foo$gpu1”, it will ignore all other foo implementations and just pick the one aliased to “foo$gpu1”.

I’ll have to think about this… We try to treat the IR emitted by the front-end as pretty family/device neutral to the extent that’s possible. We’d want to implement this as a pass between the linking of user-produced code and the subsequent linking of the OpenCL library. And we’d obviously have to create a tool that takes the current library members and aliases them to the ‘correct’ unique implementation.

BTW, Apple/Darwin has a similar issue with supporting multiple CPUs. Our solution is “fat” archives. The ranlib tool sorts all archive members by cpu, builds “thin” archive libraries for each cpu, then concatenates the thin libraries together with a “fat header” which specifies the file offset and size of each thin archive. The linker, when it comes across a fat static library, looks at the fat header, then seeks along to the contained thin archive relevant to the cpu type being linked. This scheme works well, but it can produce large files because there is no attempt to reduce duplicates.

Sounds pretty similar in terms of the problem… except we are motivated to avoid duplicates because a lot of the members are truly identical, whereas the “fat” archive probably has few identical members at the binary level because of instruction set and calling convention differences. This is most similar to my first considered approach.

Thanks,
Richard


Note that I plan to remove llvm/Bitcode/Archive once Object/Archive is
capable of replacing it. The llvm tools that don't write archive
files have already been switched over to it. Object/Archive already
supports MemoryBuffer as a source for the data.

I had meant to ask in my email about the apparent duplication of Archive in Bitcode and Object libs… Good to know. Since ranlib currently uses Bitcode, that's what I've been focusing on, but I had noticed the Object/Archive.h.

I support Nick's solution over extending the archive format
internally. I would also support adding a true wrapper. What I don't
like is not knowing that it's a special archive by just looking at the
file header.

I will think about Nick's second idea some more.
When you say a "true wrapper", are you suggesting a specific "first member" type that encapsulates the multiple symbol tables and serves as a flag? I'm not sure what you're getting at here, since the only "file header" an overall archive file has is the 8-character !<arch>\n signature. Everything else (string table, various kinds of symbol tables) is merely an optional "special member", noted by its "special" filename. My "hiding" special member is of the same kind... I had tentatively chosen #_MULTI_SYMTAB_# as the special name of this "hiding" special member. The changes to the Bitcode Archive implementation to handle reading it are pretty simple and completely consistent with the existing "special member" handling code.

Writing these is much more complicated, though, which is why I was focusing on ranlib for prototyping. Once I had something working, I'd know better how or even whether to fold it in to the Archive class.

Thanks,
Richard

The gnu linker has the option:
--start-group archives --end-group
which tells the linker to keep searching the list of archives (not just make one pass). The darwin linker does this by default.

BTW, I had assumed you were using the “system linker”. But that won’t work if you are adding some new kind of archive table of contents. I guess that means you have some in-process code that provides linking functionality? If so, I’d like to understand this better to see if this is something lld <http://lld.llvm.org/> might want to support some day.

-Nick

Michael,
    I understand and agree that having 2 Archive implementations is something that should be fixed. Do you have a rough idea about when you might do the unification?
    Also, why unify around the Object/Archive implementation instead of the Bitcode/Archive implementation? What can the Object/Archive implementation "do" that can't be done with the Bitcode implementation?
    I ask because after looking at Archive in Object and Archive in Bitcode, the Archive in Bitcode seems much better documented than the Archive in Object, and feels (at least to me at first glance) like a somewhat better model of what Archives are. And as you've already noted, Object/Archive can't do writes...

Richard

Any thoughts or suggestions would be welcome.

Have you looked at thin archives? They are a GNU extension and support
an archive that has only references to object files, and I *think* they
can be recursive.

Cheers,
Rafael

GNU thin archives can be nested, but they aren't suitable for our use. A thin archive is basically an archive in which each member's contents are replaced with a reference to the module/file in the filesystem. Our problem relates to archives that are a stream of bytes in memory… we explicitly want to avoid file system accesses, while avoiding duplicated modules.

But thanks for the suggestion. I hadn't considered it.

Richard

I wrote Object/Archive for a couple reasons. The main reason was
performance. Bitcode/Archive parses the entire archive file up front
including the symbol table. Object/Archive does it lazily and uses
much less memory. The other reason is that Bitcode/Archive is heavily
focused on bitcode files. It even requires an LLVMContext to
construct. This was not optimal for my object file needs.

- Michael Spencer


I wrote Object/Archive for a couple reasons. The main reason was
performance. Bitcode/Archive parses the entire archive file up front
including the symbol table. Object/Archive does it lazily and uses
much less memory.

I guess performance measurement depends on use… For what ar and the Linker do, using some memory to structure the archive’s symbol table up front is worth the performance gain. As I read the Object/Archive implementation, each search for a symbol does a straight linear scan of the archive symbol table in its serialized form in memory… That’s probably not very performant for the archives I have in mind - hundreds of modules with many thousands of symbols.
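For comparison, the serialized GNU-style symbol table (the "/" member) is a count, a table of member offsets, and a run of NUL-terminated names; hashing it once makes every later lookup O(1) instead of a linear scan. A small sketch (the two symbols and their offsets are made up):

```python
import struct

def index_symtab(data):
    """Parse a GNU-ar-style symbol table into a dict: name -> member offset.
    Layout: uint32 count, count uint32 offsets, then NUL-terminated names
    (all integers big-endian)."""
    (count,) = struct.unpack_from(">I", data, 0)
    offsets = struct.unpack_from(">%dI" % count, data, 4)
    names = data[4 + 4 * count:].split(b"\x00")[:count]
    return {n.decode(): off for n, off in zip(names, offsets)}

raw = struct.pack(">I", 2) + struct.pack(">II", 120, 480) + b"sin\x00cos\x00"
index = index_symtab(raw)                   # build once, O(1) lookups after
assert index == {"sin": 120, "cos": 480}
```

The up-front cost is one walk over the member; for hundreds of modules with many thousands of symbols, that amortizes quickly.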

There are two static methods in Bitcode/Archive… one does read the entire archive (OpenAndLoad), but the other reads only through the usual early members, including the symbol table (OpenAndLoadSymbols). The Linker class uses this API, while ar and ranlib use OpenAndLoad.

The other reason is that Bitcode/Archive is heavily
focused on bitcode files. It even requires an LLVMContext to
construct. This was not optimal for my object file needs.

I think we could refactor Bitcode/Archive relatively simply to avoid requiring the Context at construction time, and require it only for methods that want a Module returned from the Archive. We could change the constructor to accept a pointer to a Context instead of a reference, with a default of 0. We could then add an optional parameter to the few operations that require a Context. And if, at the point we actually need a Context, we still don’t have one, we use getGlobalContext() to acquire one.

It feels like what’s desired is splitting Bitcode/Archive in two… a low-level, LLVM-neutral piece, and a higher-level, LLVM-dependent piece. Eliminating Context from the constructor goes part way to making Bitcode/Archive usable for low-level uses. Eliminating the “automatic” production of the symbol table requires a bit more thought, but should be doable without too much effort. There are really only a handful of public methods that depend on it, and they have a lot of overlap with those that need a Context. Maybe to support the slow, low-memory search of the symbol table member, we could add an additional method to do that. But I haven’t found any users of Object/Archive’s findSym() method yet, so maybe we don’t need that at all.

And, of course, there is the static method to produce an Archive from memory instead of a file, which is pretty straightforward.

What do you think?

Richard

We are using the Linker class, which uses the Bitcode/Archive class, which is why I was focusing on that class to support the new functionality.
Once I have Bitcode/Archive enhanced, and llvm-ar generates the new “multi-archives”, I was going to add a method to the Linker class to support linking “in-memory archives”… not strictly part of the Archive enhancements, but something else we need.

Richard

  1. Have lots of little archives which removes duplicates. Give each archive a unique name, then have a lookup table which lists which sequence of archives to use for which specific GPU. When linking for a particular GPU, you pass the linker that particular sequence of little archives to search.
  2. Have one giant archive per GPU family and use a name-mangling scheme to filter. For instance, the compiler emits references to support routine “foo” as “foo$gpu1”. Then you construct the support libraries with aliases for each support function. So a particular implementation of foo in an archive may show up in the symbol table through its aliases “foo$gpu3”, “foo$gpu4”, “foo$gpu7”. When the linker is only looking for “foo$gpu1”, it will ignore all other foo implementations and just pick the one aliased to “foo$gpu1”.

We’ve figured out that this doesn’t really work for us, unfortunately.
The OpenCL library has many, many functions that differ only in the size of the vector they accept. There’s sin for float, float2, float3, float4, float8, and float16, for example. And the doubles as well. We have a “common” vector expansion archive that defines most of these in terms of their smaller brethren, though we often have an x4-specific function. But the bottom-level x1 sin is often “per target GPU”. The R700 variant is different from the R800 variant, which is different from the SI variant. The conversion functions are similarly situated and have various rounding-mode variants as well, which often alias.
As we currently have the library implemented, for any GPU family, we start by including the library specific to that GPU, then we include the next-newest family’s library, and so on, until we arrive at the “if all else fails” library, which is where the “default” conversions and vector expansion functions are defined. So a cos(float16) call would, currently, search, arrive at the bottom layer, resolve to a pair of cos(float8) calls, search starting at the top again, arrive at the bottom, resolve to a pair of cos(float4) calls, which may be defined in any one (or more than one) of the “per generation” libraries, and so on. Virtually ALL of the functions end up calling family-specific implementations somewhere along the line.
The net effect is that to discriminate between the cos(float16) that ends up resolving to the correct cos(float) for a particular GPU family means defining GPU-specific variations of ALL of the functions that are currently defined in the “common” library. This results in almost everything being GPU-specific, which wipes out any advantage at all… we might as well just have N completely separate libraries.

Richard