[RFC] Thoughts on a bitcode symbol table

This is about https://llvm.org/bugs/show_bug.cgi?id=27551.

Currently there is no easy way to get symbol information out of
bitcode files. One has to read the module and mangle the names. This
has a few problems:

* During LTO we have to create the Module earlier.
* There is no convenient spot to store flags/summary.
* Simpler tools like llvm-nm have massive dependencies because Object
depends on MC to find asm defined symbols.

To fix this I think we need a symbol table. The desired properties are:

* Include the *final* name of symbols (_foo, not foo).
* Not be compressed so that we can keep StringRefs to the names.
* Be easy to parse without an LLVMContext.
* Include names created by inline assembly.
* Include other information a linker or nm would want: linkage,
visibility, comdat.
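
To illustrate what "final name" means in the first bullet, here is a
minimal sketch. ManglingRules and finalSymbolName are hypothetical names
made up for this example, not LLVM API, and only the global prefix and
the leading '\1' escape are modeled. A real mangler handles more cases.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical, simplified stand-in for a target's mangling rules.
// Only the global prefix is modeled here.
struct ManglingRules {
  char GlobalPrefix; // '\0' means no prefix (e.g. ELF), '_' for MachO
};

// Returns the name as it would appear in a native symbol table.
std::string finalSymbolName(std::string_view IRName, ManglingRules Rules) {
  assert(!IRName.empty() && "symbols must have a name");
  // A leading '\1' in an IR name asks for the rest to be used verbatim.
  if (IRName.front() == '\1')
    return std::string(IRName.substr(1));
  std::string Result;
  if (Rules.GlobalPrefix != '\0')
    Result.push_back(Rules.GlobalPrefix);
  Result.append(IRName);
  return Result;
}
```

With a '_' prefix the IR name foo becomes _foo. Storing that result in
the symbol table is what would let a reader skip the mangling step.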

The first question is: where should we store it? Some options I thought about:

* Use the existing support for putting bitcode in a section of a
native file and use the file's symbol table.
* Use a custom wrapper over the .bc
* Encode it with records/blocks in the .bc

The first option would be a bit annoying as we are sure to want to
represent more than the native files have. It is also a bit odd for
cross compiling. Do we create a MachO when the bitcode is for darwin
and an ELF when it is for Linux? It would also mean that llvm-as would
depend on a library to create these files.

The second option is tempting for parsing simplicity, but introduces
duplication as the names for regular global values would be stored
twice (once mangled, once not). The symbol table would also use a
string table, which is a concept I think would improve the .bc format.

So my current preference is for the last one. Encode the symbol table
in the .bc. This means that lib/Object will depend on BitReader, but
not more than that.

The next issue is what to do with .ll files. One option is to change
nothing and have llvm-as parse module level inline asm to create symbol
entries. That would work, but sounds odd. I think we need directives
in the .ll so that symbols created or used by inline asm can be
declared.

Yet another issue is how to handle a string table in .bc. The problem
is not with the format, it is with StreamingMemoryObject. We have to
keep the string table alive while the rest of the file is read, and
the StreamingMemoryObject can reallocate the buffer.

I can think of two solutions:

* Drop it. The one known user is PNaCl and it is moving to Subzero, so
it is not clear if this is still needed.

* Change the representation so that each read is required to be
contiguous and not be freed. It would basically store a vector of
std::pair<offset, char*> and we would make sure the string table is
read as a blob in a single read.
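
A rough sketch of the second solution, assuming each read is copied into
its own allocation. ChunkStore and its methods are hypothetical names for
illustration, not the StreamingMemoryObject interface, and the sketch
records a size next to each offset in addition to the pair described
above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical chunk store: every read lands in its own allocation, which
// is never moved or freed while the store is alive.
class ChunkStore {
  struct Chunk {
    uint64_t Offset; // position of the chunk in the stream
    size_t Size;
    std::unique_ptr<char[]> Data;
  };
  std::vector<Chunk> Chunks; // in stream order, since reads are appended
  uint64_t EndOffset = 0;

public:
  // Copies one contiguous read into the store and returns a stable pointer.
  const char *append(const char *Bytes, size_t Size) {
    assert((Bytes || Size == 0) && "null read with non-zero size");
    auto Data = std::make_unique<char[]>(Size);
    std::memcpy(Data.get(), Bytes, Size);
    const char *Stable = Data.get();
    Chunks.push_back({EndOffset, Size, std::move(Data)});
    EndOffset += Size;
    return Stable;
  }

  // Returns a pointer to [Offset, Offset + Size), or nullptr if that range
  // does not lie within a single chunk.
  const char *get(uint64_t Offset, size_t Size) const {
    for (const Chunk &C : Chunks) {
      if (Offset < C.Offset || Offset - C.Offset > C.Size)
        continue;
      uint64_t Skip = Offset - C.Offset;
      if (Size <= C.Size - Skip)
        return C.Data.get() + Skip;
    }
    return nullptr;
  }
};
```

Growing the vector only moves the owning pointers, not the bytes they
point at, so a reference into the string table stays valid as long as the
table arrived in one read.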

With all that sorted, I think the representation can be fairly simple:

* a top level record stores the string table as a single blob. This
can be used for any string in the .bc, not just the symbol table.
* a sub block contains the symbol table with one record per symbol. It
would include an offset in the string table, the name size, the
linkage, etc. Being a record makes it easy to extend.
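
As a sketch of how a reader might see this layout once the blob and the
records are decoded: the types, field widths and Linkage values below are
assumptions for illustration, not a proposed encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical linkage values; the real encoding is up to the format.
enum class Linkage : uint8_t { External, Weak, Internal };

// One record per symbol: fixed-size fields that index the string table.
struct SymbolRecord {
  uint32_t NameOffset;
  uint32_t NameSize;
  Linkage Link;
};

struct SymbolTable {
  std::string_view Strings; // the string table blob; must outlive the table
  std::vector<SymbolRecord> Records;

  // Returns the name without copying, or nullopt if the record points
  // outside the blob.
  std::optional<std::string_view> name(const SymbolRecord &R) const {
    if (R.NameOffset > Strings.size() ||
        R.NameSize > Strings.size() - R.NameOffset)
      return std::nullopt;
    return Strings.substr(R.NameOffset, R.NameSize);
  }
};
```

Because the names are stored uncompressed, name() can hand out a view
straight into the blob, which is the StringRef property asked for above.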

Cheers,
Rafael

Hi Rafael

Thanks for bringing this up. libObject linking libCore is something I’ve been hoping someone could find a way to fix.

The plan as you’ve described sounds good to me.

One thing I had considered when I looked at the code was whether it would make sense to have a base class in BitReader which can just read a SymbolicIRFile. In libObject, IRObjectFile inherits from SymbolFile as we only really want the symbols from it. It would be interesting to see if BitReader could mirror this. Then we could use the IR-less Symbolic BitReader from libObject to just crack the symbol table.

Anyway, not something we necessarily need immediately, but would be interesting to see if one day we can do more in BitReader without creating IR. I think this is what you were alluding to when you said you shouldn’t need an LLVMContext.

Cheers,
Pete

This would be great for ThinLTO as well:

* During LTO we have to create the Module earlier.

During the ThinLink step we could avoid creating the Module altogether,
only the parallel backends would need the Module.

* There is no convenient spot to store flags/summary.

Right now we are duplicating some info like the linkage type into the
summary since it isn't available in the ValueSymbolTable (which I assume
this would subsume?)

Thanks,
Teresa

It should, yes. The general idea is for it to include any symbol info a
linker might want during resolution.

Cheers,
Rafael