[lld] Representation of lld::Reference with a fake target

Hi,

I need advice on the implementation of a very specific kind of relocation
used by the MIPS N64 ABI. As usual, the main problem is how to pass
target-specific data over the Native/YAML conversion barrier.

In this ABI the relocation record's r_info field in fact consists of five
subfields (see the struct sketch just after this list):
* r_sym - symbol index
* r_ssym - special symbol
* r_type3 - third relocation type
* r_type2 - second relocation type
* r_type - first relocation type

Up to three of these relocations are applied one by one. The first
relocation uses the addend from the relocation record. Each subsequent
relocation takes as its addend the result of the previous operation. Only
the final operation actually modifies the relocated location. The first
relocation uses as its reference the symbol specified by the r_sym field.
The third relocation assumes a NULL symbol.
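
A hypothetical sketch of that chaining, assuming the record layout above
(applyOne(), symbolValue(), specialValue() and writeResult() are made-up
helpers, not lld APIs):

  #include <cstdint>

  // Chain up to three relocations; R_MIPS_NONE entries leave the
  // intermediate value unchanged.
  void applyN64Relocation(const Elf64_Mips_Rela &rela, uint8_t *loc) {
    uint64_t value = rela.r_addend; // first operation uses the record's addend
    value = applyOne(rela.r_type,  symbolValue(rela.r_sym),   value, loc);
    value = applyOne(rela.r_type2, specialValue(rela.r_ssym), value, loc);
    value = applyOne(rela.r_type3, /*NULL symbol*/ 0,         value, loc);
    writeResult(loc, value);        // only the final operation touches memory
  }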

The most interesting case is the second relocation. It uses the special
symbol value given by the r_ssym field. This field can contain one of four
predefined values (spelled out in the enum after this list):
* RSS_UNDEF - zero value
* RSS_GP - value of gp symbol
* RSS_GP0 - gp0 value taken from the .MIPS.options or .reginfo section
* RSS_LOC - address of location being relocated
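
These constants have fixed values in the N64 ABI; the enum below is just
the four values spelled out:

  #include <cstdint>

  // RSS_xxx special-symbol values for the r_ssym field (MIPS N64 ABI).
  enum : uint8_t {
    RSS_UNDEF = 0, // no special value (resolves to zero)
    RSS_GP    = 1, // value of the gp symbol
    RSS_GP0   = 2, // gp0 value from .MIPS.options or .reginfo
    RSS_LOC   = 3  // address of the location being relocated
  };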

So the problem is how to store these four constants in the lld::Reference
object. RSS_UNDEF is obviously not a problem. To represent the RSS_GP value
I can set an AbsoluteAtom created for "_gp" as the reference's target. But
what about RSS_GP0 and RSS_LOC? I am considering the following approaches
but cannot select the best one:

a) Create an AbsoluteAtom for each of these cases and set them as the
   reference's target. The problem is that these atoms are fake and should
   not go to the symbol table. One more problem is selecting unique names
   for these atoms.
b) Use the two high bits of the lld::Reference::_kindValue field to encode
   the RSS_xxx value, then decode these bits in the RelocationHandler to
   calculate the result of the relocation. In that case the problem is how
   to represent a relocation kind value in YAML format; the simple
   xxxRelocationStringTable::kindStrings[] array will not satisfy us.
c) Add one more field to the lld::Reference class, something like the
   DefinedAtom::CodeModel field (a sketch follows this list).
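
To make (c) concrete, here is a rough sketch of what such a field might
look like; the name _tag and its accessors are invented for illustration,
not existing lld code:

  #include <cstdint>

  namespace lld {

  class Reference {
  public:
    // ... existing kind/offset/addend/target interface ...

    // Hypothetical target-specific tag, e.g. the RSS_xxx value on MIPS
    // N64. The YAML/Native readers and writers would (de)serialize it
    // like any other base-class field.
    uint32_t tag() const { return _tag; }
    void setTag(uint32_t tag) { _tag = tag; }

  private:
    uint32_t _tag = 0;
  };

  } // namespace lld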

Any advice, ideas, and/or objections are much appreciated.

Yes, it is always a pain when we have to preserve information over the round-trip conversion from/to Native. The Native format is really able to handle only fields defined on the base Atom or Reference, not on a subclass of them. This is, well, hard. It is a roadblock that is always there whenever we do something with Reference. I think this needs fixing.

I once proposed, or at least suggested, that we remove the Native file format from LLD. I still think that’s not a bad idea at all – no one is using Native, so it currently delivers no value, but it imposes a burden on developers. This is not a good situation.

If we still want to maintain Native, conversion from/to Native needs to be easier to support. I can think of a few reasons why supporting Native is hard:

  1. It is not extensible. It only recognizes the base classes, not subclasses of Atom or Reference.

  2. The file format is designed to be mmap()'able, so that you can just mmap the entire file and cast the buffer to a struct to read it. That would indeed be very fast – but it can’t support any optional fields (see the illustration below).
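
To illustrate the mmap-and-cast reading style being described here
(NativeFileHeader and mapHeader() are stand-ins, not lld's actual Native
layout):

  #include <cstddef>
  #include <cstdint>
  #include <sys/mman.h>

  // A stand-in header. Only fixed-size fields can appear here; an
  // optional or variable-length field would break the cast below.
  struct NativeFileHeader {
    char magic[8];
    uint32_t chunkCount;
  };

  const NativeFileHeader *mapHeader(int fd, size_t fileSize) {
    void *p = mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
    return p == MAP_FAILED ? nullptr
                           : reinterpret_cast<const NativeFileHeader *>(p);
  }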

If I were to design it from scratch, I’d probably use a serializable data structure like Protocol Buffers or Thrift to represent Atoms and References, so that all that conversion is done automatically. They are fast enough and easy to use.

That being said, I think the short-term solution for your need is to just add a new field to Reference.

The only way currently is to create a new reference, unless we can think of adding some target-specific metadata to the Atom model.

This has come up over and over again; we need something in the Atom model to store information that is target-specific.

Shankar Easwaran

Can we remove Native format support? I’d like to get input from anyone who wants to keep the current Native format in LLD.

I am fine with it. I hope you are not planning to remove YAML.

I’m not planning to remove YAML. YAML is important for testing.

Sounds good.

Is the motivation for removing the native file format only that the format is not being tested or used?

Just removing the native file format will not ease the current situation, as the information still needs to be encoded in YAML too: anything that the reader needs to pass to the writer has to go through references.

We need to consider an option in the Atom model to carry target-specific data.

Shankar Easwaran

That’s a good point. The answer is probably to remove all subclasses of Reference from LLD, so that we don’t have to worry about teaching the YAML reader/writer how to deal with target-specific references.

We only have two subclasses of Reference, ELFReference and COFFReference. It looks like they don’t really add value over the base class, so it should be easy to remove them.

> Can we remove Native format support? I'd like to get input from anyone who
> wants to keep the current Native format in LLD.

One of the original goals for LLD was to provide a new object file
format for performance. The reason it is not currently used is that
we've yet to teach LLVM to generate it, and we haven't done that
because it hasn't been finalized yet. The value it currently provides
is catching stuff like this, so we can fix it now instead of down the
road when we actually productize the native format.

As for the specific implementation of the native format, I'm open to
an extensible format, but only if the performance cost is low.

- Michael Spencer

There are two questions.

Firstly, do you think the on-disk format needs to be compatible with a C++ struct so that we can cast that memory buffer to the struct? That may be super-fast, but it also comes with many limitations. It’s hard to extend, for example. Every time we want to store variable-length objects we need to define a string-table-like data structure. And I’m not even sure that it’s fastest – because mmap’able objects are not very compact on disk, slow disk IO could become the bottleneck compared with a more compact file format. I believe Protobufs or Thrift are fast enough or might even be faster.

Secondly, do you know why we are dumping a post-linked object file to Native format? If we want to have a different kind of object file format, we would want a tool to convert an object file in an existing file format (say, ELF) to “native”, and to teach LLD how to read that file. Currently we are writing a file in the middle of the linking process, which doesn’t make sense to me.

> There are two questions.
>
> Firstly, do you think the on-disk format needs to be compatible with a
> C++ struct so that we can cast that memory buffer to the struct? That
> may be super-fast, but it also comes with many limitations. It's hard
> to extend, for example. Every time we want to store variable-length
> objects we need to define a string-table-like data structure. And I'm
> not even sure that it's fastest -- because mmap'able objects are not
> very compact on disk, slow disk IO could become the bottleneck compared
> with a more compact file format. I believe Protobufs or Thrift are fast
> enough or might even be faster.

I'm not sure here, although I do question whether the object files will
even need to be read from disk in your standard edit/compile/debug loop
or on a build server. I believe we'll need real data to determine this.

> Secondly, do you know why we are dumping a post-linked object file to
> Native format? If we want to have a different kind of *object* file
> format, we would want a tool to convert an object file in an existing
> file format (say, ELF) to "native", and to teach LLD how to read that
> file. Currently we are writing a file in the middle of the linking
> process, which doesn't make sense to me.

This is an artifact of having the native format before we had any
readers. I agree that it's weird and not terribly useful to write to
native format in the middle of the link, although I have found it
helpful to output YAML. There's no need to be able to read it back in
and resume, though.

Ideally lld -r would be the tool we use to convert COFF/ELF/MachO to
the native format.

- Michael Spencer

> This is an artifact of having the native format before we had any
> readers. I agree that it's weird and not terribly useful to write to
> native format in the middle of the link, although I have found it
> helpful to output YAML. There's no need to be able to read it back in
> and resume, though.

Even for YAML it doesn't make much sense to write it to a file and read it
back from the file in the middle of the link, does it? I found that being
able to output YAML is useful too, but round-tripping is a different thing.
In the middle of the process, we have a bunch of additional information that
doesn't exist in the input files and doesn't have to be output to the link
result. The ability to serialize that intermediate result is not useful.

Shankar, you added these round-trip tests. Do you have any opinion?

> Even for YAML it doesn't make much sense to write it to a file and read
> it back from the file in the middle of the link, does it? I found that
> being able to output YAML is useful too, but round-tripping is a
> different thing. In the middle of the process, we have a bunch of
> additional information that doesn't exist in the input files and
> doesn't have to be output to the link result. The ability to serialize
> that intermediate result is not useful.

Completely agree here. We should round-trip the input instead.

- Michael Spencer

> Completely agree here. We should round-trip the input instead.

Let me remove the round-trip passes. I'll send a patch for review, so let's
discuss there.

Doing it for every input file is not useful, as some of the input files are not representable in YAML form. Shared libraries are one example.

The reason I made the YAML pass run just before the writer is that the intermediate result is more complete there: all atoms have been resolved at that point and the state of all atoms is much saner.

It was also easy to use the pass manager; very little code was needed to achieve what we are trying to verify – that all the information reaching the writer is passed through references or atom properties.

Shankar Easwaran

I think no one is opposing the idea of reading and writing YAML.

The problem here is why we need to force all developers to write code to serialize intermediate data in the middle of a link, which nothing except the round-trip passes needs.

The intermediate result is also what is actually written to disk when --output-filetype=yaml or native is chosen.

Writing YAML and reading it back is not doable at the point where input files are converted to atoms, because some of the input files are not representable in YAML format.

Not all input files have to be representable in YAML/Native format. There are many unrealistic use cases there. No one wants to write an executable file in Native because there’s no operating system that can run such a file. The same goes for YAML, and for the combination of .so files and Native/YAML, unless we have an operating system whose loader is able to load a YAML .so file.

We might want to write a Native/YAML file as a re-linkable object file (in GNU terms, the -r option), but that’s an object file.

So it’s totally okay if some input file type is not representable in YAML/Native. Some use cases are not real. We can’t force all developers to spend their time supporting unrealistic use cases.

My 2c: maybe we should not try to put all target-specific object file
formats into a single YAML/Native representation. Let's define a
universal format for the file "header" of the YAML/Native representation,
and probably some top-level structures common to all targets, and allow
target-specific code to extend these formats arbitrarily. For example,
the code in ReaderWriter/ELF would know how to convert ELF object files
into YAML/Native form. In that case we in fact get somewhat incompatible
YAML/Native formats for ELF, PECOFF, MachO, etc., but I think that is
not a problem.

We are modeling target-specific functionality using references. Doesn't your idea defeat the purpose of the atom model? Atoms are mostly target-neutral, and the YAML/Native format represents just an atom. Having a derived class for atoms would have an impact on the way lld is tested, IMO.

In my opinion we could continue to model using references, and add some metadata to the atom for the cases references are not able to model.