[lld] Atom object model refactoring.

I've run into some issues with the current atom object model that I
would like to fix.

The current four atom kinds are not expressive enough. We need to be able to
serialize a larger set of atoms, many of which are format-specific.

The set of common atoms (shared between all formats) should be the set
that the resolver requires to work. Judging from the source code,
SharedLibrary is not part of that set.

My driving use case is the Import Address Table (IAT) in
PE/COFF. It is a section created by the writer that specifies external
symbols to import and then acts as the GOT/PLT at runtime. Building
this table requires extra information to be maintained in an efficient
format. It also needs to be an atom so that relocations can point to
it. However, it does not have a well-defined size or content until the
table is complete.

The File interface for atoms should be changed to File::iterator
begin(); File::iterator end(); where File::iterator is some type of
iterator over Atom.
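
A minimal sketch of what such an interface could look like, assuming the
atoms are owned by the File in a single vector. The class layout, addAtom()
and atomNames() are hypothetical and only illustrate the begin()/end() shape
proposed above; they are not the existing lld classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for lld's Atom; only a name is modelled here.
class Atom {
public:
  explicit Atom(std::string name) : name_(std::move(name)) {}
  virtual ~Atom() = default;
  const std::string &name() const { return name_; }

private:
  std::string name_;
};

// Hypothetical File that exposes its atoms through a single iterator type.
class File {
  using Storage = std::vector<std::unique_ptr<Atom>>;

public:
  // Thin wrapper over the storage iterator that yields const Atom&.
  class iterator {
  public:
    explicit iterator(Storage::const_iterator it) : it_(it) {}
    const Atom &operator*() const { return **it_; }
    const Atom *operator->() const { return it_->get(); }
    iterator &operator++() { ++it_; return *this; }
    bool operator==(const iterator &o) const { return it_ == o.it_; }
    bool operator!=(const iterator &o) const { return it_ != o.it_; }

  private:
    Storage::const_iterator it_;
  };

  void addAtom(std::unique_ptr<Atom> a) { atoms_.push_back(std::move(a)); }
  iterator begin() const { return iterator(atoms_.begin()); }
  iterator end() const { return iterator(atoms_.end()); }

private:
  Storage atoms_;
};

// begin()/end() are enough for a range-based for loop over all atoms.
std::vector<std::string> atomNames(const File &f) {
  std::vector<std::string> names;
  for (const Atom &a : f)
    names.push_back(a.name());
  return names;
}
```

With one iterator over the base Atom type, callers that care about a
specific kind would have to filter or downcast themselves.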

As for serialization, each atom can have its own serialize/unserialize
function for both the Native format and YAML.

This would also let ContentType drop many of its format-specific
values. It would also allow us to get rid of isThumb as a
DefinedAtom-level attribute.

- Michael Spencer

Why is the IAT not just constructed in the PE/COFF Writer? Why does it need
to be an Atom? What relocations need to point to it? If they are relocations
created by the Writer, you are fine. If you mean that other atoms may
have References (in)to the IAT, then that is what SharedLibraryAtoms are
for. They are placeholders that expand to something real in the Writer.

Mach-O has all kinds of crazy data structures that are constructed in the Writer.
This is different from Darwin ld64, where the Writer actually created atoms for
its data structures and fed them back to the resolver. I wanted to avoid
that insanity in lld.

The Writer is handed a list of atoms from which to construct the executable.
It is free to create more atoms (private to the Writer) or just lay down data
structures, whichever is easier.
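
To make the "placeholders that expand to something real in the Writer" idea
concrete, here is a rough sketch of a Writer-private import table that hands
out one slot per imported symbol. SharedLibraryRef, ImportTableBuilder, and
the base-address and slot-size parameters are all hypothetical names chosen
for illustration. This is not how lld's PE/COFF Writer is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical view of what a SharedLibraryAtom tells the Writer:
// which symbol is needed and which shared library provides it.
struct SharedLibraryRef {
  std::string symbol;
  std::string library;
};

// Writer-private table: one slot per distinct (library, symbol) import.
// The table has no fixed size until every reference has been seen.
class ImportTableBuilder {
public:
  ImportTableBuilder(uint64_t base, uint64_t slotSize)
      : base_(base), slotSize_(slotSize) {}

  // Returns the address of the slot for this import, creating the slot
  // the first time the import is seen.
  uint64_t slotFor(const SharedLibraryRef &ref) {
    auto key = std::make_pair(ref.library, ref.symbol);
    auto it = slots_.find(key);
    if (it == slots_.end())
      it = slots_.emplace(key, slots_.size()).first;
    return base_ + it->second * slotSize_;
  }

  std::size_t slotCount() const { return slots_.size(); }
  uint64_t sizeInBytes() const { return slots_.size() * slotSize_; }

private:
  uint64_t base_;
  uint64_t slotSize_;
  std::map<std::pair<std::string, std::string>, uint64_t> slots_;
};
```

In this sketch the Writer would call slotFor() while applying fixups, so
references to a SharedLibraryAtom resolve to a table slot without the table
itself ever being an Atom.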

The File interface for atoms should be changed to File::iterator
begin(); File::iterator end(); where File::iterator is some type of
iterator over Atom.

As for serialization. Each atom can have its own serialize/unserialize
function for both the Native format and YAML.

The four Atom kinds each have very different attributes and are used
differently. That is why I broke them out into separate lists.

This would also change ContentType to not contain so many format
specific values. It would also allow us to get rid of isThumb as a
DefinedAtom level attribute.

I'm all for getting rid of isThumb(), but that seems orthogonal to your issue.

-Nick

[ Just trying to understand here. ]

So, what I'm hearing is that there are four different kinds of Atoms.
No more, no less - matching the enum in Atom.h.
Is that correct?

Stated that way, it makes the "four" seem arbitrary. It makes more sense once
you see that the four kinds are:

1) DefinedAtom
     95% of all atoms. This is a chunk of code or data.
2) UndefinedAtom
     This is a placeholder in object files for a reference to some atom
     outside the translation unit. During core linking it is usually
     replaced by (coalesced into) another Atom.
3) SharedLibraryAtom
     If a required symbol name turns out to be defined in a dynamic shared
     library (and not some object file), a SharedLibraryAtom is the
     placeholder Atom used to represent that fact. It is similar to an
     UndefinedAtom, but it also tracks information about the associated
     shared library.
4) AbsoluteAtom
     This is for embedded support where some stuff is implemented in ROM at
     some fixed address. This atom has no content. It is just an address
     that the Writer needs to fix up any references to point to.
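
The four kinds above could be modelled roughly as follows. This is a
simplified sketch: the enumerator names, the constructors, and the
hasContent() helper are assumptions made for illustration and do not mirror
the actual declarations in Atom.h.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tag for the four kinds described above.
enum class AtomKind { Defined, Undefined, SharedLibrary, Absolute };

class Atom {
public:
  virtual ~Atom() = default;
  AtomKind kind() const { return kind_; }
  const std::string &name() const { return name_; }

protected:
  Atom(AtomKind kind, std::string name)
      : kind_(kind), name_(std::move(name)) {}

private:
  AtomKind kind_;
  std::string name_;
};

// A chunk of code or data: the only kind that carries content.
class DefinedAtom : public Atom {
public:
  DefinedAtom(std::string name, std::vector<uint8_t> content)
      : Atom(AtomKind::Defined, std::move(name)),
        content_(std::move(content)) {}
  const std::vector<uint8_t> &content() const { return content_; }

private:
  std::vector<uint8_t> content_;
};

// Placeholder for a symbol defined outside the translation unit.
class UndefinedAtom : public Atom {
public:
  explicit UndefinedAtom(std::string name)
      : Atom(AtomKind::Undefined, std::move(name)) {}
};

// Like an UndefinedAtom, but it also records the providing shared library.
class SharedLibraryAtom : public Atom {
public:
  SharedLibraryAtom(std::string name, std::string library)
      : Atom(AtomKind::SharedLibrary, std::move(name)),
        library_(std::move(library)) {}
  const std::string &library() const { return library_; }

private:
  std::string library_;
};

// No content, just a fixed address that references get fixed up to.
class AbsoluteAtom : public Atom {
public:
  AbsoluteAtom(std::string name, uint64_t address)
      : Atom(AtomKind::Absolute, std::move(name)), address_(address) {}
  uint64_t address() const { return address_; }

private:
  uint64_t address_;
};

// Only DefinedAtoms contribute bytes to the output.
inline bool hasContent(const Atom &a) {
  return a.kind() == AtomKind::Defined;
}
```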

The readers generate a list of atoms from some object format.
The linker does a bunch of graph stuff on the atoms.
The writers get a list of (interconnected) atoms, and write an executable from that.

Am I missing something?

That is the high level summary.

-Nick

Nick --

I've got no problem with the number "four" ;-)

I just want to make sure that I'm understanding what's going on, so I can make intelligent commentary.
Thanks for the elaboration.

Can I paste that description into design.rst?

-- Marshall

Marshall Clow Idio Software <mailto:mclow.lists@gmail.com>

A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait).
        -- Yu Suzuki