Status of YAML IO?

Hey Nick, what's the status on YAML IO? The other thread seems to have died.

-- Sean Silva

I'm waiting on Michael Spencer's feedback.

The issues I know of right now are:

1) Should we structure YAML I/O as a more general I/O utility that could support reading and writing other data formats, such as JSON or plists? Right now, all the code is in the llvm::yaml:: namespace. If we plan to generalize later, only the YAML-specific parts should be in the llvm::yaml namespace.

You mentioned that YAML (1.2) is a superset of JSON. While that is true, there is some impedance mismatch: a) it does not help with writing JSON documents, and b) JSON requires all strings to be quoted, whereas YAML does not.

2) Can the template metaprogramming used by YAML I/O work in C++03? My current implementation was written in C++11, then back-ported to C++03 in a way that works with clang, but may not work with other compilers. Here is an example:

template <class T>
struct has_output
{
private:
  static double test(...);
#if __has_feature(cxx_access_control_sfinae)
  template <class U, class X>
  static auto test(U& u, X& x)
    -> decltype(U::output(std::declval<const X&>(), std::declval<llvm::raw_ostream&>()), char());
#else
  template <class U, class X>
  static __typeof__(U::output(std::declval<const X&>(), std::declval<llvm::raw_ostream&>()), char()) test(U& u, X& x);
#endif
public:
  static const bool value = (sizeof(test(std::declval<ScalarTraits<T>&>(), std::declval<T&>())) == 1);
};
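For compilers with neither expression SFINAE nor `__typeof__`, a sizeof-based fallback is possible in plain C++03. The sketch below is illustrative (the trait structs are hypothetical stand-ins for ScalarTraits specializations, not LLVM code), and it shows the cost of the approach: the exact signature of output() must be pinned down in a non-type template parameter, whereas the decltype version accepts any callable output() expression.

```cpp
// Hypothetical stand-ins for ScalarTraits specializations; not LLVM code.
struct TraitsWithOutput {
  static void output(const int &, int &) {}
};
struct TraitsWithoutOutput {};

// C++03-portable detection using only sizeof and overload resolution.
template <class T>
struct has_output_c03 {
private:
  typedef char yes;
  typedef double no;

  // Instantiable only when &U::output exists with this exact signature.
  template <class U, void (*)(const int &, int &)>
  struct signature_check {};

  // Chosen when the signature_check substitution succeeds.
  template <class U>
  static yes test(signature_check<U, &U::output> *);
  // Fallback when &U::output is ill-formed.
  template <class U>
  static no test(...);

public:
  static const bool value = sizeof(test<T>(0)) == sizeof(yes);
};
```

The trade-off is exactly the portability question raised above: this compiles on strict C++03 compilers, but loses the expressiveness of detecting an arbitrary well-formed call.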

-Nick

To better understand how a client would use YAML I/O, I've completely rewritten the ReaderYAML and WriterYAML in lld to use YAML I/O. The source code is now about half the size. But more importantly, the error checking is much, much better, and any time an attribute (e.g. of an Atom) is changed or added, there is just one place to update the YAML code instead of two places (the reader and writer).

This code requires no changes to the rest of lld. The traits based approach was thus non-invasive. It is able to produce yaml from existing data structures and when reading yaml recreate the existing data structures.

The example also shows how context sensitive yaml conversion is done, using io.getContext() to make conversion decisions.

The StringRef ownership works, but is a little clunky. Because one YAML stream can contain many documents, the ownership of the input file MemoryBuffer cannot be handed off to the newly created lld::File object (which would have allowed any StringRefs produced by the parse to be used as-is). Instead, whenever a trait needs to keep a StringRef it must make a copy of the underlying string, and the copies are owned by the generated lld::File object.

ReaderWriterYAML.cpp (41.6 KB)

To better understand how a client would use YAML I/O, I've completely rewritten the ReaderYAML and WriterYAML in lld to use YAML I/O. The source code is now about half the size. But more importantly, the error checking is much, much better, and any time an attribute (e.g. of an Atom) is changed or added, there is just one place to update the YAML code instead of two places (the reader and writer).

Fantastic!

The StringRef ownership works, but is a little clunky. Because one YAML stream can contain many documents, the ownership of the input file MemoryBuffer cannot be handed off to the newly created lld::File object (which would have allowed any StringRefs produced by the parse to be used as-is). Instead, whenever a trait needs to keep a StringRef it must make a copy of the underlying string, and the copies are owned by the generated lld::File object.

Copying seems like a performance problem waiting to happen. Maybe this
could be addressed through reference counting? What are "typical"
sizes of strings that would be copied?

-- Sean Silva

To better understand how a client would use YAML I/O, I've completely rewritten the ReaderYAML and WriterYAML in lld to use YAML I/O. The source code is now about half the size. But more importantly, the error checking is much, much better, and any time an attribute (e.g. of an Atom) is changed or added, there is just one place to update the YAML code instead of two places (the reader and writer).

Fantastic!

The StringRef ownership works, but is a little clunky. Because one YAML stream can contain many documents, the ownership of the input file MemoryBuffer cannot be handed off to the newly created lld::File object (which would have allowed any StringRefs produced by the parse to be used as-is). Instead, whenever a trait needs to keep a StringRef it must make a copy of the underlying string, and the copies are owned by the generated lld::File object.

Copying seems like a performance problem waiting to happen. Maybe this
could be addressed through reference counting?

There are no separate strings to reference count. There is just the one big MemoryBuffer which all the parsed StringRefs point into.

I don't think this is a general issue with YAML I/O. Most clients will not need to support multiple documents and will have a natural owner for the MemoryBuffer. The lld test cases use multiple YAML documents because lld is a linker and links multiple files.

That said, I think a better implementation for lld's ReaderWriterYAML would be a BumpPtrAllocator per File to hold any strings it needs to copy.
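As a rough sketch of that ownership scheme: strings that point into the shared MemoryBuffer are copied into a pool owned by the File, so they survive after the buffer goes away. This uses a plain std::list-backed pool in place of llvm::BumpPtrAllocator, so it is illustrative, not the actual lld code.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Illustrative stand-in for a per-File allocator: strings parsed out of
// the shared MemoryBuffer are copied here so they outlive the buffer.
// The real code would use an llvm::BumpPtrAllocator owned by the lld::File.
class StringCopyPool {
  std::list<std::string> storage; // std::list keeps element addresses stable

public:
  // Copy a substring that currently points into the input buffer.
  const std::string &copy(const char *data, std::size_t len) {
    storage.push_back(std::string(data, len));
    return storage.back();
  }
};
```

Because the pool lives as long as the File, every trait that needs to retain a name can copy it once here and hand out stable references.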

What are "typical"
sizes of strings that would be copied?

Most values in key/value pairs are converted to some enum value, so a temporary StringRef into the MemoryBuffer is fine. The only ones that need to remain as strings are the atoms' names (e.g. "malloc").

-Nick

Michael,

To validate the refactor of the YAML Reader/Writer using YAML I/O, I updated all the test cases to be compatible with YAML I/O. One issue that was gnarly was how to handle the test cases with archives. Currently, we have test cases like:

Hi Nick,

I had a few questions:

1) Is there a way to validate that the input file is of a valid format, that's defined by the YAML Reader?
2) How are you planning to represent section groups in the YAML?
3) How are you planning to support Atom-specific flags? Is there a way already?
     (This would be needed to group similar atoms together)
4) Are you planning to support representing shared libraries too in this model?
5) Are you planning to support dwarf information too?

Thanks

Shankar Easwaran

Do you mean something different from whether the YAML reader accepts it? Tons of files will be valid YAML syntactically. It is the semantic-level checking that is hard, and that is what YAML I/O does.

You mean the ELF concept of section groups in YAML encoded ELF? The YAML encoding of ELF (or COFF or mach-o) does not know anything deeper about the meaning of the files. It is just the bytes from each section and the entries in the symbol table. If a section group is a section of bytes which are interpreted as an array of symbol/section indexes, then the ELF encoded YAML just has the raw bytes for the section.

It is still an open question how to support platform specific Atom attributes. As much as possible we’d like to expand the Atom model to be a superset of all the platform specific flags. But there are some attributes that are very much tied to one platform. One idea is to just add a new Reference which has no target but its kind (and maybe addend) encode the platform specific attributes. The Reference kind is already platform specific.

Yes, we already support shared library atoms in yaml.

Debugging information is another big open question. The dwarf format is very much tied to the section model. Not only is the debug information put in sections with special names, but the dwarf debug info references code by its address in the .o files (the Atom model does not model addresses). I’m sure the lldb guys have some ideas on the direction they would like debug information to go. It may be that the Atom model has a different representation for debug info. And when generating a final linked image you can choose the debug format you want. A Writer could convert the debug info to dwarf if requested.

-Nick

Thanks for the reply Nick.

1) Is there a way to validate that the input file is of a valid format, that's defined by the YAML Reader?

Do you mean something different from whether the YAML reader accepts it? Tons of files will be valid YAML syntactically. It is the semantic-level checking that is hard, and that is what YAML I/O does.

Yes: the case where the YAML reader accepts the file, but it turns out not to be the format that ReaderYAML needs.

2) How are you planning to represent section groups in the YAML?

You mean the ELF concept of section groups in YAML encoded ELF? The YAML encoding of ELF (or COFF or mach-o) does not know anything deeper about the meaning of the files. It is just the bytes from each section and the entries in the symbol table. If a section group is a section of bytes which are interpreted as an array of symbol/section indexes, then the ELF encoded YAML just has the raw bytes for the section.

Ok.

3) How are you planning to support Atom-specific flags? Is there a way already?
    (This would be needed to group similar atoms together)

It is still an open question how to support platform specific Atom attributes. As much as possible we'd like to expand the Atom model to be a superset of all the platform specific flags. But there are some attributes that are very much tied to one platform. One idea is to just add a new Reference which has no target but its kind (and maybe addend) encode the platform specific attributes. The Reference kind is already platform specific.

How about if the atom flags could be overridden? The Atom flags could have a MIN/MAX range, and anything above the MAX or below the MIN would be platform specific, similar to how section indexes are handled.

4) Are you planning to support representing shared libraries too in this model ?

Yes, we already support shared library atoms in yaml.

Sorry, didn't check that.

5) Are you planning to support dwarf information too?

Debugging information is another big open question. The dwarf format is very much tied to the section model. Not only is the debug information put in sections with special names, but the dwarf debug info references code by its address in the .o files (the Atom model does not model addresses). I'm sure the lldb guys have some ideas on the direction they would like debug information to go. It may be that the Atom model has a different representation for debug info. And when generating a final linked image you can choose the debug format you want. A Writer could convert the debug info to dwarf if requested.

Wouldn't it be hard to get the source/line information right if the linker tries to write the debug information?

Thanks

Shankar Easwaran

3) How are you planning to support Atom-specific flags? Is there a way already?
   (This would be needed to group similar atoms together)

It is still an open question how to support platform specific Atom attributes. As much as possible we'd like to expand the Atom model to be a superset of all the platform specific flags. But there are some attributes that are very much tied to one platform. One idea is to just add a new Reference which has no target but its kind (and maybe addend) encode the platform specific attributes. The Reference kind is already platform specific.

How about if the atom flags could be overridden? The Atom flags could have a MIN/MAX range, and anything above the MAX or below the MIN would be platform specific, similar to how section indexes are handled.

I know the ELF file format has some ranges for various values that are specifically reserved for processors or "user" defined functionality. It serves the needs of ELF well. It allows processor and software tools teams to use ELF but work independently (and/or in secret) on new functionality without needing to coordinate with a central ELF owner.

But lld is different. It is not a file format. It is an API. If a particular processor needs to express something not captured in the Atom model, we should discuss what that functionality is and see if we can grow the Atom model. There may well be another processor that needs some similar functionality. If we added a generic uint32_t DefinedAtom::flags() method, I would be concerned that lld porters would be quick to just use the bits for whatever they need and not see if the Atom model needs expanding.

An example of something I added (but am not happy with) is DefinedAtom::isThumb(). This is something only applicable to ARM (and only if you care about interop of thumb and arm code).

Given that the Reference::Kind field is already platform specific, I'm leaning towards saying that the way to add platform specific atom attributes is to add a Reference with no target to the Atom with a Kind field that for that platform means whatever attribute you need.

5) Are you planning to support dwarf information too?

Debugging information is another big open question. The dwarf format is very much tied to the section model. Not only is the debug information put in sections with special names, but the dwarf debug info references code by its address in the .o files (the Atom model does not model addresses). I'm sure the lldb guys have some ideas on the direction they would like debug information to go. It may be that the Atom model has a different representation for debug info. And when generating a final linked image you can choose the debug format you want. A Writer could convert the debug info to dwarf if requested.

Wouldn't it be hard to get the source/line information right if the linker tries to write the debug information?

Just as hard as reading and writing dwarf debug information in general :wink:

Let me also mention why the debug information is not an issue for MacOS/iOS. Dwarf is designed to work with "dumb" linkers or "smart" linkers. A dumb linker just copies all the dwarf sections from all input files to the output file, and applies any relocations. This is simple, but the resulting dwarf is huge with tons of "dead" dwarf in it (because of coalescing by the linker). A smart linker knows how to parse up dwarf and optimize the combining of sections. The resulting dwarf is much smaller, but it takes a lot of computation to do the merge.

When we (Apple/darwin) switched from stabs to dwarf years ago, we decided to take a different approach. We realized a dumb linker would be slow because of all the I/O copying dwarf. A smart linker would be slow because of all the computation needed. So, instead the darwin linker just ignores all dwarf in .o files! Instead it writes "debug notes" to the final linked image that lists the paths to all the .o files used to create the image. This approach makes linking fast. Next, if you happen to run the program in the debugger, the debugger would see the debug notes and go read the .o files' dwarf information. Lastly, if you are making a release build, you run a tool called dsymutil on the final linked image. dsymutil finds the debug notes, parses the .o files' dwarf information then does all the computation to produce an optimal dwarf output file (we use a .dSYM extension). Later, if you need to debug a release build, you point the debugger at the .dSYM file.

Perhaps the initial approach you should take for ELF is to go the dumb-linker route. Have the ELF reader produce one Atom for each dwarf section with all the fixups/References needed. Then the ELF Writer will just concatenate those sections into the output file and apply the fixups.

-Nick

Hi Nick,

Thanks for your reply.

How about if the atom flags could be overridden? The Atom flags could have a MIN/MAX range, and anything above the MAX or below the MIN would be platform specific, similar to how section indexes are handled.

I know the ELF file format has some ranges for various values that are specifically reserved for processors or "user" defined functionality. It serves the needs of ELF well. It allows processor and software tools teams to use ELF but work independently (and/or in secret) on new functionality without needing to coordinate with a central ELF owner.

I am not sure if we will run out of possible enumerations for the individual atom flags across the many platforms :slight_smile:

But lld is different. It is not a file format. It is an API. If a particular processor needs to express something not captured in the Atom model, we should discuss what that functionality is and see if we can grow the Atom model. There may well be another processor that needs some similar functionality. If we added a generic uint32_t DefinedAtom::flags() method, I would be concerned that lld porters would be quick to just use the bits for whatever they need and not see if the Atom model needs expanding.

An example of something I added (but am not happy with) is DefinedAtom::isThumb(). This is something only applicable to ARM (and only if you care about interop of thumb and arm code).

Given that the Reference::Kind field is already platform specific, I'm leaning towards saying that the way to add platform specific atom attributes is to add a Reference with no target to the Atom with a Kind field that for that platform means whatever attribute you need.

To get the Atom flags, I felt the interface is:

a) more cumbersome
b) likely to be slow

5) Are you planning to support dwarf information too?

Debugging information is another big open question. The dwarf format is very much tied to the section model. Not only is the debug information put in sections with special names, but the dwarf debug info references code by its address in the .o files (the Atom model does not model addresses). I'm sure the lldb guys have some ideas on the direction they would like debug information to go. It may be that the Atom model has a different representation for debug info. And when generating a final linked image you can choose the debug format you want. A Writer could convert the debug info to dwarf if requested.

Wouldn't it be hard to get the source/line information right if the linker tries to write the debug information?

Just as hard as reading and writing dwarf debug information in general :wink:

Let me also mention why the debug information is not an issue for MacOS/iOS. Dwarf is designed to work with "dumb" linkers or "smart" linkers. A dumb linker just copies all the dwarf sections from all input files to the output file, and applies any relocations. This is simple, but the resulting dwarf is huge with tons of "dead" dwarf in it (because of coalescing by the linker). A smart linker knows how to parse up dwarf and optimize the combining of sections. The resulting dwarf is much smaller, but it takes a lot of computation to do the merge.

When we (Apple/darwin) switched from stabs to dwarf years ago, we decided to take a different approach. We realized a dumb linker would be slow because of all the I/O copying dwarf. A smart linker would be slow because of all the computation needed. So, instead the darwin linker just ignores all dwarf in .o files! Instead it writes "debug notes" to the final linked image that lists the paths to all the .o files used to create the image. This approach makes linking fast. Next, if you happen to run the program in the debugger, the debugger would see the debug notes and go read the .o files' dwarf information. Lastly, if you are making a release build, you run a tool called dsymutil on the final linked image. dsymutil finds the debug notes, parses the .o files' dwarf information then does all the computation to produce an optimal dwarf output file (we use a .dSYM extension). Later, if you need to debug a release build, you point the debugger at the .dSYM file.

Similar functionality was present in other linkers, like the Solaris and HP-UX linkers, too.

Perhaps the initial approach you should take for ELF is to go the dumb-linker route. Have the ELF reader produce one Atom for each dwarf section with all the fixups/References needed. Then the ELF Writer will just concatenate those sections into the output file and apply the fixups.

Not sure if the debuggers are going to understand the merged dwarf sections. The debugger may still need the optimized version. What do you think?

Thanks

Shankar Easwaran

Hi Nick,

Thanks for your reply.

How about if the atom flags could be overridden? The Atom flags could have a MIN/MAX range, and anything above the MAX or below the MIN would be platform specific, similar to how section indexes are handled.

Can you be more specific in your proposal? What method(s) would you add to the Atom class(es)? I thought you were thinking of a generic uint32_t DefinedAtom::flags() method where the meaning of each bit in the uint32_t returned was platform specific. How can that work with a min/max range?

I know the ELF file format has some ranges for various values that are specifically reserved for processors or "user" defined functionality. It serves the needs of ELF well. It allows processor and software tools teams to use ELF but work independently (and/or in secret) on new functionality without needing to coordinate with a central ELF owner.

I am not sure if we will run out of possible enumerations for the individual atom flags across the many platforms :slight_smile:

But lld is different. It is not a file format. It is an API. If a particular processor needs to express something not captured in the Atom model, we should discuss what that functionality is and see if we can grow the Atom model. There may well be another processor that needs some similar functionality. If we added a generic uint32_t DefinedAtom::flags() method, I would be concerned that lld porters would be quick to just use the bits for whatever they need and not see if the Atom model needs expanding.

An example of something I added (but am not happy with) is DefinedAtom::isThumb(). This is something only applicable to ARM (and only if you care about interop of thumb and arm code).

Given that the Reference::Kind field is already platform specific, I'm leaning towards saying that the way to add platform specific atom attributes is to add a Reference with no target to the Atom with a Kind field that for that platform means whatever attribute you need.

To get the Atom flags, I felt the interface is:

a) more cumbersome
b) likely to be slow

If you always make every DefinedAtom have at least one Reference and the first Reference is always the extra flags for your platform, there will be no searching. You just look at the first Reference.
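A rough sketch of that convention, with simplified stand-ins for lld's Reference and Atom types (kindIsThumb is a made-up kind value, not a real lld constant):

```cpp
#include <vector>

typedef unsigned RefKind;           // meaning of a kind is platform specific
const RefKind kindIsThumb = 0x1001; // hypothetical ARM-specific kind value

struct Reference {
  RefKind kind;
  const void *target; // 0 means: no target, this carries an attribute
};

struct Atom {
  std::vector<Reference> refs;

  // Convention: if an atom carries a platform-specific attribute, it is
  // encoded in a target-less Reference placed first, so checking for it
  // is a constant-time look at refs[0] rather than a search.
  bool hasPlatformAttr(RefKind k) const {
    return !refs.empty() && refs[0].target == 0 && refs[0].kind == k;
  }
};
```

The point is that the existing Reference machinery can carry the extra bits without widening the Atom interface itself.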

Can you give some concrete examples of flags you think you need? I'm imagining all that is needed is a few bits (like isThumb for ARM). It seems like you think there will be lots.

-Nick

Hi Nick,

Can you be more specific in your proposal? What method(s) would you add to the Atom class(es)? I thought you were thinking of a generic uint32_t DefinedAtom::flags() method where the meaning of each bit in the uint32_t returned was platform specific. How can that work with a min/max range?

The range of flags would be integers ranging from LOW_PROC .. HIGH_PROC.

The generic flags would be values less than LOW_PROC or greater than HIGH_PROC. Any value within the range LOW_PROC .. HIGH_PROC is os/platform specific.

What I was thinking was there could be a uint32_t flags() in DefinedAtom which returns the flags, and platforms could act on the meaning of the flags in their own pieces of code.

What do you think ?

Thanks

Shankar Easwaran

You still have not given an example of what information is missing in the current Atom model that is driving the need for this.

It sounds like your flags() returns a value - not a set of bits. Which means it can only be used for one thing. What if you need two or more kinds of information/attributes not in the Atom model? I don't see why LOW_PROC, HIGH_PROC is needed. If we decide there are new kinds of information/attributes that are general we would just define new methods on Atom, rather than define a value to be returned by flags().

-Nick

Hi Nick,

The range of flags would be integers ranging from LOW_PROC .. HIGH_PROC.

The generic flags would be values less than LOW_PROC or greater than HIGH_PROC. Any value within the range LOW_PROC .. HIGH_PROC is os/platform specific.

What I was thinking was there could be a uint32_t flags() in DefinedAtom which returns the flags, and platforms could act on the meaning of the flags in their own pieces of code.

What do you think ?

You still have not given an example of what information is missing in the current Atom model that is driving the need for this.

It sounds like your flags() returns a value - not a set of bits. Which means it can only be used for one thing. What if you need two or more kinds of information/attributes not in the Atom model? I don't see why LOW_PROC, HIGH_PROC is needed. If we decide there are new kinds of information/attributes that are general we would just define new methods on Atom, rather than define a value to be returned by flags().

There are two use cases that I can think of now:

1) flags: These are used to determine what the Atom contains in addition to the content; for example, that the Atom has
     a) a follow-on reference
     b) membership in a group that other atoms are part of

The flags could be used to determine if there is a follow-on atom or if the atom is part of a group.
Both would be cheaper than iterating through the reference list to figure out whether there is a follow-on reference or whether the atom is part of a group.

2) Atom-specific content types

This is where LOW_PROC and HIGH_PROC come in: there are content types which are architecture specific.

Currently there are many types defined within contentType which are operating-system specific. As more environments start using lld, I feel that many architectures would want to add their own.

Examples for GNU support would include:

a) checksum
b) hash
c) gnu prelink library list
etc.

I believe both would be solvable by using a 64-bit unsigned integer, where the lower half is used for the content type and the upper half for the flags. I don't think we would need more than 32 flags anytime soon, but at least there is the possibility of adding more flags.
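That packing is straightforward; a minimal sketch of the proposed split (the names are illustrative, not a proposed lld API):

```cpp
#include <stdint.h> // C++03-friendly; <cstdint> is C++11

// Proposed split: low 32 bits carry the content type,
// high 32 bits carry up to 32 boolean flags.
typedef uint64_t AtomBits;

inline AtomBits packAtomBits(uint32_t contentType, uint32_t flags) {
  return (static_cast<AtomBits>(flags) << 32) | contentType;
}

inline uint32_t atomContentType(AtomBits bits) {
  return static_cast<uint32_t>(bits & 0xffffffffu);
}

inline uint32_t atomFlags(AtomBits bits) {
  return static_cast<uint32_t>(bits >> 32);
}
```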

I think flags should be supported only by lld Core.

Thanks

Shankar Easwaran

Hi Nick,

The range of flags would be integers ranging from LOW_PROC .. HIGH_PROC.

The generic flags would be values less than LOW_PROC or greater than HIGH_PROC. Any value within the range LOW_PROC .. HIGH_PROC is os/platform specific.

What I was thinking was there could be a uint32_t flags() in DefinedAtom which returns the flags, and platforms could act on the meaning of the flags in their own pieces of code.

What do you think ?

You still have not given an example of what information is missing in the current Atom model that is driving the need for this.

It sounds like your flags() returns a value - not a set of bits. Which means it can only be used for one thing. What if you need two or more kinds of information/attributes not in the Atom model? I don't see why LOW_PROC, HIGH_PROC is needed. If we decide there are new kinds of information/attributes that are general we would just define new methods on Atom, rather than define a value to be returned by flags().

There are two use cases that I can think of now:

1) flags: These are used to determine what the Atom contains in addition to the content; for example, that the Atom has
   a) a follow-on reference
   b) membership in a group that other atoms are part of

The flags could be used to determine if there is a follow-on atom or if the atom is part of a group.
Both would be cheaper than iterating through the reference list to figure out whether there is a follow-on reference or whether the atom is part of a group.

I see layout constraints (follow-on) and grouping as a natural use for References. It seems like your concern is just performance. I would wait and see if searching the References for special kinds is actually a bottleneck in practice; then we can talk about ways to improve the performance.

2) Atom-specific content types

This is where LOW_PROC and HIGH_PROC come in: there are content types which are architecture specific.

Currently there are many types defined within contentType which are operating-system specific. As more environments start using lld, I feel that many architectures would want to add their own.

I've already added all the content types that darwin needs. I think it is fine for you to add any that ELF needs.

Examples for GNU support would include:

a) checksum
b) hash
c) gnu prelink library list

These are actually generic attributes that could be made into real attributes (methods) of DefinedAtom.

I've been thinking about adding something like a checksum for use in coalescing by content (for instance, coalescing duplicate C strings or other constants). For that, having a checksum would speed up comparisons.

I'm not sure if you mean a hash of the content or a hash of the name. On the name side, I've had thoughts of reworking Atom::name() to return some new abstract type like SymbolName, instead of StringRef. The idea is that SymbolName maintains a hash for the string so equality checks are fast. It can also be used to help reduce the size of the new "native" format for object files of C++ code. C++ (especially with namespaces) generates huge symbol names. A more compact format would be to factor out all the common substrings and use a dictionary coder. Thus in the native object file, each symbol name is some data stream of chars and dictionary indices.
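A minimal sketch of the SymbolName idea described above (the class shape and the djb2 hash are illustrative choices, not a real lld type):

```cpp
#include <cstddef>
#include <string>

// Pairs a symbol name with a precomputed hash so equality checks can
// reject mismatches without comparing characters.
class SymbolName {
  std::string name;
  std::size_t hash;

  static std::size_t hashOf(const std::string &s) {
    std::size_t h = 5381; // djb2; any decent string hash would do
    for (std::size_t i = 0; i != s.size(); ++i)
      h = h * 33 + static_cast<unsigned char>(s[i]);
    return h;
  }

public:
  explicit SymbolName(const std::string &s) : name(s), hash(hashOf(s)) {}

  bool operator==(const SymbolName &other) const {
    // Fast path: different hashes can never be equal strings; the full
    // character comparison only runs when the hashes collide or match.
    return hash == other.hash && name == other.name;
  }
};
```

With long mangled C++ names, most inequality checks are decided by the single hash comparison rather than a character-by-character walk.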

I'm not sure what "gnu prelink library list" has to do with individual Atoms.

-Nick