[lld] Handling a whole bunch of readers

Hi,

We have a whole bunch of readers (and we will have more), and I was wondering whether we should have a vector of Readers, with an isMyFormat function in each of them.

Any reader that knows how to handle the file goes ahead and parses it.

On a side note, we currently use the .objtxt extension to figure out whether a file is a YAML file. I have added FIXMEs in the code. Could we use some kind of magic, or a better way, to figure out whether the file is YAML?

Thanks

Shankar Easwaran

I guess in each isMyFormat(), you would check the given file’s magic using llvm::sys::fs::identify_magic(), and then check if it’s a known value for that reader. That would be repeated in each isMyFormat(), which is not very good.

I’d do that using a mapping from file magic to reader. I mean, we could call identify_magic() at some central place, look up the mapping, and then dispatch.
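A minimal sketch of the map-based dispatch described above, written as standalone C++ so it does not depend on LLVM headers. `FileKind`, `identifyKind`, `Reader`, and `ReaderRegistry` are hypothetical names for illustration; in lld the identification step would presumably be `llvm::sys::fs::identify_magic()`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>

// Hypothetical file kinds, standing in for the values a real magic
// detector would return.
enum class FileKind { Unknown, ELF, Archive };

// Stand-in for a reader; a real one would parse the buffer into atoms.
struct Reader {
  std::string name;
};

// Central identification step: inspect the leading bytes exactly once.
FileKind identifyKind(std::string_view buf) {
  if (buf.substr(0, 4) == "\x7f" "ELF") return FileKind::ELF;
  if (buf.substr(0, 8) == "!<arch>\n") return FileKind::Archive;
  return FileKind::Unknown;
}

// Maps a file kind to the reader that handles it.
class ReaderRegistry {
public:
  void add(FileKind kind, const Reader *reader) {
    assert(reader && "reader must not be null");
    readers_[kind] = reader;
  }

  // Returns the reader registered for the buffer's kind, or nullptr if
  // the kind is unknown or has no reader.
  const Reader *lookup(std::string_view buf) const {
    auto it = readers_.find(identifyKind(buf));
    return it == readers_.end() ? nullptr : it->second;
  }

private:
  std::map<FileKind, const Reader *> readers_;
};
```

With this shape no reader repeats the magic check; a caller registers each reader once and then calls `lookup()` on every input buffer.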

Not all files can be identified by magic, though. Linker scripts, for example, and currently YAML files.

You could do whatever you like to identify the file type. identify_magic() is
one way, checking the file extension is another. My point is that the
map-based approach would be simpler than the isMyFormat() approach.

On this topic, we should come up with standard file extension names. I made up .objtxt for atoms-in-yaml when writing the first test cases. We will soon need extensions for other kinds of yaml files (such as mach-o in yaml). With linker scripts we are stuck with there being no magic at the start and no standard file extension. For new yaml files that we are inventing we should define a standard file extension.

I agree with Rui that we should not be calling identify_magic() in every reader. Another approach is to just call each reader to try to parse until one succeeds. The first reader should be the native reader (e.g. ELF or mach-o), so that in the real world there is no time wasted looking at test case formats.
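The try-each-reader approach could be sketched roughly as follows. This is standalone C++ under stated assumptions: `ParsedFile`, `Reader`, `parseWithFirstMatch`, and `makeDefaultReaders` are hypothetical names, and the two toy readers only look at the first few bytes rather than doing a real parse.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result of a successful parse.
struct ParsedFile {
  std::string format;
};

// A reader either parses the buffer or declines by returning nullopt.
struct Reader {
  std::string name;
  std::function<std::optional<ParsedFile>(std::string_view)> parse;
};

// Try each reader in order and stop at the first one that succeeds.
// Callers are expected to list the native reader first so that real-world
// inputs never reach the test-only formats.
std::optional<ParsedFile>
parseWithFirstMatch(const std::vector<Reader> &readers, std::string_view buf) {
  for (const Reader &r : readers) {
    assert(r.parse && "reader has no parse function");
    if (auto file = r.parse(buf))
      return file;
  }
  return std::nullopt;
}

// Two toy readers for illustration: native (ELF) first, then YAML.
std::vector<Reader> makeDefaultReaders() {
  return {
      {"elf",
       [](std::string_view b) -> std::optional<ParsedFile> {
         if (b.substr(0, 4) == "\x7f" "ELF") return ParsedFile{"elf"};
         return std::nullopt;
       }},
      {"yaml",
       [](std::string_view b) -> std::optional<ParsedFile> {
         if (b.substr(0, 3) == "---") return ParsedFile{"yaml"};
         return std::nullopt;
       }},
  };
}
```

The ordering of the vector is the only policy here, which keeps the cost for production inputs to a single failed or successful parse attempt.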

-Nick

Isn't it better to have the YAML file start with the lines below, so that you don't need to go through file extensions?

magic :
arch:

You would also be able to figure out if the yaml file is a valid input for the flavor/target too.

Agreed. Calling each reader in turn until one succeeds, with the native reader (e.g. ELF or mach-o) first, is better.

Thanks

Shankar Easwaran

Ping?

Shankar Easwaran

I guess we will use a fixed file extension anyway (we probably don't want to
use .txt for YAML object files, for example), so what do you think is the
benefit of depending on special file magic compared to using a file extension?

I would like to support use cases like these.

(a)
$ cat simple.s
          .text
          .global _start
          .type _start,@function
_start:
          callq bar
          ret

clang simple.s -c- -o- | lld -flavor gnu -target x86_64 --output-filetype=yaml -r - | lld -flavor gnu -target x86_64 -

This is certainly not doable with file extensions, because no file is created on the filesystem.

PS: This has been snipped from an earlier discussion with Tim.

(b)

lld -flavor gnu -target x86_64 input.o --output-filetype=yaml -o atoms.objtxt (create an atom file using the x86_64 target)
lld -flavor gnu -target hexagon atoms.objtxt (use it with the hexagon target)

You can create yaml files from each flavor and pass it to the wrong flavor, too.

If we have the magic and arch in the YAML file, or a single combined triple entry (one that combines flavor, operating system and target), this would work and there would be no need to create new kinds of extensions.

Thanks

Shankar Easwaran

-- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation

This is an interesting example but I doubt the actual value of doing that,
especially because it cannot handle multiple input files. An alternative
command line would be

  lld -flavor gnu -target x86_64 <(clang simple.s -c- -o- | lld -flavor gnu -target x86_64 --output-filetype=yaml -r -)

which could handle multiple inputs, but it works only on bash and may be
too tricky.

The compiler usually depends on the file extension to distinguish file
types, and if your file has a nonstandard extension, you can explicitly
specify the language with the -x <language> option. Adding a similar option
to the linker would be an option for us too.

Even if we go with a magic, I'd like to make it a simple magic comment,
such as "#!obj" at the beginning of a file. A magic string that is itself
valid YAML, such as "magic:", is IMO less flexible and should be avoided.

YAML only allows key-value pairs. #!obj are all invalid characters for YAML files.

"#" is a line comment character in YAML, so it's valid. That's why I wrote a simple magic "comment".

Ah Sorry. Totally forgot about that.

So I talked with Shankar on IRC on this topic, and here’s a suggestion.

  1. Use a magic comment to determine whether a file is YAML. I'd propose "#!obj" as the YAML file magic because of its similarity to the Unix shebang. The YAML reader skips this first line because it's a comment line in the YAML grammar.

  2. Add a "target" field to the YAML to state which machine type the object file is for. For example, if a YAML object has "target: x86_64-linux-elf", it's decoded as an x86-64 Linux object file.

  3. Let the YAML reader interpret the "target" field to handle target-specific fields if needed.
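The three steps above can be sketched as follows, in standalone C++. This is only an illustration under stated assumptions: `firstLine`, `hasYamlMagic`, `findTarget`, and `acceptsInput` are hypothetical helpers, and the "target" lookup is a plain line scan for a top-level `target:` key rather than a real YAML parse.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>

// Returns the first line of the buffer without its line terminator.
std::string firstLine(const std::string &buf) {
  std::string line = buf.substr(0, buf.find('\n'));
  if (!line.empty() && line.back() == '\r')
    line.pop_back();
  return line;
}

// Step 1: the file is treated as YAML only if it starts with "#!obj".
bool hasYamlMagic(const std::string &buf) { return firstLine(buf) == "#!obj"; }

// Step 2: find a top-level "target:" line and return its trimmed value.
// A real reader would get this from the YAML parser instead.
std::optional<std::string> findTarget(const std::string &buf) {
  const std::string key = "target:";
  const char *ws = " \t\r";
  std::istringstream in(buf);
  std::string line;
  while (std::getline(in, line)) {
    if (line.compare(0, key.size(), key) != 0)
      continue;
    std::string value = line.substr(key.size());
    std::size_t begin = value.find_first_not_of(ws);
    if (begin == std::string::npos)
      return std::nullopt;
    std::size_t end = value.find_last_not_of(ws);
    return value.substr(begin, end - begin + 1);
  }
  return std::nullopt;
}

// Step 3: accept the input only if its target matches the linker's target.
bool acceptsInput(const std::string &buf, const std::string &linkerTarget) {
  assert(!linkerTarget.empty() && "linker target must be set");
  if (!hasYamlMagic(buf))
    return false;
  std::optional<std::string> target = findTarget(buf);
  return target && *target == linkerTarget;
}
```

A check like `acceptsInput` is what would reject an atom file produced for x86_64 when it is passed to a hexagon link.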

I like this suggestion proposed by Ruiu. It makes the Reader flexible and can use a common infrastructure to verify files too.

Thanks

Shankar Easwaran

There are a handful of common linker script extensions. Checking whether the
name contains `.lds` or `.ldscript` would probably cover most of the use
cases. On the other hand, GNU ld's behavior is to assume that anything that
isn't recognized as a binary object file (which is presumably identified by
magic) is a linker script, and we probably want to emulate that behavior
(certainly we don't want to assume that something passed to a GNU ld driver
is YAML, unless explicitly told so via a flag not present in GNU ld).
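A rough sketch of that fallback rule, in standalone C++. The names `InputKind`, `looksLikeBinaryObject`, and `classifyForGnuDriver` are hypothetical, `allowYaml` models a hypothetical opt-in flag that GNU ld does not have, and the "#!obj" check assumes the magic comment proposed earlier in the thread.

```cpp
#include <cassert>
#include <string_view>

enum class InputKind { Object, LinkerScript, Yaml };

// Hypothetical stand-in for magic-based detection of binary inputs.
bool looksLikeBinaryObject(std::string_view buf) {
  return buf.substr(0, 4) == "\x7f" "ELF" || buf.substr(0, 8) == "!<arch>\n";
}

// Mimics the GNU ld rule: anything that is not a recognized binary is a
// linker script. YAML is considered only when explicitly enabled.
InputKind classifyForGnuDriver(std::string_view buf, bool allowYaml) {
  if (looksLikeBinaryObject(buf))
    return InputKind::Object;
  if (allowYaml && buf.substr(0, 5) == "#!obj")
    return InputKind::Yaml;
  return InputKind::LinkerScript;
}

// Human-readable name, e.g. for diagnostics.
const char *kindName(InputKind kind) {
  switch (kind) {
  case InputKind::Object: return "object";
  case InputKind::LinkerScript: return "linker-script";
  case InputKind::Yaml: return "yaml";
  }
  assert(false && "unhandled InputKind");
  return "unknown";
}
```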

-- Sean Silva

So apparently I didn't reply all when I suggested this.

For determining which YAML reader to use, we should use YAML tags. This
allows multiple different types of input files in a single YAML stream.

!archive
<blah>
---
!ELF
<blah>
---
!atoms
<blah>
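As an illustration of how a driver might pick a reader per document, here is a minimal standalone C++ sketch that splits a stream on "---" lines and extracts the leading tag of each document. `documentTags` is a hypothetical helper, and the line-based scan is an assumption made for brevity; a real implementation would ask the YAML parser for each document's tag.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Returns the leading tag (without the '!') of each document in the stream.
// A document whose first non-blank line is not a tag yields an empty string;
// a document with no content at all is skipped.
std::vector<std::string> documentTags(const std::string &stream) {
  std::vector<std::string> tags;
  std::istringstream in(stream);
  std::string line;
  bool awaitingFirstLine = true; // true until the current document has content
  while (std::getline(in, line)) {
    if (line == "---") { // document separator: start looking for a tag again
      awaitingFirstLine = true;
      continue;
    }
    if (!awaitingFirstLine)
      continue;
    if (line.find_first_not_of(" \t\r") == std::string::npos)
      continue; // skip blank lines before the first content line
    assert(!line.empty());
    awaitingFirstLine = false;
    if (line[0] != '!') {
      tags.emplace_back();
      continue;
    }
    std::size_t end = line.find_first_of(" \t\r", 1);
    tags.push_back(
        line.substr(1, end == std::string::npos ? std::string::npos : end - 1));
  }
  return tags;
}
```

Each returned tag could then be looked up in a tag-to-reader table, so an archive, an ELF description, and an atoms file can share one stream.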

For differentiating between linker scripts and YAML, I agree that some
form of comment magic is best.

Since our YAML stuff is all internal anyway, wouldn't it be simpler to just
hardcode the limited set of extensions we use for YAML files, and only do
that with non-emulated drivers unless explicitly asked to do so?

-- Sean Silva

Agreed on using YAML tags to determine which YAML reader to use.

I think <blah> in the tagged documents above has to be a key-value pair, which could be represented as target: <triple>.

For the comment magic that differentiates linker scripts from YAML, how about #!lld, since lld would be interpreting the file?

Hardcoding the limited set of extensions we use for YAML files would be difficult to maintain, and we already have YAML available as an external output file format (--output-filetype=yaml).

On a side note, all of the readers would have a validation function that checks the architecture, which makes extensions highly unmanageable.

Thanks

Shankar Easwaran

Do we actually have users that rely on --output-filetype=yaml? Is the YAML
format stable enough to be exposed to users? It seems unwise to expose what
is effectively a debug/testing format to users.

-- Sean Silva