RFC: XRay in the LLVM Library

Hi llvm-dev,

Recently, we've committed the beginnings of the llvm-xray [0] tool, which allows for conveniently working with both XRay-instrumented binaries and XRay trace/log files. In the course of the review for the conversion tool [1], which turns a binary/raw XRay log file into YAML for human consumption, a question arose as to how we intend to allow users to develop tools that deal with XRay traces (and the instrumentation maps in binaries).

As a bit of background, I've been working on the "flight data recorder" (FDR) mode [2] for the XRay runtime library -- this mode lets the XRay-instrumented binary continuously write trace entries into an in-memory log, which is kept as a circular buffer of buffers [3]. FDR mode writes more concise records and has a different log format than the current "naive" logging implementation in compiler-rt (which continuously writes to disk as soon as thread-local buffers are full).
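To make the "circular buffer of buffers" idea concrete, here's a minimal sketch of the mechanism. The names (`Buffer`, `BufferRing`) are illustrative only; the actual buffer queue implementation is the one under review in [3].

```cpp
// Sketch of a ring of fixed-size buffers: threads acquire a buffer, fill it
// with log records, and once the ring wraps around the oldest buffer is
// recycled -- so the log always holds the most recent activity.
#include <cstddef>
#include <mutex>
#include <vector>

struct Buffer {
  std::vector<char> Data; // raw log records land here
  size_t Used = 0;        // bytes written so far
};

class BufferRing {
  std::vector<Buffer> Buffers;
  size_t Next = 0; // index of the next buffer to hand out
  std::mutex Mu;

public:
  BufferRing(size_t Count, size_t Size) : Buffers(Count) {
    for (Buffer &B : Buffers)
      B.Data.resize(Size);
  }

  // Hand out the next buffer; recycling on wrap-around drops the oldest
  // records, which is exactly the "flight data recorder" property.
  Buffer &acquire() {
    std::lock_guard<std::mutex> Lock(Mu);
    Buffer &B = Buffers[Next];
    Next = (Next + 1) % Buffers.size();
    B.Used = 0;
    return B;
  }
};
```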

# Problem Statement

XRay has two key pieces of information that need to be encoded in a consistent manner: the instrumentation map embedded in binaries, and the XRay log files. However, we run into some issues when we change the encoding of this information over time, whether adding or removing information. This situation is very similar to how LLVM handles backwards compatibility with the bitcode format / versioning.

The problem we have is how to ensure that, as we make changes to the data being output by the runtime library, the tools handling this data are still able to read it. A lot of factors play into this, and they may be solved in many different ways (but are not the crux of this RFC):

- The split between the LLVM "core" library/tools and compiler-rt. This means we implement the writer in compiler-rt but implement the tools reading the traces in LLVM. We also have to coordinate any changes in LLVM for encoding new information into the instrumentation map, so that compiler-rt can take advantage of this new information.

- The potential for allowing user-defined additional information to be embedded in the XRay traces. We have ongoing projects that will add things like argument logging and custom data logging, which will add information to the log without necessarily changing the "format" of the data.

# Potential Resolutions

Given where we are in XRay's development, we're looking at a few ways of approaching the backwards/forwards compatibility of the instrumentation map and the XRay log files, and of the tools that will be written to read/manipulate them. We're seeking feedback on the following options, as well as alternatives we may not have considered.

## Option A: Expose a library that supports all known formats.

We can move some of the currently tool-specific code for `llvm-xray extract` [0], which is able to ingest a binary with XRay instrumentation, into (strawman proposal) lib/XRay (i.e. headers in include/llvm/XRay/... and implementation in lib/XRay/...), so that the tools become thin wrappers around the functionality in this library. We can apply this to the `llvm-xray convert` core logic as well, to allow for loading all known/supported formats of the log file.
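As a very rough strawman of what the surface of such a library might look like (none of these declarations exist today -- they are purely illustrative, not a committed interface):

```cpp
// Hypothetical include/llvm/XRay/XRay.h -- a strawman only.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"

namespace llvm {
namespace xray {

// Parsed form of the instrumentation map embedded in a binary: one entry
// per instrumentation point (sled) that the compiler emitted.
class InstrumentationMap { /* sled entries, function id mappings, ... */ };

// Loads the instrumentation map from an XRay-instrumented binary,
// independent of which object format/architecture produced it.
Expected<InstrumentationMap> loadInstrumentationMap(StringRef Filename);

// In-memory representation of a trace, regardless of on-disk encoding.
class Trace { /* a sequence of trace records */ };

// Loads a trace in any known/supported format (naive binary, FDR, YAML).
Expected<Trace> loadTrace(StringRef Filename);

} // namespace xray
} // namespace llvm
```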

This option gives us the capability to provide a set of canonical implementations that can handle a set of file formats. It might also introduce some complexity, since the library would have to parse lots of known/supported formats (like YAML, and compiler-emitted instrumentation maps for x86_64/armv7/aarch64/<insert platforms where XRay is yet to be ported>) that not all tool writers actually need.

This option follows closely what the LLVM project does with backwards compatibility for parsing LLVM IR, applied to XRay instrumentation maps and traces.

## Option B: Expose a library that only supports one canonical format.

We can keep tool-specific code alongside the tools, but define one canonical format for the instrumentation map and traces -- as a specification document and a library implementation. This canonical format could be what we already have today, which would keep the log reading and instrumentation map handling library simple; the library would evolve only when we extend/change the canonical format.

In the case of FDR mode traces, this means the conversion tool will know about the FDR mode trace format/encoding and will transform it into the canonical format. The transformation logic will thus be localised to the conversion tool, while any other tool that builds upon and uses the reader library will not need to change. This also provides options for users defining their own log formats: using the XRay library interfaces, they can install their own handlers in the tool to transform their format into the XRay-canonical format, without being tied to maintaining the released library version (see the sketch below).
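A sketch of what the "install your own handler" idea could look like, with purely illustrative names -- the point being that the reader library only ever hands tools the canonical representation:

```cpp
// Hypothetical converter registry: each handler turns one foreign trace
// encoding into the single canonical format the reader library understands.
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct CanonicalTrace { /* records in the canonical format (elided) */ };

using FormatConverter =
    std::function<CanonicalTrace(const std::string &Filename)>;

class ConverterRegistry {
  std::map<uint16_t, FormatConverter> Converters; // keyed by format id

public:
  // Users with their own log formats install a handler for their format id.
  void install(uint16_t FormatId, FormatConverter C) {
    Converters[FormatId] = std::move(C);
  }

  CanonicalTrace load(uint16_t FormatId, const std::string &Filename) {
    auto It = Converters.find(FormatId);
    if (It == Converters.end())
      throw std::runtime_error("no converter installed for this format");
    return It->second(Filename);
  }
};
```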

The evolution of the canonical format can then happen more slowly and more conservatively than the rate at which new implementations of the XRay runtime are made available through compiler-rt.

# Open Questions

Some burning questions we'd like to get some thoughts on:

- Is there a preference between the two options provided above?
- Any other alternatives we should consider?
- Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?

Thanks in advance!

[0] - `llvm-xray extract` defined in https://reviews.llvm.org/D21987
[1] - `llvm-xray convert` being reviewed in https://reviews.llvm.org/D24376
[2] - FDR mode ongoing implementation (work in progress) at https://reviews.llvm.org/D27038
[3] - Buffer Queue implementation (work in progress) at https://reviews.llvm.org/D26232

-- Dean

- Is there a preference between the two options provided above?
- Any other alternatives we should consider?
- Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?

Hi Dean,

I haven't followed the XRay project that closely, but I have been
around file formats being formed and either of your two approaches
(which are pretty standard) will fail in different ways. But that's
ok, because the "fixes" work, they're just not great.

If you take the LLVM IR, there were lots of changes, but we always
aimed to have one canonical representation. Not just at the syntax of
each instruction/construct, but how to represent complex behaviour in
the same series of instructions, so that all back-ends can identify
and work with it. Of course, the second (semantic) level is less
stringent than the first (syntactical), but we try to make it as
strict as possible.

This hasn't come for free. The two main costs were destructive
semantics, for example when we lower C++ classes into arrays and
change all the access to jumbled reads and writes because IR readers
don't need to understand the ABI of all targets, and backwards
incompatibility, for example when we completely changed how exception
handling is lowered (from special basic blocks to special constructs
as heads/tails of common basic blocks). That price was cheaper than
the alternative, but it's still not free.

Another approach I followed was SwissProt [1], a manually curated
machine readable text file with protein information for cross
referencing. Cutting to the chase, they introduced "line types"
with strict formatting for the most common information, and one line
type called "comment" where free text was allowed, for additional
information. With time, adding a new line type became impossible, so
all new fields ended up being added in the comment lines, with a
pseudo-strict formatting, which was (probably still is) a nightmare
for parsers and humans alike.

Between the two, the LLVM IR policy for changes is orders of magnitude
better. I suggest you follow that.

I also suggest you don't keep multiple canonical representations, and
create tools to convert from any other to the canonical format.

Finally, I'd separate the design in two phases:

1. Experimental, where the canonical form changes constantly in light
of new input and there are no backwards/forwards compatibility
guarantees at all. This is where all of you get creative and try to
sort out the problems in the best way possible.
2. Stable, when most of the problems have been solved, and you now document
a final stable version of the representation. Every new input will
have to be represented as a combination of existing ones, so make them
generic enough. If real change is needed, make sure you have a process
that identifies versions and compatibility (for example, having a
version tag on every dump), and lets the canonical tool know about all of
the issues.

This last point is important if you want to continue reading old files
that don't have the compatibility issue, warn when they do but it's
irrelevant, or error when they do and it'll produce garbage. You can
also write more efficient converting tools.

From what I understood of this XRay, you could in theory keep the data
for years in a tape somewhere in the attic, and want to read it later
to compare to a current run, so being compatible is important, but
having a canonical form that can be converted to and from other forms
is more important, or the comparison tools will get really messy
really quickly.

Hope that helps,

cheers,
--renato

[1] http://web.expasy.org/docs/swiss-prot_guideline.html

Hi Dean,

I haven’t looked very closely at XRay so far, but I’m wondering if making CTF (common trace format, e.g. see http://diamon.org/ctf/) the default format for XRay traces would be useful?
It seems it’d be nice to be able to reuse some of the tools that already exist for CTF, such as a graphical viewer (http://tracecompass.org/) or a converter library (http://man7.org/linux/man-pages/man1/babeltrace.1.html).
LTTng already uses this format and linux perf can create traces in CTF format too. Probably it would be useful for at least some users to be able to combine traces from XRay with traces from LTTng or linux perf?

The current version of CTF may not have all the features that you need, but the next version of CTF (CTF 2) seems to be at least addressing some of the concerns you touch on below: http://diamon.org/ctf/files/CTF2-PROP-1.0.html#design-goals.

Any thoughts on whether CTF could be a good choice as the format to store XRay logs in?

Thanks,

Kristof

- Is there a preference between the two options provided above?
- Any other alternatives we should consider?
- Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?

Hi Dean,

I haven't followed the XRay project that closely, but I have been
around file formats being formed and either of your two approaches
(which are pretty standard) will fail in different ways. But that's
ok, because the "fixes" work, they're just not great.

If you take the LLVM IR, there were lots of changes, but we always
aimed to have one canonical representation. Not just at the syntax of
each instruction/construct, but how to represent complex behaviour in
the same series of instructions, so that all back-ends can identify
and work with it. Of course, the second (semantic) level is less
stringent than the first (syntactical), but we try to make it as
strict as possible.

This hasn't come for free. The two main costs were destructive
semantics, for example when we lower C++ classes into arrays and
change all the access to jumbled reads and writes because IR readers
don't need to understand the ABI of all targets, and backwards
incompatibility, for example when we completely changed how exception
handling is lowered (from special basic blocks to special constructs
as heads/tails of common basic blocks). That price was cheaper than
the alternative, but it's still not free.

Another approach I followed was SwissProt [1], a manually curated
machine readable text file with protein information for cross
referencing. Cutting to the chase, they introduced "line types"
with strict formatting for the most common information, and one line
type called "comment" where free text was allowed, for additional
information. With time, adding a new line type became impossible, so
all new fields ended up being added in the comment lines, with a
pseudo-strict formatting, which was (probably still is) a nightmare
for parsers and humans alike.

Between the two, the LLVM IR policy for changes is orders of magnitude
better. I suggest you follow that.

I also suggest you don't keep multiple canonical representations, and
create tools to convert from any other to the canonical format.

Thanks Renato! Just so I understand this one sentence (to disambiguate), you meant:

1) Don't have multiple canonical forms, just have one.
2) Create tools that will convert to/from that one canonical format.

I think this follows closely the Option B mental model that I had, with the only difference being the canonical reader is a library made part of LLVM "when it's ready", as you suggest later. Would that be accurate?

Finally, I'd separate the design in two phases:

1. Experimental, where the canonical form changes constantly in light
of new input and there are no backwards/forwards compatibility
guarantees at all. This is where all of you get creative and try to
sort out the problems in the best way possible.
2. Stable, when most of the problems have been solved, and you now document
a final stable version of the representation. Every new input will
have to be represented as a combination of existing ones, so make them
generic enough. If real change is needed, make sure you have a process
that identifies versions and compatibility (for example, having a
version tag on every dump), and lets the canonical tool know about all of
the issues.

This last point is important if you want to continue reading old files
that don't have the compatibility issue, warn when they do but it's
irrelevant, or error when they do and it'll produce garbage. You can
also write more efficient converting tools.

I like this suggestion -- thanks!

So in essence we can treat the current implementation as experimental, and make that abundantly clear in any point release where XRay functionality is included. Is there a place where this ought to be clearly documented (aside from the documentation at http://llvm.org/docs/XRay.html)?

XRay trace file headers already contain a version identifier, intended to precisely identify how a reader would interpret the data in there.
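For illustration, here's the kind of version check a reader can then do -- the field layout below is made up for the example and is not the actual on-disk header:

```cpp
// Illustrative trace file header and version check; the real XRay header
// layout may differ.
#include <cstdint>
#include <stdexcept>

struct TraceFileHeader {
  uint16_t Version; // bumped whenever the record encoding changes
  uint16_t Type;    // which log implementation wrote this (e.g. naive, FDR)
  // ... flags, cycle counter frequency, padding ...
};

constexpr uint16_t MaxSupportedVersion = 1;

void checkReadable(const TraceFileHeader &H) {
  if (H.Version > MaxSupportedVersion)
    throw std::runtime_error(
        "trace written by a newer runtime; upgrade the reader or convert");
}
```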

From what I understood of this XRay, you could in theory keep the data
for years in a tape somewhere in the attic, and want to read it later
to compare to a current run, so being compatible is important, but
having a canonical form that can be converted to and from other forms
is more important, or the comparison tools will get really messy
really quickly.

Yep, this is definitely one of the goals, which is why we're being very careful about what we write down in the traces, optimising for efficient writing and smaller traces at the cost of potential complexity in the analysis tooling.

Hope that helps,

Definitely does, thanks again!

Cheers

-- Dean

Hi Dean,

I haven't looked very closely at XRay so far, but I'm wondering if making CTF (common trace format, e.g. see http://diamon.org/ctf/) the default format for XRay traces would be useful?

Nice! Thanks for mentioning this, I've not had a look at this before.

There are a couple of issues I can think of off the top of my head as to why using that as the default format for XRay may be slightly problematic. More on this below.

It seems it'd be nice to be able to reuse some of the tools that already exist for CTF, such as a graphical viewer (http://tracecompass.org/) or a converter library (http://man7.org/linux/man-pages/man1/babeltrace.1.html).
LTTng already uses this format and linux perf can create traces in CTF format too. Probably it would be useful for at least some users to be able to combine traces from XRay with traces from LTTng or linux perf?

This sounds like a great idea!

I'm working on a conversion tool that aims to target multiple output formats. It's being developed at https://reviews.llvm.org/D24376 where the intent is to start with something simple, but then grow support for multiple other formats. CTF sounds like a perfectly reasonable target format.

Writing CTF directly, though, might be slightly problematic for XRay because of the potential complexity it would bring into the runtime library. While conceptually the formats are very similar (XRay uses binary logging and efficient in-memory structures to save on both the space and the time required to write records down), we'd like the XRay library to make some more optimisations and evolve in a certain direction without being tied down to one particular format.

I'll need to think about this a little more, but I definitely think converting from whatever XRay format we come up with to CTF sounds like a great feature for the conversion tool.

The current version of CTF may not have all the features that you need, but the next version of CTF (CTF 2) seems to be at least addressing some of the concerns you touch on below: http://diamon.org/ctf/files/CTF2-PROP-1.0.html#design-goals.

Any thoughts on whether CTF could be a good choice as the format to store XRay logs in?

I may need to think about it more, but I don't see a reason not to support converting XRay traces to CTF. :)

Cheers

-- Dean

Thanks Renato! Just so I understand this one sentence (to disambiguate), you meant:

1) Don't have multiple canonical forms, just have one.
2) Create tools that will convert to/from that one canonical format.

Yup.

I think this follows closely the Option B mental model that I had, with the only difference being the canonical reader is a library made part of LLVM "when it's ready", as you suggest later. Would that be accurate?

Yup. :)

So in essence we can treat the current implementation as experimental, and make that abundantly clear in any point release where XRay functionality is included. Is there a place where this ought to be clearly documented (aside from the documentation at http://llvm.org/docs/XRay.html)?

I'd add this to the release change log, too. And whenever you decide
to make it stable, call it "1.0" or something and again, mention in
the change log that, from now on, it won't change unless it has to.

XRay trace file headers already contain a version identifier, intended to precisely identify how a reader would interpret the data in there.

Perfect!

cheers,
--renato

- Is there a preference between the two options provided above?
- Any other alternatives we should consider?
- Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?

Hi Dean,

I haven’t followed the XRay project that closely, but I have been
around file formats being formed and either of your two approaches
(which are pretty standard) will fail in different ways. But that’s
ok, because the “fixes” work, they’re just not great.

If you take the LLVM IR, there were lots of changes, but we always
aimed to have one canonical representation. Not just at the syntax of
each instruction/construct, but how to represent complex behaviour in
the same series of instructions, so that all back-ends can identify
and work with it. Of course, the second (semantic) level is less
stringent than the first (syntactical), but we try to make it as
strict as possible.

This hasn’t come for free. The two main costs were destructive
semantics, for example when we lower C++ classes into arrays and
change all the access to jumbled reads and writes because IR readers
don’t need to understand the ABI of all targets, and backwards
incompatibility, for example when we completely changed how exception
handling is lowered (from special basic blocks to special constructs
as heads/tails of common basic blocks). That price was cheaper than
the alternative, but it’s still not free.

Another approach I followed was SwissProt [1], a manually curated
machine readable text file with protein information for cross
referencing. Cutting to the chase, they introduced “line types”
with strict formatting for the most common information, and one line
type called “comment” where free text was allowed, for additional
information. With time, adding a new line type became impossible, so
all new fields ended up being added in the comment lines, with a
pseudo-strict formatting, which was (probably still is) a nightmare
for parsers and humans alike.

Between the two, the LLVM IR policy for changes is orders of magnitude
better. I suggest you follow that.

I also suggest you don’t keep multiple canonical representations, and
create tools to convert from any other to the canonical format.

Finally, I’d separate the design in two phases:

  1. Experimental, where the canonical form changes constantly in light
    of new input and there are no backwards/forwards compatibility
    guarantees at all. This is where all of you get creative and try to
    sort out the problems in the best way possible.
  2. Stable, when most of the problems have been solved, and you now document
    a final stable version of the representation. Every new input will
    have to be represented as a combination of existing ones, so make them
    generic enough. If real change is needed, make sure you have a process
    that identifies versions and compatibility (for example, having a
    version tag on every dump), and lets the canonical tool know about all of
    the issues.

This last point is important if you want to continue reading old files
that don’t have the compatibility issue, warn when they do but it’s
irrelevant, or error when they do and it’ll produce garbage. You can
also write more efficient converting tools.

From what I understood of this XRay, you could in theory keep the data
for years in a tape somewhere in the attic, and want to read it later
to compare to a current run, so being compatible is important, but
having a canonical form that can be converted to and from other forms
is more important, or the comparison tools will get really messy
really quickly.

Not sure I quite follow here - perhaps some misunderstanding.

My mental model here is that the formats are semantically equivalent - with a common in-memory representation (like LLVM IR APIs). It doesn’t/shouldn’t complicate a comparison tool to support both LLVM IR and bitcode input (or some other hypothetical formats that are semantically equivalent that we could integrate into a common reading API). At least that’s my mental model.

Is there something different here?

What I’m picturing is that we need an API for reading all these formats. Either we use that API only in the conversion tool - and users then have to run the conversion tool before running the tool they want - or we sink that API into a common place and have all tools use it to load inputs, making the user experience simpler (they don’t have to run an extra conversion step/tool); it doesn’t seem like that should make the development experience more complicated/messy/difficult.

- Dave

Not sure I quite follow here - perhaps some misunderstanding.

My mental model here is that the formats are semantically equivalent - with a common in-memory representation (like LLVM IR APIs). It doesn't/shouldn't complicate a comparison tool to support both LLVM IR and bitcode input (or some other hypothetical formats that are semantically equivalent that we could integrate into a common reading API). At least that's my mental model.

I think you mean 'conversion' instead of 'comparison', but having said that, we cannot assume that semantic equivalence implies "cheapness". At least in FDR mode, the data in the file will be laid out as one fixed-size chunk per thread's log. These chunks may be interleaved with each other, forming something like the following:

[ File Header ] [ <thread 1 buffer>, <thread 2 buffer>, <thread 1 buffer>, ... ]

While this can be converted to the current "naive" format:

[ File Header ] [ <record>, <record>, <record>, ... ]

N.B. Where <record> is a self-contained tuple of (tsc, cpu id, thread id, record type, function id, padding).
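As a packed-struct sketch of that tuple (field widths and ordering here are illustrative, not the exact on-disk encoding):

```cpp
// Sketch of a fixed-size naive-mode record; sizes/order are illustrative.
#include <cstdint>

struct NaiveRecord {
  uint16_t RecordType; // e.g. function entry vs. function exit
  uint8_t CPUId;       // CPU the record was written from
  uint8_t Reserved;
  uint32_t FuncId;     // index into the instrumentation map
  uint64_t TSC;        // full timestamp counter value
  uint32_t ThreadId;
  char Padding[12];    // keep every record the same fixed size
};
static_assert(sizeof(NaiveRecord) == 32, "records are fixed-size");
```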

The process of doing so will be very expensive -- i.e. we'll have to denormalise the records per thread-buffer, expand out the TSCs, potentially load the whole FDR trace in memory, have multiple passes, etc. While we can certainly make that part be implemented as a library so that we can "support" this alternate format/representation, I'm not sure we want users using the library to pay for this cost in terms of storage and processing time if all they really want is to deal with an XRay trace.

The proposal is to keep the complexity involved with converting the FDR log format into the naive log format (both are binary; either one can have YAML analogues) in the conversion tool, and to only really support one canonical format (the naive one, which could be either YAML or binary) in the library that deals with this format.

Is there something different here?

What I'm picturing is that we need an API for reading all these formats. Either we use that API only in the conversion tool - and users then have to run the conversion tool before running the tool they want - or we sink that API into a common place and have all tools use it to load inputs, making the user experience simpler (they don't have to run an extra conversion step/tool); it doesn't seem like that should make the development experience more complicated/messy/difficult.

I think having the complexity of conversion be localised in the tool may be better than consolidating that API into something that others might be able to use outside of the tools. For instance, if we're talking about converting XRay traces to other supported formats (like CTF, the Chrome Trace Viewer format, or <insert something else>), then I suspect we want to keep that in the conversion tool's implementation rather than making those routines part of the distributed XRay library. And if a user wanted to be able to read XRay traces in their application, they should just have to support the canonical format, with the conversion happening externally to keep the costs low.

The trade-off I'm thinking of is in the support burden, not only for the development of the tools but also for the exposed library that defines what the supported formats of the XRay trace files look like. I suspect that iterating on the conversion tool (gaining support for multiple formats there) while keeping the log reading library simple as a released library in LLVM strikes the right balance: we avoid having to support too many formats in the API/library, while still supporting many formats in the conversion tool.

I'd think of the analogy here as the conversion tool being like clang, which supports more than one programming language as input but uses a canonical LLVM IR representation (in-memory, or written out). While LLVM handles backwards compatibility of the LLVM IR, it doesn't have to worry about clang supporting a new programming language.

Does that make sense?

-- Dean

That's how I understood it.

Multiple languages, with potentially different semantics (like C and
Fortran), not different representations of the same semantics (like
textual and binary IR).

While it's possible to convert both C and Fortran to IR, that's a
complicated design cost that we take on because we have to. Where we can,
we decided not to pay that cost, i.e. we have a single IR semantic
model, and we guarantee that its multiple representations are
consistent and unique.

The translation of multiple formats (languages) should really be
separated in different front-ends to the underlying model engine,
which should only deal with a single, broad and well defined
representation.

cheers,
--renato

Sorry for coming into this thread late.

I can see a few uses for different formats, but I’m not quite convinced of the usefulness of a universal exchange library. That said, if Dean really wants to implement a way of converting between all of these things, I’m not going to stop him. I’d probably suggest just dumping some formats and using some sort of human-readable format for input as a way of testing, but that’s just me.

-eric