[PGO] Thoughts on adding a key-value store to profile data formats

Hi all,

I'd like to get your thoughts on possibly adding a generic key-value store to the profile data formats for 'metadata'. Some potential use cases:

I. Profile Features

The most basic use could be as a central repository for internal bits of housekeeping information about the profile data. For example, to differentiate between FE and IR instrumentation:

  llvm.instrumentation_source: "IR"

A key-value store would make it simple to add new bits of information and help keep everything human-readable for the text-based test formats. This could potentially also help with error checking at the llvm-profdata level if the Reader classes exposed it.

II. Profile Context

Basic (lightweight) information about the profile could be automatically gathered at profile time. The idea would be to automatically label profiles with contextual information so that the age/origin of a profile could be inspected using the llvm-profdata tool.

  $ llvm-profdata show -metadata foo.profdata
  llvm.profile_start_time: "2016-01-08T23:41:56.755Z"
  llvm.profile_duration: 5.102s
  llvm.exe_time: "2016-01-08T23:35:56.745Z"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

Other possibilities: executable path, command line arguments, system info (uname)

III. Custom Content

The key-value store itself could be exposed to developers via the llvm-profdata tool. This would allow for users to associate arbitrary custom data with a profile, as well as inspect it:

  $ llvm-profdata merge -metadata=customkey,value1 foo.profraw -o foo.profdata
  $ llvm-profdata show -metadata foo.profdata
  customkey: "value1"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

Developers could add as much custom context as they find valuable:

  $ llvm-profdata merge -metadata="mysoft.version,${SOFTWARE_VERSION} (${BUILD_NUMBER})" -metadata="mysoft.exe_md5,`md5 -q foo.exe`" foo.profraw -o foo.profdata
  $ llvm-profdata show -metadata foo.profdata
  mysoft.version: "0.1.0"
  mysoft.exe_md5: "337b5c5bc29cbdca090a1921a58465d6"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

Other information that might be interesting: git/svn revision, workload description, system info (uname -a)

This would be a way to embed almost any platform-specific or heavy-weight data without requiring the addition of platform-specific code in compiler-rt and without impacting other developers.

When profiles are merged it might be simplest to keep all input metadata (machine-readable things such as feature bits might need to be handled differently):

  $ llvm-profdata merge -weighted-input=3,foo.profdata bar.profdata -o foobar.profdata
  $ llvm-profdata show -metadata foobar.profdata
  foo.profdata
    llvm.profile_weight: 3
    llvm.profile_start_time: "2016-01-08T23:41:56.755Z"
    llvm.profile_duration: 5.102s
    llvm.exe_time: "2016-01-08T23:35:56.745Z"
    customkey: "value1"
  bar.profdata
    llvm.profile_weight: 1
    llvm.profile_start_time: "2016-01-15T00:08:41.168Z"
    llvm.profile_duration: "1.001s"
    llvm.exe_time: "2016-01-15T00:08:13.000Z"
    customkey: "value2"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296
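
The "keep all input metadata" merge policy above could be sketched roughly as follows, in Python. The names `merge_metadata` and `render` and the dict-per-input layout are assumptions made for illustration; this is not how llvm-profdata is implemented.

```python
def merge_metadata(inputs):
    """Keep every input's metadata, grouped under the input's file name.

    `inputs` is a list of (filename, weight, metadata_dict) tuples.
    The llvm.profile_weight key mirrors the example output above.
    """
    merged = {}
    for filename, weight, metadata in inputs:
        entry = {"llvm.profile_weight": weight}
        entry.update(metadata)  # nothing from the input is dropped
        merged[filename] = entry
    return merged


def render(merged):
    """Format the merged metadata one input per group, keys indented."""
    lines = []
    for filename, entry in merged.items():
        lines.append(filename)
        for key, value in entry.items():
            shown = f'"{value}"' if isinstance(value, str) else value
            lines.append(f"  {key}: {shown}")
    return "\n".join(lines)
```

Grouping by input file avoids having to resolve conflicts between keys such as `customkey` that appear in more than one input.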

In terms of implementation, the metadata could live as a separate contiguous section in the binary profile formats. It might make sense to encode it in something like YAML so that it could also be directly embedded in the various text formats.
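
As a rough illustration of a separate contiguous section, here is a Python sketch that stores the metadata as text behind a length prefix so that a reader can skip it without parsing it. The layout (an 8-byte little-endian length followed by UTF-8 `key: "value"` lines) is purely an assumption for this example and is not part of any existing profile format; values containing quotes or newlines would need real YAML escaping.

```python
import struct


def encode_section(metadata):
    """Serialize metadata as a length-prefixed block of text lines."""
    # One key: "value" line per entry; a YAML-like subset chosen for illustration.
    text = "".join(f'{key}: "{value}"\n' for key, value in metadata.items())
    payload = text.encode("utf-8")
    # Hypothetical layout: 8-byte little-endian length, then the payload.
    return struct.pack("<Q", len(payload)) + payload


def decode_section(blob, offset=0):
    """Return (metadata, offset_of_next_section)."""
    (length,) = struct.unpack_from("<Q", blob, offset)
    start = offset + 8
    text = blob[start:start + length].decode("utf-8")
    metadata = {}
    for line in text.splitlines():
        key, _, value = line.partition(": ")
        metadata[key] = value.strip('"')
    return metadata, start + length
```

Because the payload is plain text, the same lines could be embedded unchanged in the text formats.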

Tagging profile data with such information is generally useful. My thoughts are

1) Such information probably does not need to be stored in the raw-format
profile data, so no runtime changes are needed; only llvm-profdata
and the indexed format need to be enhanced to support this.
2) A more general way is to just add an option,
--embed_label=<customized_label>, where the label is a string that can hold
key/value pairs encoded in the user's favorite format. The format of the
key-value pairs is not specified and remains opaque to the Instr/Sample
profilers.
3) Labels from multiple source profiles will be merged when the merge
command is used.

Hi all,

I'd like to get your thoughts on possibly adding a generic key-value store
to the profile data formats for 'metadata'. Some potential use cases:

I. Profile Features

The most basic use could be as a central repository for internal bits of
housekeeping information about the profile data. For example, to
differentiate between FE and IR instrumentation:

  llvm.instrumentation_source: "IR"

A key-value store would make it simple to add new bits of information and
help keep everything human-readable for the text-based test formats. This
could potentially also help with error checking at the llvm-profdata level
if the Reader classes exposed it.

This is ok to have, but I don't think the reader class should rely on
metadata to make decisions (as metadata can be thrown away without
affecting correctness). A formal approach, such as the one proposed (to
encode it in variant bits of the version field), should be used.

II. Profile Context

Basic (lightweight) information about the profile could be automatically
gathered at profile time. The idea would be to automatically label profiles
with contextual information so that the age/origin of a profile could be
inspected using the llvm-profdata tool.

  $ llvm-profdata show -metadata foo.profdata
  llvm.profile_start_time: "2016-01-08T23:41:56.755Z"
  llvm.profile_duration: 5.102s
  llvm.exe_time: "2016-01-08T23:35:56.745Z"

Other examples include options and workload used in the training run.

  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

Other possibilities: executable path, command line arguments, system info
(uname)

yes.

III. Custom Content

The key-value store itself could be exposed to developers via the
llvm-profdata tool. This would allow for users to associate arbitrary custom
data with a profile, as well as inspect it:

  $ llvm-profdata merge -metadata=customkey,value1 foo.profraw -o foo.profdata
  $ llvm-profdata show -metadata foo.profdata
  customkey: "value1"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

Developers could add as much custom context as they find valuable:

I think all meta data should be custom defined -- the profile reader
should not need to understand them.

  $ llvm-profdata merge -metadata="mysoft.version,${SOFTWARE_VERSION} (${BUILD_NUMBER})" -metadata="mysoft.exe_md5,`md5 -q foo.exe`" foo.profraw -o foo.profdata
  $ llvm-profdata show -metadata foo.profdata
  mysoft.version: "0.1.0"
  mysoft.exe_md5: "337b5c5bc29cbdca090a1921a58465d6"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

Other information that might be interesting: git/svn revision, workload
description, system info (uname -a)

This would be a way to embed almost any platform-specific or heavy-weight
data without requiring the addition of platform-specific code in compiler-rt
and without impacting other developers.

yes.

When profiles are merged it might be simplest to keep all input metadata
(machine-readable things such as feature bits might need to be handled
differently):

Feature bits should not be part of it.

  $ llvm-profdata merge -weighted-input=3,foo.profdata bar.profdata -o foobar.profdata
  $ llvm-profdata show -metadata foobar.profdata
  foo.profdata
    llvm.profile_weight: 3
    llvm.profile_start_time: "2016-01-08T23:41:56.755Z"
    llvm.profile_duration: 5.102s
    llvm.exe_time: "2016-01-08T23:35:56.745Z"
    customkey: "value1"
  bar.profdata
    llvm.profile_weight: 1
    llvm.profile_start_time: "2016-01-15T00:08:41.168Z"
    llvm.profile_duration: "1.001s"
    llvm.exe_time: "2016-01-15T00:08:13.000Z"
    customkey: "value2"
  Total functions: 4
  Maximum function count: 866988873
  Maximum internal block count: 267914296

In terms of implementation, the metadata could live as a separate contiguous
section in the binary profile formats. It might make sense to encode it in
something like YAML so that it could also be directly embedded in the
various text formats.

A single string after the header should do.

thanks,

David

> Hi all,
>
> I'd like to get your thoughts on possibly adding a generic key-value store
> to the profile data formats for 'metadata'. Some potential use cases:
>
> I. Profile Features
>
> The most basic use could be as a central repository for internal bits of
> housekeeping information about the profile data. For example, to
> differentiate between FE and IR instrumentation:
>
> llvm.instrumentation_source: "IR"
>
> A key-value store would make it simple to add new bits of information and
> help keep everything human-readable for the text-based test formats. This
> could potentially also help with error checking at the llvm-profdata level
> if the Reader classes exposed it.
>

This is ok to have, but I don't think the reader class should rely on
metadata to make decisions (as metadata can be thrown away without
affecting correctness). A formal approach, such as the one proposed (to
encode it in variant bits of the version field), should be used.

We could potentially have a "reserved namespace" like `llvm.*` which tools
are not allowed to drop (or that have special handling inside tools).

Assuming that we have a semantics that guarantees that some
labels/"metadata" are kept (and that the compiler can communicate certain
predefined labels to the runtime which propagate back to the profraw and
then to the profdata), what do you think about using a generic format like
this for things like versions and profile source, rather than attempting to
fit everything in a small version field or having to come up with some
convention for a variable being defined or not (as in
http://reviews.llvm.org/D15540)? My impression is that it would give more
flexibility and potentially simplify compatibility.
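
A reserved namespace of this kind might behave as in the Python sketch below. The function names and the exact rule (keys starting with `llvm.` survive stripping and cannot be set by users) are assumptions for illustration only; no such policy exists in the tools today.

```python
RESERVED_PREFIX = "llvm."  # assumed spelling of the reserved namespace


def drop_custom_metadata(metadata):
    """Strip user-supplied keys while keeping everything under llvm.*."""
    return {k: v for k, v in metadata.items() if k.startswith(RESERVED_PREFIX)}


def set_custom_key(metadata, key, value):
    """Return a copy with a user key added; reserved keys are refused."""
    if key.startswith(RESERVED_PREFIX):
        raise ValueError(f"{key!r} is in the reserved {RESERVED_PREFIX}* namespace")
    updated = dict(metadata)
    updated[key] = value
    return updated
```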

-- Sean Silva

This scheme is more flexible, but it does not necessarily simplify
compatibility. We probably need more use cases in mind before we jump
into this flexibility (i.e. passing arbitrary info from instrumentation
compile time to runtime and passing it back to profile-use in a round
trip). Note that we have 64 bits in the version field, and perhaps only
8 bits are actually needed for the version itself, so we have
lots of bits to use for this purpose. On the other hand, I think this
is also orthogonal to the other approach: if we run out of bits some
day, we can always implement this.
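
To make the bit budget concrete, here is a small Python sketch of packing a version number and variant flags into one 64-bit field. The split (low 8 bits for the version, everything above for flags) and the flag name `IR_INSTRUMENTATION` are assumptions taken from the "perhaps only 8 bits" remark, not the layout actually used by the profile formats.

```python
VERSION_BITS = 8
VERSION_MASK = (1 << VERSION_BITS) - 1
FIELD_MASK = (1 << 64) - 1
IR_INSTRUMENTATION = 1 << VERSION_BITS  # hypothetical variant flag


def pack_version(version, variant_flags=0):
    """Combine a version number and variant flags into one 64-bit value."""
    if not 0 <= version <= VERSION_MASK:
        raise ValueError("version does not fit in the low 8 bits")
    if variant_flags & VERSION_MASK or variant_flags & ~FIELD_MASK:
        raise ValueError("variant flags must use bits 8..63 only")
    return variant_flags | version


def unpack_version(field):
    """Split a 64-bit field back into (version, variant_flags)."""
    return field & VERSION_MASK, field & ~VERSION_MASK & FIELD_MASK
```

With this split, 56 independent flags remain available before a more general mechanism would be needed.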

The offline profiling tagging proposed by Nathan is useful to have
regardless of the above.

David

It may be worth thinking about even now. I've seen multiple patches
recently that are using ad-hoc techniques to communicate with the runtime.
E.g. r257230 uses a hack due to not having an orthogonal way to set the
version and variant bits; the result is inferior diagnostic quality and
obscured code intent.

-- Sean Silva

Tagging profile data with such information is generally useful. My
thoughts are

1) Such information probably does not need to be stored in the raw-format
profile data, so no runtime changes are needed; only llvm-profdata
and the indexed format need to be enhanced to support this.
2) A more general way is to just add an option,
--embed_label=<customized_label>, where the label is a string that can hold
key/value pairs encoded in the user's favorite format. The format of the
key-value pairs is not specified and remains opaque to the Instr/Sample
profilers.

...

I think all meta data should be custom defined -- the profile reader
should not need to understand them.

OK. The benefit of enforcing some structure from the start is that it gives
us the possibility of machine parsing/round trip of the content for
future applications. Initially this would just impact how we encode the
label content - the reader classes could still treat the content as opaque
for the time being if the format were something intended to be
human-readable like YAML. On the other hand, if the metadata content begins
life unstructured, it would be harder to retrofit structure later.

...
>
> In terms of implementation, the metadata could live as a separate
> contiguous section in the binary profile formats. It might make sense to
> encode it in something like YAML so that it could also be directly embedded
> in the various text formats.
>

A single string after the header should do.

For the text formats I'd suggest that we delimit the label information with
known prefix/suffix lines. That keeps it easy to parse (and skip) -
especially since the label content can be multiple lines. The delimiters
would only be a part of the file format and wouldn't be displayed from
llvm-profdata.
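
A sketch of how delimiter lines would keep the label easy to parse and to skip, in Python. The delimiter spellings `# label-begin` and `# label-end` are hypothetical placeholders; the actual prefix/suffix lines would be settled in the patch.

```python
BEGIN = "# label-begin"  # hypothetical prefix line
END = "# label-end"      # hypothetical suffix line


def split_label(text):
    """Separate the (possibly multi-line) label from the profile body."""
    label_lines, body_lines = [], []
    in_label = False
    for line in text.splitlines():
        if line == BEGIN and not in_label:
            in_label = True
        elif line == END and in_label:
            in_label = False
        elif in_label:
            label_lines.append(line)  # opaque content, kept verbatim
        else:
            body_lines.append(line)
    if in_label:
        raise ValueError("unterminated label block")
    # The delimiters themselves are dropped, so they are never displayed.
    return "\n".join(label_lines), "\n".join(body_lines)
```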

Given that the information is mostly intended for human consumption, I
am not too worried about the 'unstructured' nature of it. In the end,
the reader can always extract it as a string and the user can use their
favorite parser (be it regexp or YAML) to process it. Until we find
more motivating examples where metadata needs to be shared across
different tools (and therefore needs a well-defined format), we can leave it
undefined for now.
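
For instance, a user who chose to write `key: value` lines into an otherwise opaque label could recover them with a few lines of Python. The regular expression and the name `parse_label` are just one possible convention, assumed here for illustration; the tools would not know or care about it.

```python
import re

# Matches lines such as  mysoft.version: "0.1.0"  or  llvm.profile_duration: 5.102s
PAIR = re.compile(r'^\s*([\w.]+)\s*:\s*"?(.*?)"?\s*$')


def parse_label(label):
    """Extract key/value pairs from a label string; other lines are ignored."""
    pairs = {}
    for line in label.splitlines():
        match = PAIR.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs
```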

We can discuss this more in the patch review :)

David