Metadata in LLVM back-end

Hi,

Thank you all for keeping this going. Indeed, I was not aware that the discussion was still going on; I am really sorry for the late reply.

Nice to hear from you again! Thank you for starting this thread :wink:

I understand Chris' point about metadata design. Either the metadata becomes stale or is removed (if we do not teach transformations to preserve it), or we end up modifying many (if not all) transformations to keep the data intact.
Currently in the IR, I feel like the default behavior is to ignore/remove the metadata, and only a limited number of transformations know how to maintain and update it, which is a best-effort approach.
That being said, my initial thought was to adopt this approach to the MIR, so that we can at least have a minimal mechanism to communicate additional information to various transformations, or even dump it to the asm/object file.
In other words, it is the responsibility of the users who introduce/use the metadata in the MIR to teach the transformations they selected how to preserve their metadata. A common API to abstract this would definitely help, just as combineMetadata() from lib/Transforms/Utils/Local.cpp does.

Unfortunately, I have never worked with LLVM IR metadata (I have focused almost entirely on the
back-end and have only scratched the surface of LLVM's middle-end), but I see your point.

Clearly, applying the needed modifications to all the back-end transformations/optimizations
is unfeasible and, probably, not worth it -- different users may have different requirements/needs
regarding a specific pass.

I like the idea of a common API to handle the MIR metadata, and let the end user handle
such data. Of course, if the community encounters common cases while handling the metadata, such
cases may be integrated with the upstream project.
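To make the "common API" idea a bit more concrete, here is a minimal sketch of how it could look. It is a toy model and not LLVM code: `ToyInstr`, `MetadataMerger`, `registerKind`, `combine` and `keepIfEqual` are all hypothetical names, and the only thing borrowed from the discussion above is the policy (metadata kinds that nobody registered a handler for are dropped, as in the best-effort approach used in the IR).

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for a machine instruction carrying string-keyed metadata.
// Hypothetical: this is not llvm::MachineInstr.
struct ToyInstr {
  std::map<std::string, std::string> Metadata;
};

// A merge handler decides what survives when two instructions are combined.
// It receives the value from each side (empty if that side had none).
using MergeFn = std::function<std::optional<std::string>(
    const std::optional<std::string> &, const std::optional<std::string> &)>;

class MetadataMerger {
  std::map<std::string, MergeFn> Handlers;

  static std::optional<std::string> get(const ToyInstr &I,
                                        const std::string &Kind) {
    auto It = I.Metadata.find(Kind);
    if (It == I.Metadata.end())
      return std::nullopt;
    return It->second;
  }

public:
  // Users register a policy only for the metadata kinds they care about.
  void registerKind(const std::string &Kind, MergeFn Fn) {
    Handlers[Kind] = std::move(Fn);
  }

  // Compute the metadata of the instruction that replaces A and B.
  // Kinds without a registered handler are dropped (best-effort default).
  ToyInstr combine(const ToyInstr &A, const ToyInstr &B) const {
    ToyInstr Result;
    for (const auto &Entry : Handlers) {
      const std::string &Kind = Entry.first;
      if (auto Merged = Entry.second(get(A, Kind), get(B, Kind)))
        Result.Metadata[Kind] = *Merged;
    }
    return Result;
  }
};

// Example policy: keep the value only if both sides agree on it.
inline std::optional<std::string>
keepIfEqual(const std::optional<std::string> &A,
            const std::optional<std::string> &B) {
  if (A && B && *A == *B)
    return A;
  return std::nullopt;
}
```

A user who cares about, say, a "sec.tag" kind would call `registerKind("sec.tag", keepIfEqual)` once and then route every instruction merge in the passes they selected through `combine`; every other kind is silently discarded.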

Nonetheless, the main point of this thread is to preserve middle-end metadata down to the
back-end, right after the Instruction Selection phase. Hence, regardless of the end user's
needs, a "preserve-all" policy during the lowering stage is required, which will involve some
changes, in particular in the DAGCombine pass.
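As a rough illustration of what "preserve-all" could mean when nodes are combined during lowering, the sketch below gives the replacement node the union of the metadata attached to every node it replaces. Again, this is a toy model under my own assumptions: `ToyNode` and `combinePreservingAll` are hypothetical and do not correspond to `SDNode` or to any existing DAGCombine hook.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a selection DAG node (hypothetical; not llvm::SDNode).
struct ToyNode {
  std::string Opcode;
  // Each metadata kind maps to the set of values inherited from the IR
  // instructions this node was derived from.
  std::map<std::string, std::set<std::string>> Metadata;
};

// "Preserve-all": the replacement node inherits the union of the metadata
// of every node it replaces, so nothing is lost while combining.
ToyNode combinePreservingAll(const std::string &NewOpcode,
                             const std::vector<ToyNode> &Replaced) {
  ToyNode Result{NewOpcode, {}};
  for (const ToyNode &N : Replaced)
    for (const auto &KV : N.Metadata)
      Result.Metadata[KV.first].insert(KV.second.begin(), KV.second.end());
  return Result;
}
```

Keeping a set per kind avoids having to decide, at lowering time, which of two conflicting values wins; that decision is left to whoever consumes the metadata later.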

As for my use case, it is also security-related. However, I do not consider the metadata to be a compilation "correctness" criterion: metadata, by definition (from the LLVM IR), can be safely removed without affecting the program's correctness.
If possible, I would like to have more details on Lorenzo's use case in order to see how metadata would interfere with the program's correctness.

I would really like to discuss the details here but, unfortunately, I am working on a
publication and thus cannot disclose any details :frowning:

However, by "correctness" I do not mean "I/O correctness", but the preservation of a
security property expressed in the front-end (e.g., specified in the source code) or in the
middle-end (e.g., specified in the LLVM IR, for instance by a transformation pass).

From a security point of view, removing or altering metadata does not interfere with the I/O
functionality of the code (although it may impact performance), but it may introduce
vulnerabilities.

As for the RFC, I can definitely try to write one, but this would be my first time doing so. But maybe it is better to start with Lorenzo's proposal, as you have already been working on this? Please tell me if you prefer me to start the RFC though.

It is the first time for me too, do not worry!

We could just use any other RFC as a template to get started :smiley:

I think that a structure like the following would be fine:

1. Background
1.1 Motivation
1.2 Use-cases
1.3 Other approaches
2. Goal(s)
3. Requirements
4. Drawbacks and main bottlenecks
5. Design sketch
6. Roadmap sketch
7. Potential future development

It may be a bit overkill; you are warmly invited to cut/refine these points!

And... no, I still have no sketch of the RFC; sorry, I have had a heavy workload these
days.

Yes, you can start the write-up of the RFC.

Quoting David:

"Since you first raised the topic [...] I want to give you right of first refusal."

Have a nice day!

-- Lorenzo

Dear Tuan,

How are you doing? Did you manage to start the draft for the RFC?

I take this opportunity to wish you all the best for this new year :slight_smile:

Best regards,
Lorenzo Casalino

Did anyone send an RFC for this?

First-class metadata would be exceptionally useful for sanitizers and other dynamic tools. For
example, we want to construct PC-keyed metadata tables in the binary (without affecting the
generated code), to inform program behavior at runtime or to allow offline analysis. A
prerequisite is to actually propagate the metadata we need from the Clang frontend or LLVM
middle-end down to the assembly printer.

Our team has brainstormed many use cases:

  • GWP-TSan: storing PCs of accesses lowered from C++ atomics, to filter them out from race
    detection.
      • List
      • Map[callsite PC] → List
  • no_sanitize attributes: storing a map from functions that have the no_sanitize("…")
    attribute to the associated sanitizer, for filtering out from GWP-*San. Ideally we do not
    introduce new no_sanitize string literals, but simply rely on existing ones (e.g. a
    no_sanitize("thread") works for both TSan and GWP-TSan).
      • Map[Func] → SanitizerKind
  • Fuzzing aid/CFG reconstruction: marking coverage PCs as function entry/exit or with the
    number of outgoing edges from the BB (allows finding gaps in the coverage frontier).
  • Type-aware malloc and heap profiling: enable the allocator to get the type for a given new
    call, to optimize for expected usage of the allocation.
      • Map[new callsite PC] → object type
  • Other: potential use cases for future bug-finding tools (GWP-assert, GWP-MSan,
    GWP-DFSan, GWP-UBSan).
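To illustrate the kind of PC-keyed table these use cases have in mind, here is a minimal sketch of the runtime side: a table sorted by PC that a tool could query with a binary search. Everything in it is an assumption made for illustration (`PCKind`, `PCEntry`, `PCTable` and `shouldIgnoreForRaceDetection` are hypothetical names, and no particular binary encoding or section layout is implied).

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical kinds of facts a compiler could record about a PC.
enum class PCKind : std::uint8_t { AtomicAccess, FunctionEntry, FunctionExit };

struct PCEntry {
  std::uint64_t PC;
  PCKind Kind;
};

// A PC-keyed table, kept sorted so lookups are a binary search.
class PCTable {
  std::vector<PCEntry> Entries;

public:
  explicit PCTable(std::vector<PCEntry> E) : Entries(std::move(E)) {
    std::sort(Entries.begin(), Entries.end(),
              [](const PCEntry &A, const PCEntry &B) { return A.PC < B.PC; });
  }

  // Returns the kind recorded for exactly this PC, if any.
  std::optional<PCKind> lookup(std::uint64_t PC) const {
    auto It = std::lower_bound(
        Entries.begin(), Entries.end(), PC,
        [](const PCEntry &E, std::uint64_t Key) { return E.PC < Key; });
    if (It == Entries.end() || It->PC != PC)
      return std::nullopt;
    return It->Kind;
  }
};

// GWP-TSan-style filter: skip accesses that were lowered from atomics.
bool shouldIgnoreForRaceDetection(const PCTable &Table, std::uint64_t PC) {
  std::optional<PCKind> Kind = Table.lookup(PC);
  return Kind.has_value() && *Kind == PCKind::AtomicAccess;
}
```

The table lives next to the code rather than in it, which is the "without affecting the generated code" property mentioned above; the open question in this thread is how the compiler gets the information needed to fill it.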

First-class metadata would open the door to some really cool things.

Thanks,
Matt Morehouse

If you need PCs of certain key instructions, I suggest you take a look at MachineInstr::setPostInstrSymbol:
https://llvm.org/doxygen/classllvm_1_1MachineInstr.html#ac8ce95857a66b3706a84d1fd5072f0dd

This is used to track setjmp return addresses in CFG (Control Flow Guard), for example. The feature isn’t really designed to put labels on arbitrary instructions, just things like calls or atomics that aren’t likely to be rewritten or replaced by later codegen passes. However, most of your use cases seem to just need return addresses, which is what this feature was made for.

Hello Matt,

I think that the RFC drafting went stale some months ago due to the heavy workload all the participants were under.

As of now, I do not know when the RFC will be actually drafted and sent.

Cheers,
Lorenzo

Thanks, Reid. setPostInstrSymbol is useful for getting PCs, but we still need a way to propagate the metadata we need down to the point where we can use setPostInstrSymbol (and further to the assembly printer, so we can actually encode the metadata in the binary). That means things like function types, C++ object types, etc. that aren’t normally available in the backend.

Thanks for the update, Lorenzo.

I have some free time to work on an RFC, but I’m unfamiliar with how the implementation details would work.

If I dig through this thread and try to draft something, would you and/or Son be willing to contribute?

Thanks,
Matt

Hi all,

Thanks for resuscitating this discussion.

@Lorenzo please pardon me for dropping this for quite a while. It was indeed a tense period for me.

@Matt yes, it’d be awesome if you could sketch an RFC; we can definitely iterate on it to come up with more polished versions. I’d be more than happy to help in any way I can.

Hi all,

Thanks for resuscitating this discussion.

@Lorenzo please pardon me for dropping this for quite a while. It was indeed a tense period for me.

No problem, I know! (Karine told me :wink:)

@Matt yes, it’d be awesome if you could sketch an RFC; we can definitely iterate on it to come up with more polished versions. I’d be more than happy to help in any way I can.

I agree with Son! If you need any help, do not hesitate to ask!

Thank you,
Lorenzo

Lorenzo Casalino via llvm-dev <llvm-dev@lists.llvm.org> writes:

I think that the RFC drafting went stale some months ago due to the heavy
workload all the participants were under.

Indeed. In the interim I switched jobs and have been ramping up. I
am still very interested in this topic and will be happy to look over
an RFC.

                  -David