Thanks for keeping this going, Lorenzo.
Lorenzo Casalino via llvm-dev <email@example.com> writes:
The first questions need to be “what does it mean?”, “how does it
work?”, and “what is it useful for?”. It is hard to evaluate a
proposal without that.
- "What does it mean?": it means to preserve specific information,
represented as metadata assigned to instructions, from the IR level,
down to the codegen phases.
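As a minimal sketch of the idea at the IR level (the `!secure` metadata
kind is an invented name used purely for illustration, not an existing
metadata kind):

```llvm
; Hypothetical example: a store carrying custom metadata. The proposal is
; that this attachment (or the information it encodes) survive through
; instruction selection into the Machine IR. "!secure" is an invented
; metadata kind, used here only for illustration.
define void @write(i32 %v, ptr %p) {
  store i32 %v, ptr %p, !secure !0
  ret void
}

!0 = !{!"sensitive"}
```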
An important part of the definition is "how late?" For my particular
uses it would be right up until lowering of asm pseudo-instructions,
even after regalloc and scheduling. I don't know whether someone might
need metadata even later than that (at asm/obj emission time?) but if
metadata is supported on Machine IR then it shouldn't be an issue.
"How late" it is context-specific: even in my case, I required such
to be preserved until pseudo instruction expansion. Conservatively, they
preserved until the last pass of codegen pipeline.
Regarding their employment in the later steps, I would not say they are not
required, sinceI worked on a specific topic of secure compilation, and I do
not have the wholepicture in mind; nonetheless, it would be possible to
things work out withthe codegen and later reason on future developments.
As with IR-level metadata, there should be no guarantee that metadata is
preserved; it's a best-effort thing. In other words, relying on
metadata for correctness is probably not the thing to do.
Ok, I made a mistake stating that metadata should be *preserved*; what
I really meant is to preserve the *information* that such metadata
carries.
- "How does it work?": metadata should be preserved during the several
back-end transformations; for instance, during the lowering phase,
DAGCombine performs several optimization to the IR, potentially
combining several instructions. The new instruction should, then,
assigned with metadata obtained as a proper combination of the
original ones (e.g., a union of metadata information).
I want to make it clear that this is expensive to do, in that the changes
required to the codegen pipeline are quite extensive and widespread. I
know because I've done it*. It will help if there are utilities
people can use to merge metadata during DAG transformation and the more
we make such transfers and combinations "automatic" the easier it will
be to preserve metadata.
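To sketch what such a merge utility might look like, here is a small
stand-alone C++ example (the names and the set-of-strings representation
are invented for illustration; a real version would operate on LLVM's own
metadata and SDNode/MachineInstr types):

```cpp
#include <set>
#include <string>

// Hypothetical merge helper: when a DAG transformation folds two nodes
// into one, the combined node receives the union of the metadata carried
// by the originals. The representation (a set of strings) is invented
// for illustration only.
using MetadataSet = std::set<std::string>;

MetadataSet mergeMetadata(const MetadataSet &A, const MetadataSet &B) {
  MetadataSet Merged = A;
  Merged.insert(B.begin(), B.end()); // union semantics
  return Merged;
}
```

The same helper would be called from every combine that folds nodes, which
is exactly the kind of "automatic" support discussed above.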
Once the mechanisms are there it also takes effort to keep them going.
For example if a new DAG transformation is done people need to think
about metadata. This is where "automatic" help makes a real difference.
* By "it" I mean communicate information down to late phases of codegen.
I don't have a "metadata in codegen" patch as such. I simply cobbled
something together in our downstream fork that works for some very
I know what you have been through, and I can only agree with you: for the
project I mentioned above, I had to perform several changes to the whole IR
lowering phase in order to correctly propagate high-level information; it
was not cheap and required a lot of effort.
It might be possible to have a dedicated data-structure for such
metadata info, and an instance of such structure assigned to each
instruction.
I'm not entirely sure what you mean by this.
I was imagining a per-instruction data-structure collecting the metadata
related to that specific instruction, instead of having several pieces of
metadata directly embedded in each instruction.
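One possible shape for such a structure is a side table keyed by
instruction, so that metadata lives outside the instructions themselves.
A stand-alone C++ sketch (all names invented; a real version would key on
MachineInstr* or some stable instruction identity):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-instruction metadata side table: metadata is kept in
// one dedicated structure keyed by the instruction, instead of being
// embedded in every instruction. The generic pointer key stands in for
// a real instruction handle (e.g., MachineInstr*).
class MetadataTable {
  std::unordered_map<const void *, std::vector<std::string>> Table;

public:
  void attach(const void *Instr, std::string Info) {
    Table[Instr].push_back(std::move(Info));
  }
  // Returns null if the instruction carries no metadata.
  const std::vector<std::string> *lookup(const void *Instr) const {
    auto It = Table.find(Instr);
    return It == Table.end() ? nullptr : &It->second;
  }
};
```

A side table keeps instructions lightweight, at the cost of having to
update the table whenever passes delete or replace instructions.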
- "What is it useful for?": I think it is quite context-specific; but,
in general, it is useful when some "higher-level" information
(e.g., that canbe discovered only before the back-end stage of the
compiler) are required in the back-end to perform "semantic"-related
That's my use-case. There's semantic information codegen would like to
know but is really much more practical to discover at the LLVM IR level
or even passed from the frontend. Much information is lost by the time
codegen is hit and it's often impractical or impossible for codegen to
derive it from first principles.
To give a (quite generic) example where such codegen metadata may be
useful: in the field of "secure compilation", preservation of security
properties during the compilation phases is essential; such properties
are specified in the high-level specifications of the program, and may
be expressed with IR metadata. The possibility to keep such IR
metadata in the codegen phases may allow preservation of properties
that would otherwise be invalidated by codegen transformations.
That's a great use-case. I do wonder about your use of "essential",
though.
With *essential* I mean fundamental for satisfying a specific security
property of the target program.
Is it needed for correctness? If so, an intrinsics-based
solution may be better.
Uhm... it might sound like a naive question, but what do you mean by
"correctness" here?
My use-cases mostly revolve around communication with a proprietary
frontend and thus aren't useful to the community, which is why I haven't
pursued this with any great vigor before this.
I do have uses that convey information from LLVM analyses but
unfortunately I can't share them for now.
All of my use-cases are related to optimization. No "metadata" is
needed for correctness.
I have pondered whether intrinsics might work for my use-cases. My fear
with intrinsics is that they will interfere with other codegen analyses
and transformations. For example they could be a scheduling barrier.
I also have wondered about how intrinsics work within SelectionDAG. Do
they impact dagcombine and other transformations? The reason I call out
SelectionDAG specifically is that most of our downstream changes related
to conveying information are in DAG-related files (dagcombine, legalize,
etc.). Perhaps intrinsics could suffice for the purposes of getting
metadata through SelectionDAG with conversion to "first-class" metadata
at the Machine IR level. Maybe this is even an intermediate step toward
"full metadata" throughout the compilation.
I employed intrinsics as a means for carrying metadata but, in my
experience, I am not sure they can be considered a valid alternative:
- For each llvm-ir instruction employed in my project (e.g., store), an
equivalent intrinsic is declared, with particular parameters representing
metadata (i.e., first-class metadata are represented by specific
intrinsic parameters).
- During the lowering, each ad-hoc intrinsic must be properly handled,
adding the proper legalization operations, DAG combinations and so on.
- During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
MIR instructions), metadata are passed to the MIR representation of the
program.
In particular, the second point raises a critical problem in terms of
missed optimizations (e.g., intrinsic store + intrinsic trunc are not
automatically converted into an intrinsic truncated store). The backend
must then be instructed to perform such optimizations, which are actually
already performed on non-intrinsic instructions (e.g., store + trunc is
already converted into a truncated store).
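In IR terms, the missed combine looks roughly like this (the
`@secure.store.i16` declaration stands in for an ad-hoc metadata-carrying
intrinsic; the name and the extra i32 "metadata" argument are invented
for illustration):

```llvm
; Plain IR: the trunc feeding the store is folded by the backend into a
; single truncated store during DAG combining.
define void @plain(i32 %v, ptr %p) {
  %t = trunc i32 %v to i16
  store i16 %t, ptr %p
  ret void
}

; Ad-hoc intrinsic-style encoding (invented declaration; the i32 argument
; stands in for the metadata). The generic trunc+store combine no longer
; matches, so the same folding must be re-implemented by hand.
declare void @secure.store.i16(i16, ptr, i32)

define void @wrapped(i32 %v, ptr %p) {
  %t = trunc i32 %v to i16
  call void @secure.store.i16(i16 %t, ptr %p, i32 1) ; 1 = "sensitive"
  ret void
}
```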
Instead of re-inventing the wheel, and since the backend would have to be
modified anyway in order to support optimizations on intrinsics, I would
rather insert some sort of mechanism to support metadata attachment as a
first-class feature of the IR/MIR, with automatic merging of metadata,
for instance.