RFC/bikeshedding: Separation of instruction and pattern definitions in LLVM backends

As many of you know, I have a growing series of patches for a RISC-V backend
under/awaiting review
<⚙ Query: Advanced Search,
<http://github.com/lowrisc/riscv-llvm&gt;\. I'll be posting a larger status update
on that work either later today or tomorrow, this RFC focuses on an issue that
came up during review which I think may benefit from wider input.

David Chisnall suggested that the backend could be made easier to read and
work with by separating out instruction definitions from the patterns used to
match them for codegen. One key advantage of such a separation is that
patterns and insruction definitions can be grouped and ordered independently
in the way that makes most sense. Patterns for addi and add benefit from being
grouped together, but it may make more sense to group the reg-reg and reg-imm
instruction definitions separately. The main downside is that this style is
not quite as concise as specifying patterns alongside the instruction
definition. Repetition seems to be effectively reduced with a few simple Pat
classes.

Semantic information such as isBranch is still represented in the instruction
definition meaning there isn't a complete split between MC-layer and codegen
concerns. The Mips{64,32}r6InstrInfo.td does also factor out this information.
This seems less compelling to me, but dissenting opinions are welcome!

I've demonstrated both the "conventional" approach
<https://gist.github.com/asb/0c61ebc131076c6186052c29968a491d#file-riscvinstrinfo_conventional-td&gt;
and the "separate patterns" approach
<https://gist.github.com/asb/0c61ebc131076c6186052c29968a491d#file-riscvinstrinfo_separate_pats-td&gt;\.
Obviously once patterns and pseudo-instructions are separated out, you may
want to move them to a different .td file.

Does anyone have strong views on these sort of choices one way or another?

Best,

Alex

I do find the second easier to follow, though both are a lot easier to read than the MIPS back end.

David

I agree with David’s sentiment. The second method appears to be easier to follow. IMHO, this would be easier for external users that desire to modify the backend for their own custom extensions/instructions.

Hi David and John, thanks for the feedback. I realise I didn't express
my view in the original email - I also feel that the second approach
works out as cleaner. As you say, it seems plausible that the
separation will make it easier to add new custom
extensions/instructions and we expect that to be a common activity.
The main reason I'm sharing for wider feedback is to ensure there
aren't further options that haven't been considered, as it's not going
to be painful to introduce such refactorings further down the line.

Best,

Alex

The 2nd approach does seem nicer: separates concerns, makes it easier to extend (especially important given RISC-V’s modular ISA). Although, as David said, both seem pretty clean. One thing I would add is not too worry about the conciseness too much initially: I have been “too clever” before when trying to write concise tablegen definitions, only to find that it needlessly complicated making changes and made it difficult to follow. Refactoring it to be more concise from a clear base is easier than the opposite IMHO.

Best,

Jacques

We have them separated on Hexagon for different reasons and it looks a lot cleaner that way.

-Krzysztof

I fully agree. Avoiding excessive verbosity, is useful, but trying to
"code golf" a backend description with ever more complex tablegen
magic isn't. If anything in the RISCV backend looks like it's getting
a bit too clever, please do call me out on it. Sometimes defining a
new class or multiclass can save a lot of repetition, though in these
cases so far I've been able to keep the class/metaclass definition
very close to the use so it's easy to understand what the following
definitions do without jumping about lots of different files.

Best,

Alex

One thing to be aware of with this is that (IIRC) tablegen uses the pattern to infer things about the pattern. One example I vaguely remember is that an empty pattern would result in the same effect as hasSideEffects=1 and I think there were others.

Semantic information such as isBranch is still represented in the instruction
definition meaning there isn't a complete split between MC-layer and codegen
The Mips{64,32}r6InstrInfo.td does also factor out this information.
This seems less compelling to me, but dissenting opinions are welcome!

Mips doesn't manage a complete split between the MC-layer and CodeGen either because there's overlap in the fields that they use. It's separating the encoding from the operation and the availability. The encoding and availability portions are focused on making them easily verifiable against the specification, while the operation section deals with the description of the behaviour of the instruction, how it maps to assembly, etc.

Thanks for the note - excellent point. Looking at
CodeGenDAGPatterns.cpp, it seems in the absence of a pattern
hasSideEffects will be 1, while mayLoad and mayStore default to 0.
Back in 2012, Jakob Stoklund Olesen added the
guessInstructionProperties flag, which causes an error
<https://reviews.llvm.org/rL162460&gt; if a property isn't set explicitly
and can't be inferred. It doesn't look like any other in-tree targets
have ended up enabling this, but it looks like it would be worth
enabling for RISCV, particularly if going ahead with splitting
instructions and patterns.

Best,

Alex

If no pattern is provided, does the implied 'hasSideEffects' take preference over an explicit 'let hasSideEffects = 0'?

I assume not, but I'd like to be certain.

Knowing about this default for a NULL pattern is important though, I was certainly unaware of this until this discussion and will have to revisit my existing definitions which have NULL patterns in case 'hasSideEffects' doesn't have the value I would expect for the instruction.

Thanks,

    MartinO

The explicit let hasSideEffects=0 takes preference. You can audit this
by looking at build/lib/Target/Mytgt/MytgtGenInstrInfo.inc. Look in
`extern const MCInstrDesc MytgtInsts`
You'll see `MCID::UnmodeledSideEffects` for any instruction where
hasSideEffects ended up being true (either explicitly specified or
inferred).

Best,

Alex

Thanks Alex,

Yes, that is what I expected the behaviour would be. So for the pattern-less definitions, it should be sufficient to wrap them in a 'let hasSideEffects = 0 in { ... }' unless of course the instruction really does have side-effects. Safer to be explicit anyway.

All the best,

  MartinO

To updated on this, I've found guessInstructionProperties requires the
standard TargetOpcode::* instructions to have their properties
explicitly set. https://reviews.llvm.org/D37065 does this. If anyone
has any insight in to why TargetOpcode::PHI must have hasSideEffects
set, that would be interesting.

Best,

Alex