Peano: LLVM support for AMD/Xilinx AI Engine processors

On behalf of AMD, I’m pleased to announce the open sourcing of an LLVM backend for AMD/Xilinx AI Engine processors (https://github.com/Xilinx/llvm-aie). These processors ship in a number of devices, including Ryzen AI SoCs.
The repository currently focuses on supporting the AIE2 architecture implemented by the XDNA accelerators in “Phoenix” and “Hawk Point” devices.
A simple flow for running code on these devices is documented here: E2E Linux Example · Xilinx/llvm-aie Wiki · GitHub
Note that these accelerators contain an array of such processors, while the LLVM backend only targets a single processor. Support for devices as a whole is available in open-source tools based on MLIR (https://github.com/Xilinx/mlir-aie).
For more architecture information, see: AMD Technical Information Portal

In addition to LLVM code generation, the repository also includes support for Clang, LLD, binutils-style tools (e.g. llvm-objdump), compiler-rt, and LLVM libc.

Generally speaking, AI Engine processors are in-order, exposed-pipeline VLIW processors. Each VLIW instruction bundle specifies the behavior of one or more functional units,
which begin executing a new instruction at the same time. The processor pipeline does not include stall logic to enforce data dependencies, and instructions will continue
executing in order regardless of other instructions in the pipeline. As a result, the compiler is free to schedule machine instructions whose accesses to the same register overlap in time.
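To make these semantics concrete, here is a minimal Python sketch of an exposed pipeline. This is an assumed, simplified model for illustration only (not the real AIE microarchitecture): every instruction reads its operands at issue and commits its result a fixed number of cycles later, with no stall logic in between, so an instruction issued in the meantime observes the old register value.

```python
# Illustrative model of an exposed pipeline (simplified assumption, not
# the actual AIE hardware): operands are read at issue, and the result
# only becomes architecturally visible `latency` cycles later.

def run(program, regs):
    """program: list of (issue_cycle, dest, compute_fn, latency).
    compute_fn sees the register state at issue; its result is written
    to `dest` only `latency` cycles later. There are no stalls."""
    pending = []                                  # (commit_cycle, dest, value)
    last_cycle = max(c + lat for c, _, _, lat in program)
    for cycle in range(1, last_cycle + 1):
        # Delayed writes whose latency has elapsed land at the start of the cycle.
        for _, dest, val in [p for p in pending if p[0] == cycle]:
            regs[dest] = val
        pending = [p for p in pending if p[0] > cycle]
        # Then this cycle's instructions issue, reading the current state.
        for c, dest, fn, lat in program:
            if c == cycle:
                pending.append((cycle + lat, dest, fn(dict(regs))))
    return regs

# r12 starts at 3. A load with a long latency is issued first, and a
# multiply issued afterwards still reads the *old* r12 until the load commits.
regs = run(
    [
        (1, "r12",  lambda r: 10,                  7),  # load: r12 <- 10, visible in cycle 8
        (4, "r12",  lambda r: r["r12"] * r["r12"], 2),  # mul reads the old r12 (3), visible in cycle 6
        (6, "mem0", lambda r: r["r12"],            1),  # store sees the mul's result (9)
        (8, "mem1", lambda r: r["r12"],            1),  # store sees the load's result (10)
    ],
    {"r12": 3, "mem0": 0, "mem1": 0},
)
print(regs)  # {'r12': 10, 'mem0': 9, 'mem1': 10}
```

Note how the multiply at cycle 4 legally reuses r12 while the earlier load's write to the same register is still in flight; the compiler, not the hardware, is responsible for making this timing correct.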

Other key architectural characteristics include instruction slots whose width varies between instruction encodings, and relatively small address spaces (20-bit pointer registers).
Varying-width instruction slots impose code alignment restrictions on instructions that are branch or return targets.

To support the unusual architectural features of AI Engine, this repository adds LLVM support in several specific areas:

  • support for non-power-of-2 pointers;
  • improved TableGen support for specifying operand latencies and resource conflicts of exposed-pipeline instructions;
  • scheduler support for negative operand latencies (i.e. an instruction writing to a register may be scheduled after a corresponding use);
  • scheduler support for slot assignment for instructions that can be issued in multiple VLIW slots;
  • support for selecting relocations for instructions with multiple encodings;
  • support for architectures with code alignment restrictions;
  • improved register allocation support for complex register hierarchies, specifically spills of sub-registers of large compound registers.

We invite the community to comment on these approaches, and we would like to begin the process of upstreaming these generic improvements.
Currently, we are actively working on improving QoR (quality of results) and on supporting the newest versions of the AIE architecture, in particular the XDNA2 accelerator in “Strix Point” devices.

The Peano Team

Great work, Stephen! I didn’t realise these were shipping in the APUs; that makes them very widely available.

@jhuber6 the RPC infra below libc is expected (by me anyway) to work on these. They’re on shared memory, and multiple tiles claiming different ports is very like multiple wavefronts claiming different ports.

More ambitious, and probably outside the scope of libc: RPC calls between the x64, GPU, and AI Engine architectures are supposed to work together, probably including on a single shared buffer. That enables things like a GPU offloading a block of work to the AI Engine.

Thanks for this great work! Can you elaborate a bit on the negative latencies and how they are used effectively?

Hi Manuel,

We have some instructions with long latencies, where the register write happens several cycles after the instruction issues. Between the issue of the instruction and the write of the output register, other instructions can still use the register.
Take this example: here, the lda instruction takes 7 cycles to write to r12. Before that write lands, the initial value of the register can be used for other purposes; in this case, it is multiplied and then stored to memory. Only at the end of cycle 7 is the new value of r12 “committed” by the lda instruction, and it is then read in cycle 8.

1:   lda r12, [p0]     // writes r12 after cycle 7.
2:   nop
3:   nop
4:   mul r12, r12, r12 // reads r12 initial value and writes r12 after cycle 5.
5:   nop
6:   st  r12, [p1]     // reads r12 from instruction 4.
7:   nop 
8:   st  r12, [p1, 4]  // reads r12 from instruction 1.

On the implementation side, this “negative-latency scheduling” is handled by the PostMachineScheduler pass, which will, for example, receive an instruction sequence like this as input:

mul r12, r12, r12
st  r12, [p1]
lda r12, [p0]
st  r12, [p1, 4]

It will turn this into the sequence shown above, where the timing is strictly observed. In particular, notice that there is a WAR dependency between st and lda: given that st reads r12 in stage 1 and lda writes r12 in stage 7, this creates a dependency latency of 1 - 7 = -6 cycles. This is why the two instructions have been “reversed” after scheduling.

Among many other things, this means the SchedGraph now has to be able to represent negative (signed) latencies for the dependencies between instructions.
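As a self-contained illustration (a sketch, not the actual LLVM implementation; the instruction names, edges, and latencies are transcribed from the example above), a schedule over dependence edges with signed latencies can be checked for legality like this:

```python
# Sketch: legality check for a schedule whose dependence edges carry
# signed latencies. An edge (a, b, lat) requires cycle[b] >= cycle[a] + lat;
# a negative lat lets the successor issue *before* its predecessor.

def schedule_is_legal(edges, cycle):
    """edges: list of (pred, succ, signed_latency); cycle: name -> issue cycle."""
    return all(cycle[b] >= cycle[a] + lat for a, b, lat in edges)

# Dependence graph transcribed from the example above:
edges = [
    ("mul", "st1", 2),   # mul writes r12 after cycle 5; st1 reads it in cycle 6
    ("st1", "lda", -6),  # WAR edge: st1 reads in stage 1, lda writes in stage 7
    ("lda", "st2", 7),   # lda writes r12 after 7 cycles; st2 reads it in cycle 8
]

# Issue cycles of the scheduled sequence shown above: legal.
scheduled = {"lda": 1, "mul": 4, "st1": 6, "st2": 8}
assert schedule_is_legal(edges, scheduled)

# Issuing st2 one cycle earlier, before lda's write commits, is rejected.
too_early = {"lda": 1, "mul": 4, "st1": 6, "st2": 7}
assert not schedule_is_legal(edges, too_early)
```

The WAR edge with latency -6 is exactly what allows lda (issued in cycle 1) to sit five cycles before the st that precedes it in program order.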

I’d be happy to give more insights if that is needed 🙂

Ah, this is a really great example. Thank you! 🙂
