[RFC] Sandbox Vectorizer: An experimental modular vectorizer

vporpo · May 17, 2024, 5:44pm

Hi all, we have been working on an experimental vectorizer infrastructure with the aim of creating a more modular vectorizer that is simpler to extend and test. We would like your thoughts on it. Our goal is to work on this project upstream and maintain it alongside the existing vectorizers, without interfering with them.

Motivation for this work is the increasing complexity of the SLP vectorizer, which started as a fairly simple pass but has now become quite complex as more transformations have been added in, while still maintaining a relatively monolithic design.

This new vectorizer has three main design features:

Modular: The most important one is that we can split the vectorization pass into smaller sub-passes. Each such sub-pass implements a vectorization transformation, and when all are put together in a pipeline will give us a complete feature-rich vectorizer. Unlike the current SLPVectorizer, we get actual IR after each transformation. This has the benefit that we can test it in isolation with lit tests.
WYSIWYG Design and Transactional “Sandbox” IR: The main goal is a WYSIWYG (What You See Is What You Get) design where we apply transformations directly on the IR. This helps avoid using separate data-structures for holding the transformed IR state, simplifying the design and making it less error-prone.
This design necessitates a transactional IR (which we named “Sandbox IR”) for reverting the IR state if the transformation is found not to be profitable. Having a separate IR also helps with having more vectorization-specific abstractions. The vectorizer developer works with Sandbox IR only and cannot reach through and access LLVM IR. Sandbox IR is implemented as a layer on top of LLVM IR and is by design always automatically kept in sync with LLVM IR, avoiding unnecessary duplication of state and removing a common source of bugs.
Simple Transformations with Complex Infrastructure: We try to move as much complexity as possible from the vectorization transformations to the infrastructure. This is done in two ways:
(a) By implementing common vectorization multi-instruction patterns as single SandboxIR instructions, like Packs/Unpacks. This not only simplifies the transformations, but also moves the burden of matching/lowering such patterns down to the infrastructure.
(b) By moving the burden of keeping components up-to-date from the vectorizer writer to the pass infrastructure. Components like the DAG and the Scheduler that are being used throughout the vectorizer for legality checks etc., are kept up to date by the infrastructure itself.

We believe that these design features make it easier to develop new transformations and also help with testing as we can test components in isolation.

Vectorizer Overview

SBVec_Overview

The input and output of the vectorizer is SandboxIR:
The first step is creating SandboxIR from LLVM IR for the whole function.
Next we transform SandboxIR with a series of vectorization sub-passes, the first being a simple Bottom-Up vectorizer, and the rest applying vectorization transformations that try to improve the vectorized IR.
The final sub-pass checks for profitability and either accepts or reverts the IR to its original state.
Finally, we drop SandboxIR. Please note that there is no lowering phase. SandboxIR is always automatically synced with LLVM IR.
This pipeline runs for every vectorization region (see SBRegion discussed later) so it will run multiple times within a function.

Sandbox IR

To the pass designer, Sandbox IR is a self-sufficient IR. The input and output of the vectorization passes are SandboxIR, and you have no access to LLVM IR. Internally it is implemented as a thin layer on top of LLVM IR that is always in sync with it. Some of its features are:

Its IR class hierarchy mirrors LLVM IR, for example:
- Instruction → SBInstruction
- BasicBlock → SBBasicBlock
- Function → SBFunction etc.
Its API feels like LLVM IR, for example:
- cast/isa/dyn_cast<SBInstruction>(SBVal)
- SBUser::getOperand()
- SBValue::replaceAllUsesWith() etc.
The vectorizer extends it with some vectorization specific instructions, like SBPack/SBUnpack/SBShuffle, which help simplify the pass.
Internally Sandbox IR maps directly to LLVM IR. For most instructions there is a one to one correspondence. But Sandbox IR also supports some vectorization-specific instruction patterns that map to more than a single LLVM Instruction. An example of that is the SBPackInstruction that packs multiple values into a vector. It corresponds to a sequence of LLVM InsertElementInst in its simplest form.

Please note that even though SandboxIR is generic enough and could potentially be used by other passes, this is beyond the scope of this proposal.

Sandbox IR Design Principles

WYSIWYG (What You See Is What You Get)
- Modifying IR modifies the underlying LLVM IR directly, so dumping LLVM IR and reading it back always gives you the exact same state of SandboxIR.
No replication of LLVM IR state if we can avoid it:
- Provides use-def chains without replicating the data-structures. It relies on LLVM IR use-def chains internally. So to get the operand of a SBUser the infrastructure accesses the corresponding LLVM User, gets its LLVM Value operand through the chain, and then gets the corresponding SBValue operand, as shown below in steps 1 to 3.
  
  SBVec_Chains803×388 25.7 KB
- Reuse Type and Instruction-specific state (like Predicates, MathFlags etc.). None of it is replicated in Sandbox IR, we are using the state from LLVM IR.
What is replicated:
- The instruction opcode, because one LLVM IR opcode can form more than one type of Sandbox IR Instructions. For example an LLVM ShuffleInst could be either a SBShuffle or a SBPackInstruction, depending on the mask.
- The subclass ID used for isa<>, cast<> etc.
- The pointer to SBContext, as we can’t access it through LLVM’s Context pointer.
Feels-like LLVM IR:
- Similar class hierarchy:

//          +- SBArgument +- SBConstant     +- SBLoadInstruction   
//          |             |                 |                          
// SBValue -+- SBUser ----+- SBInstruction -+- SBStoreInstruction     
//          |                               |                          
//          +- SBBasicBlock                 +- SBPackInstruction
// ...                                      |

Stays close to the LLVM IR API by providing commonly used functions e.g., getOperand(), setOperand(), replaceAllUsesWith(), etc.
Same cast<>/isa<>/dyn_cast<> type system.

Sandbox IR Features

Provides internal APIs for tight-coupling with other components, i.e., allows tracking instruction movement, creation, deletion through callbacks.
Even though it is not serializable in its SandboxIR form, the mapping from LLVM IR to SandboxIR (and the reverse) is straightforward so we can use LLVM IR textual form in our tests.
It tries hard to not provide access to the underlying LLVM IR, aiming at protecting the user from breaking things unintentionally.
Since it is always synced to LLVM IR, it makes it relatively easy to use some existing Analyses internally and provide wrappers for them.
It is transactional, providing a save()/accept()/restore() API. In our initial implementation the state is saved at the beginning of the pass pipeline and accepted/restored by a pass at the end. But as we extend the implementation this could be useful within the transformations as well.

Sandbox IR Transactions

A major feature of SandboxIR is that it allows us to save and rollback its state (and the underlying LLVM IR’s).

This was proposed as a feature on LLVM IR ( Checkpointing RFC ), but the scope was too large, so rolling back all state was not always possible. With SandboxIR we reduce the scope, and limit the user’s access to objects that can be tracked. This makes checkpointing feasible within the context of SandboxIR.

A transactional IR has the following advantages:

It enables simpler designs where you first transform the IR and then decide if it is worth keeping it, instead of operating on abstractions of the IR before lowering them to IR.
- Simpler designs: no need to create/maintain parallel data-structures to operate on
- May lead to more accurate cost-modeling in the future as we can check the cost of actual IR
Since it already implements the functionality of tracking IR changes, it can be used to enable tight coupling with other components via callbacks.

However, it also has some drawbacks, mainly compile-time and memory usage:

Compile-time overhead due to SandboxIR indirection.
Tracking changes for transactions is not free, as for every action performed on the IR we need to create a tracking object and push it into a vector.
Rolling back the IR state is a linear-time operation to the number of actions performed since we started tracking changes.

We have testing in progress for the overhead of the design choices described in this RFC.

Sandbox Vectorizer’s Pass Manager

The vectorizer implements a simple sub-pass manager, specifically designed with vectorization in mind.

It operates on SandboxIR
It supports SBFunction passes, SBBasicBlock passes and SBRegion passes

SBRegion

LLVM’s SLP Vectorization applies a series of transformations on a graph-shaped region of the code. After applying all these transformations it checks for profitability.

To achieve a similar functionality while also maintaining a modular design we introduced the SBRegion, which represents a section of the IR. SBRegions are serializable (meaning that we can describe an SBRegion in a lit test) with the help of LLVM Metadata that simply tag the instructions included in each region.

Current Implementation State

The current implementation is a proof of concept. It includes a basic SandboxIR implementation with a simple transactional scheme and a small set of vectorization-specific instructions. The vectorizer is a very simple bottom-up SLP-style vectorizer along with a couple of simple transformation sub-passes. As this is an early prototype the vectorization transformations are still work in progress and are not currently covering many common vectorization patterns. A snapshot of the code can be found here: Sandbox Vectorizer prototype · vporpo/llvm-project@f167cef · GitHub

Stability wise, it is passing the current llvm/clang lit tests and llvm-test-suite.

We would like to get feedback on the overall approach and design, and to be able to do further development in tree.

mshockwave · May 17, 2024, 6:04pm

It sounds like you want something like VPlan (for SLP) but synchronize with the underlying LLVM IR all the time. But what is the compilation time impact of this design? Because I presume rolling back LLVM IR is more expensive than just throwing away VPlan recipes.

Just want to say that I believe this LLVM rollback mechanism is pretty useful in many (scalar) optimizations, especially those who can’t or finds it difficult to predict the cost of the optimization result ahead of time. In which case we have no choice but cloning the IR region and try different optimizations. But it doesn’t scale well with the size of the IR region.

vporpo · May 17, 2024, 6:18pm

Yes, there is some overlap with VPlan in the sense that you are operating on a layer on top of LLVM IR, but the design goals and implementation are quite different since Sandbox IR stays always in sync with LLVM IR.

There is definitely some additional overhead, but from some early experimentation it does not look to be prohibitive.

Yes, this is exactly the context. Vectorization is such an optimization that requires you to estimate the cost before you modify the IR, which can be quite tricky.

alexey-bataev · May 17, 2024, 6:43pm

I have some concerns about the idea of checkpointing for the vectorizers. I don’t see how this approach could help estimate the cost effectively without significantly increasing compile time. The process would involve a lot of save/restore sequences, which seems inefficient. All potential vectorization transformations depend on legality and cost estimation, rather than just simple code emission followed by cost estimation and reverting, as proposed in this RFC.

As I mentioned before, and still firmly believe, a VPlan-like approach is the best solution for improving SLP vectorization. Integration with the VPlan is essential. Any solutions that do not address the core problems, but merely try to circumvent them, are not effective and may mask underlying implementation issues.

tschuett · May 17, 2024, 7:17pm

It would be a huge improvement when the project could rely on common infrastructure like VPlan, for loop vectorisation, SLP, unvectorized loop optimisations, and more.

vporpo · May 17, 2024, 7:21pm

The main goal is to have a non-monolithic vectorizer where you can individually test transformations with lit tests and also avoid issues where the vectorizer’s IR goes out of sync with the underlying IR, which can be quite error-prone. A transactional IR seems to solve both issues, but at the cost of increased compilation time.

The fact that this could also help fix the cost is also a nice side-effect of having a transactional IR. There are numerous cost-related issues that could benefit from this.

This RFC is not claiming that legality is not an integral step of vectorization. On the contrary, the legality infrastructure is actually tightly coupled with Sandbox IR and is always in sync which makes it almost like a first-class citizen of the IR.
Also, since Sandbox IR is always in sync with LLVM IR, you can also rely on LLVM IR verifiers to check your state, which can spot issues quite early.

Not sure which “core” problems you are referring to. The goal of this RFC is to address the problem of modularity and testability by providing infrastructure that helps with both.

Fair enough, the goal is not to replace either VPlan or the SLP Vectorizer. This is an experimental infrastructure with known potential shortcomings (like compilation-time) but also a lot of potential benefits, like ease of development and testing. So what we would like to do is develop this in parallel with the existing schemes.

qcolombet · May 21, 2024, 11:03am

FWIW, we have a very limited LLVM IR rollback mechanism in CodeGenPrepare.

The principle is much simpler:
We go through a custom IR builder to modify the current IR and that custom builder keeps track of all the actions we did and knows how to revert them.

The advantage is you always work on LLVM IR directly (no shadow IR), the disadvantage is all the modifications have to go through the custom builder, and again that builder is (currently) pretty limited on the kind of actions it allows.

github.com

llvm/llvm-project/blob/b7d911b76a6c44a3fcec56dd7e5fde9fbe63f4fe/llvm/lib/CodeGen/CodeGenPrepare.cpp#L2938


      
          LLVM_DUMP_METHOD void ExtAddrMode::dump() const {
            print(dbgs());
            dbgs() << '\n';
          }
          #endif
          
          } // end anonymous namespace
          
          namespace {
          
          /// This class provides transaction based operation on the IR.
          /// Every change made through this class is recorded in the internal state and
          /// can be undone (rollback) until commit is called.
          /// CGP does not check if instructions could be speculatively executed when
          /// moved. Preserving the original location would pessimize the debugging
          /// experience, as well as negatively impact the quality of sample PGO.
          class TypePromotionTransaction {
            /// This represents the common interface of the individual transaction.
            /// Each class implements the logic for doing one specific modification on
            /// the IR via the TypePromotionTransaction.
            class TypePromotionAction {

vporpo · May 21, 2024, 7:26pm

Thanks for sharing. Yeah, this approach has its advantages. But are there any safeguards that prohibit you from modifying the IR directly (i.e., not going through the API)? Because that could break the revert().

qcolombet · May 21, 2024, 7:43pm

No there isn’t.

Which is the same problem in your proposal iiuc, or any builder really.
E.g., GISel uses observers to trigger events on changes, and hooks them up in builders that are handed to the client. If the client doesn’t use these builders, the observers don’t work.

There’s no free lunch

vporpo · May 21, 2024, 7:57pm

In our case the author of a vectorization “pass” is forced to go through the tracking APIs because there is no way around it. Here is why. A vectorization “pass” overrides a runOnSBFunction(SandboxFunction &SBF) function so as a author of a vectorization pass you can’t even interact with LLVM IR. All you have access to is SandboxIR. So all changes made to SandboxIR (and subsequently to the underlying LLVM IR) are tracked automatically.

In the internals of SandboxIR, of course, you do have access to LLVM IR so you could mess things up there, but the user of SandboxIR can’t.

tmsriram · May 21, 2024, 9:40pm

If I understood this correctly, It should be possible to checksum the IR before registering every tracked change and then bail if the IR checksum while tracking change N+1 does not match the one after change N, which would imply an untracked change was made.

vporpo · May 21, 2024, 10:12pm

Yeah, that should work as a way to confirm that you have not missed any changes. You would also have to check the checksum before you revert() to make sure there are no changes missed after the last one that got registered. Though this looks quite expensive, even for EXPENSIVE_CHECKS.

alina · May 28, 2024, 6:01pm

Chiming in here to mention this is an experiment we (a few LLVM contributors and engineers at Google) would like to run to understand the feasibility and limitations of this approach and ideally we’d like to run it in the open, with input from the community and learning from it together.

We’d like to explore a modular approach, where each component can be experimented with, or even individually rewritten when a better algorithm is proposed. However the current patch is in no shape ready to use at this stage. The current implementation is only a proof of concept with limited vectorization capability, but the infrastructure built to support it, along with extensive unit testing, is already very large in terms of LoC, and it makes more sense to develop in the open and get feedback from the community in the process.

The checkpointing infrastructure also has a lot of potential (including reuse in other environments) and we want to use this scenario to improve its design and test its limits.

The intention is to have this implementation as an experimental variant, and not interfere with the health of the LLVM codebase (through thorough code reviews and the code path disabled by default). We are willing to contribute, review, maintain it and, if we conclude the overheads are too large, delete it from the tree, and hopefully we’ll all have learned something from it in the process :-).

Are there objections to this being an experimental pass in tree?

Side note: bikeshedding on the name of the SandboxIR is allowed and even encouraged =)).

alexey-bataev · May 28, 2024, 7:27pm

I think this is too early to make it even LLVM experimental pass. But adding it in the same form as llvm-polly might be a good idea. WDYT?

tschuett · May 29, 2024, 5:37am

Apparently, there is enough interest in a new approach. I vote for pick a new top directory or a new sub directory in transforms and start contributing upstream. An experimental pass under heavy development is way better than dead passes.

efriedma-quic · May 29, 2024, 7:22pm

One other approach nobody has mentioned is to just clone a region in the IR, guarded by an opaque condition. So you have the same loop twice, guarded by the condition. Then you optimize one side… and then you resolve the opaque condition one way or the other after you’ve run the relevant optimizations. This has a significant advantage in that you can just use regular IR manipulation and existing IR passes, instead of going through a special “sandbox” layer. The disadvantage, of course, is that it requires a bunch of cloning.

The fundamental issue here, though, is that LLVM IR isn’t actually a great IR for vectorization even with “sandboxing”. In particular, the representation of control flow is a big problem. So pretty much everyone tends to use some sort of abstraction layer to represent that. Have you considered translating from LLVM IR to MLIR? MLIR already has very stable infrastructure for translating both ways.

If you do decide to go forward with this, it might make sense to use the incubator process (LLVM Developer Policy — LLVM 19.0.0git documentation)

vporpo · May 29, 2024, 8:34pm

Checkpointing by cloning is indeed relatively simple, but it has its own downsides:

Pointers that point to one copy of the IR won’t work on another copy, so sharing state becomes problematic and requires additional maintenance
You have to rerun any analyses used for each IR copy
You may end up creating a lot of IR copies because without a copy you can’t revert your state. So even though your passes may make minor changes, you would still have to pay the penalty of cloning a large chunk of the IR.

With Sandbox IR we could reuse existing LLVM IR code within the vectorizer pipeline, since at any stage the LLVM IR will be valid, and does not need explicit lowering. For example the pipeline could look like:

SBIR_Pass_1 -> SBIR_Pass_2 -> LLVM_IR_Pass ->SBIR_Pass_3

The only catch is that you won’t be able to track/revert changes across.

Sandboxing is meant to help with modularity, and should help with stability and ease of development. It is very close to LLVM IR by design.

MLIR does not seem to support all the features we need, if I am not mistaken. This design is relying on transactions and on IR observers for tight-coupling with some of the components of the vectorizer.

aeubanks · May 29, 2024, 8:39pm

The question is what’s the burden vs benefit of adding this in-tree in the transforms subdirectory (experimental or not), or as an incubator project.

As an incubator project outside of monorepo:

Pros
- Doesn’t affect LLVM developers
Cons
- Very hard for people to contribute to

As an experimental part of the monorepo under llvm/ (controlled by a CMake flag and off by default, can be broken, similar to in-tree experimental targets):

Pros
- Doesn’t affect LLVM developers
Cons
- Can be broken by LLVM API changes
- Some friction for people to contribute to and use

In-tree as part of core LLVM:

Pros
- Easiest for people to contribute to and play around with
Cons
- May make it harder for LLVM-wide refactorings
- Potentially increased compile time of LLVM itself

Having been involved internally in the discussions around this project, I think it’s hard to know ahead of time if this design will prove to be a good and robust design. It’ll take some experimenting with over time to understand the ramifications of this point in the design space, but I do think it’s worth exploring this. I think the more easily accessible this is for people to play with and contribute to, the more successful this project can be, since one of the main goals of this is to make it easier for people to contribute to some implementation of a SLPVectorizer.

Feel free to disagree, but in my view separate top level monorepo projects or separate incubator github projects are really more for new experimental ways of compilation (e.g. enzyme) or optimization (polly). This project is more similar to things like GlobalISel or NewGVN that are trying a different approach to an existing part of LLVM. These tend to be in-tree. IIUC, NewGVN hasn’t really been a maintenance burden for anyone who doesn’t care about it. Having worked less with codegen, I’m not familiar how much burden GlobalISel has put on the rest of the project.

I’m of the opinion that we put this in-tree, compiled and tested by default, but if it proves to be too much of a maintenance burden, we can either not compile/test it by default (only under a CMake flag), or just remove it altogether. And of course in the long term if this project turns out to not be sustainable, we remove it from the tree.

efriedma-quic · May 29, 2024, 10:05pm

Spent a bit of time looking at the actual code. Maybe I was overestimating how invasive the necessary infrastructure would be; it looks pretty well isolated. And given it’s an IR->IR transform, it’ll probably stay that way. That said, it’s still a significant amount of code.

Can someone post a summary of what the actual vectorization algorithm looks like? I feel like I’m missing something about how the checkpointing actually gets used; I’m not really understanding how fine-grained checkpointing helps SLP vectorization.

vporpo · May 29, 2024, 10:44pm

A transactional IR can help simplify the design of transformations that may or may not change the IR depending on profitability. The most common design pattern for such transformations is to: (i) create an abstraction that describes what your IR would look like if you applied the transformation, (ii) estimate how the new IR would perform by querying the cost model using data from the abstraction, and (iii) either apply the changes to the IR or not, depending on the cost estimation. This is what the current SLP Vectorizer, and many other passes are doing.

However things get complicated when you want to apply a pipeline of transformations. In that case your abstractions need to be increasingly more IR-like, to an extent that would allow you to operate on them i.e., use them both as inputs and as outputs of your transformations. But in most cases your abstractions won’t replicate the whole state of the IR, so you still need to rely on the underlying IR for some of your state. This has the danger of your high-level abstractions going out of sync with the IR which can lead to bugs. Also, and most importantly, using such abstractions as inputs to transformations is not ideal for testing because unlike actual IR these abstractions are not serializable so you can’t just write a lit test to individually test such a transformation.

This is where a transactional IR comes in: You no longer need to develop and maintain such IR-like abstractions. Instead you operate directly on the transactional IR. If your transformation is not profitable you can always revert, just like you would do if you just discarded your high-level abstractions in a traditional pass design. In addition to this, SanboxIR in particular is always in sync with LLVM IR so you can write standard lit-tests to individually test each transformation in the pipeline of transformations that make up the pass.

This transactional-IR-based design seems like a good match for SLP Vectorization, but it is not specific to it. My guess is that there may be other passes that could make use of it.

Topic		Replies	Views
LLVM Loop Vectorizer LLVM Dev List Archives	42	161	October 6, 2012
WIP - MLIR News 72th Edition (26th Nov 2024) Newsletter llvm-weekly	0	106	November 24, 2024
[RFC] Vector Masking Representation in MLIR MLIR	12	1321	September 26, 2022
RV: A Unified Region Vectorizer for LLVM - now on github LLVM Dev List Archives	1	140	November 21, 2016
Questions on LLVM vectorization diagnostics LLVM Dev List Archives	6	115	August 30, 2016