Handling "gigantic" weights

Hey folks, I’m working on support for “gigantic weights” in Torch-MLIR, and I’m trying to understand which of the new tools we have is best suited. I currently see two new features in MLIR that help support this:

  1. ml_program.global with an external value, which is then linked into the program by symbol name later
  2. Resources. It seems like the work from that thread has landed and is being used for giant models, but I haven’t found any docs or guidance on how to use the feature in general, or for large weights specifically.

At the end of the day, in Torch-MLIR we just need an association between a name like “foo.bar.baz” and an external thing. For now I’ve started with ml_program and symbol references, since I found that approach documented and low-tech.
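As a minimal sketch of what the consumer side of that association might look like (lookupWeight is a hypothetical helper, and I’m assuming the global’s symbol name is the weight’s fully qualified name):

#include "mlir/Dialect/MLProgram/IR/MLProgram.h"
#include "mlir/IR/BuiltinOps.h"

using namespace mlir;

// Hypothetical helper: resolve a weight like "foo.bar.baz" to its global.
// An external global carries no data; the symbol name is the whole contract,
// and the consumer matches it against the checkpoint on the side.
ml_program::GlobalOp lookupWeight(ModuleOp module, StringRef weightName) {
  return module.lookupSymbol<ml_program::GlobalOp>(weightName);
}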

I guess the resources thing will be more useful when we want to carry those giant parameters inside the compiled MLIR artifact, rather than having them on the side in the model checkpoint? Still, it seems like keeping 500GB+ of parameters at rest in the checkpoint is the best approach; even the one-time cost of converting them into MLIR resources seems excessive.

cc @River707 @stellaraccident


Thanks @_sean_silva, I still need to land docs. I’ve been very distracted lately, but I intend to get to that this week.

The cost of “converting into MLIR resources” depends greatly on the context. A blob resource is effectively just a tuple of <ArrayRef, metadata (like alignment, mutability, etc.), deallocator>; the builtin DenseResourceElementsAttr can hold these. For an easy example, if you wanted to create an attribute holding data that is owned by someone else, it could be as easy as:

// Create a resource blob with the data. `UnmanagedAsmResourceBlob` doesn't
// copy; it just references the data directly.
ArrayRef<T> myData = ...;
AsmResourceBlob blob = UnmanagedAsmResourceBlob::allocateInferAlign(myData);

// Some name to use for the resource.
StringRef nameForBlob = ...;

// Create an attribute that references the given resource blob. Note that
// AsmResourceBlob is move-only, so the blob is moved into the attribute.
// (This overload also inserts the resource, but there are others that don't.)
ShapedType type = ...;
auto attr = DenseResourceElementsAttr::get(type, nameForBlob, std::move(blob));

When we create resources, it’s almost always “free” because we either just mmap, hold a reference to data someone else owns (often ref-counted), etc. Right now we only really pay any “alloc” cost when loading MLIR (we still need an upstream way of knowing that we can mmap/just reference the bytecode buffer directly when loading resources).
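To make the mmap case concrete, here’s a rough sketch (the lifetime handling here is an assumption on my part, not a prescription): llvm::MemoryBuffer can map the weights file, and an UnmanagedAsmResourceBlob can reference those bytes without copying, as long as the buffer outlives the blob:

#include "mlir/IR/AsmState.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/MemoryBuffer.h"

using namespace mlir;

// Assumption: `buffer` must stay alive for as long as the blob (and any
// attribute referencing it) is in use; nothing below copies the bytes.
AsmResourceBlob mapWeights(const llvm::MemoryBuffer &buffer) {
  // MemoryBuffer::getFile will typically mmap large files, so these bytes
  // are file-backed pages rather than heap allocations.
  ArrayRef<char> bytes(buffer.getBufferStart(), buffer.getBufferSize());
  return UnmanagedAsmResourceBlob::allocateInferAlign(bytes);
}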

Yeah, the part about loading/saving the MLIR bytecode is what I’m worried about. Given the sizes involved, even a really, really fast SSD is going to struggle to write out that file in less than 10 minutes, even if serializing the compute graph itself would be “instant”. Torch-MLIR sits right at the boundary between Torch and MLIR, so unfortunately we’re going to be doing this a lot as part of our dev flow.

Also, due to “different LLVM versions” constraints, many of the consumers of Torch-MLIR’s output go through a serialization step before pulling it into their in-memory MLIRContext. That code right now looks like load(serialize(module)) – which seems to break mmapping, since it materializes an intermediate “deep” serialization of the module as a Python string/etc. The use case here is “I convert torch dialect to (say) tosa, and now I need to serialize for consumption by a different LLVM version” – assuming that we start with a torch dialect .mlirbc with giant resources, we just want to “write the file with the same resources, but with the new tosa dialect IR” – how do we avoid duplicating the resources on disk or in memory? The only portable way I can think of is to mutate the original file and overwrite the torch dialect IR with the new tosa IR (but obviously that is weird – the IR might be a different size, etc.).
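Concretely, the handoff pattern is something like this (a C++ sketch of what the Python flow does; crossVersionHandoff is a made-up name):

#include "mlir/Bytecode/BytecodeWriter.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Parser/Parser.h"
#include "llvm/Support/raw_ostream.h"

using namespace mlir;

// Sketch of load(serialize(module)): the intermediate `buffer` is a full
// deep copy of the module *and* all of its resources, which is exactly
// what defeats mmap'ing when the resources are hundreds of GB.
OwningOpRef<ModuleOp> crossVersionHandoff(ModuleOp module,
                                          MLIRContext &consumerContext) {
  std::string buffer;
  llvm::raw_string_ostream os(buffer);
  (void)writeBytecodeToFile(module, os); // serialize(): copies everything
  os.flush();
  return parseSourceString<ModuleOp>(buffer, &consumerContext); // load()
}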

As far as I know, resources don’t force you to serialize them all in the module output: you could very well serialize an IP address of a server plus an ID to recover the data later, for example. There is enough flexibility to handle all of this, I believe; it’s true that we’re missing examples of all this in-tree!
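For instance (a sketch; the locator format here is made up), the blob payload can be a tiny locator rather than the weights themselves:

#include "mlir/IR/AsmState.h"
#include "llvm/ADT/StringRef.h"

using namespace mlir;

// Sketch: the resource payload is a small, made-up "host:port/key" locator,
// e.g. "weights-server:9000/foo.bar.baz". A consumer that understands the
// encoding fetches the real bytes later; the module itself stays tiny.
AsmResourceBlob makeLocatorBlob(StringRef locator) {
  ArrayRef<char> bytes(locator.data(), locator.size());
  // A heap blob copies the few locator bytes; that copy is trivially cheap.
  return HeapAsmResourceBlob::allocateAndCopyInferAlign(bytes);
}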

Yes, a resource only requires that the dialect it is associated with knows how to interpret it, and that doesn’t require the data to be in the module, as Mehdi says.

I was also combining these two: have an ml_program global with its initializer as a resource.

ml_program on its own should handle the “someone else knows what this refers to” level.

Yes, but if it is just an ID, it can be easily embedded in the IR since it is “small” – what is the advantage of using a resource in that case?

My perspective: this is so far a (good) discussion about the mechanism for embedding/linking, but the actual driver, in my experience, is the use case. I think there are at least three:

  1. Reference to some immovable storage where the weights live. Depending on the case, this could be anything from a void*, to a file/location, to a device/pointer. This most often shows up in “online” cases where some outer runtime is involved. Depending on how torch-mlir is being used, this is probably a real case (i.e. you would need to capture live references if you want to interop with the eager executor).
  2. Symbolic reference to some framework-specific storage (i.e. file, checkpoint, etc.).
  3. Snapshotted and embedded (either as frozen constants or still-mutable variables). This is most often the case when trying to create a hermetic/deployable artifact of some kind.

The third is really the only case where we care about the efficiency of encoding and minimizing copies as a first-order thing. I see two sub-cases of this:

A. “Small”: we bundle everything up together into the program (or MLIR resources). Has the advantage of being really convenient, can’t be messed up later, etc.

B. “Large”: snapshotted data that is so large that a user wants to optimize for making at most one copy of it. Some of these terabyte-scale things push me this way. For these, I would want some kind of modality that lets me serialize the weights to independent resources (files) and link them symbolically from the main program.

For a toolkit like torch-mlir, based on usage, you’ll probably need to support all of these at some point, but you can leave most of the opinions to users of the tools. I think what is coming up now is a practical difference between 3a and 3b, whereas the happy path of the tooling has mostly been focused on 3a. Given that we just made that way more efficient with bytecode/resources/etc., it probably makes sense to keep getting the most we can out of that for the short term while keeping an eye on 3b: these things are only getting bigger, and my gut tells me that yet more strategies will be required to handle the truly massive.

Resources are just our standard way to decouple “storage” of the data from how it gets hooked into the IR.
You could reinvent it all with a string attribute and a lot of custom handling around it, I guess.