Our compilation pipeline runs on-device, just in time. We store constant ops with DenseElementsAttrs.
During constant folding, many constants get replaced or new ones are created.
At the end, we write all our constants back to files to keep the footprint on the host minimal.
But the MLIRContext keeps holding on to the DenseElementsAttrs even though we have erased the constant ops and nothing references those attributes anymore.
The network's parameters can be as large as 450MB and get bigger every year; running on embedded devices, we can't afford to keep large unused buffers around.
Our current workaround is to serialize the module, deserialize it into a new context, and delete the old context. This is a performance hit during network load time. We can also only do it at the end, so during constant folding the peak memory footprint is still huge: duplicate DenseElementsAttrs get allocated but are not freed immediately, since they are zone-allocated in the StorageUniquer.
So the question is: is there a way, or an option, to allocate these DenseElementsAttrs with reference counting, such that whenever nothing references them the underlying memory is freed immediately and returned to the heap/OS? Or is there some other way?
I imagine many inference networks might run into this as well.
Are there ways to get this "free immediately" behavior, and if not, can we have such an option? What would be the best way to do this?
In retrospect, I think that uniquing these large attributes (for the life of the context) was a design flaw. There are various ways to work around the limitations in special cases, but I think that, as you point out, these should have been reference counted all along.
I’ve looked at what it might take to walk back that design point but, tbh, I haven’t found anything non-invasive. I’m not sure if there is some kind of cleverness that can be done in the implementation, or if this just needs an invasive overhaul.
I don’t think there is anything that can be done here that composes correctly - we don’t want to refcount these, because then we’d have to refcount arrays and other aggregates that contain attributes.
The right answer (imo) is to not use these for non-trivially sized data. Certainly x00MB of tensor data shouldn’t be stored here. MLIR gives lots of ways to store things in other places, e.g. as a StringAttr that holds a filename containing the data.
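For illustration, one sketch of what that could look like (the dialect and op name here are made up, not an existing MLIR op):

```mlir
// Hypothetical op carrying a reference to on-disk data instead of the bytes.
%w = "mydialect.external_constant"() {path = "weights/conv1.bin"}
    : () -> tensor<64x3x7x7xf32>
```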
This is the design flaw I was referring to. The issue is that we have fairly pervasive use of ElementsAttr for constant folding and other optimizations, when (imo) there should have been a layer of indirection there, specifically because these have a tendency to represent x00MB data and such. What is right for data of that volume is not right for the more normal Attribute usage (i.e. context-uniquing regular attributes is desired).
Some systems do use either special custom constant ops or OpaqueElementsAttr to divert these to on-disk/mmap’d storage at rest (i.e. before saving), but that is only part of the problem. In the course of compiling a program with such constants, we may create updated copies of them multiple times, and those all live for the life of the context – and this is what drives high peak memory usage.
I’ve had in my mind for a while to have a setup based on OpaqueElementsAttr where we at least dump all of the permutations to disk vs keeping them resident. On large systems with a lot of swap, this probably doesn’t make much of a difference, but on mobile/embedded systems (typically with little or no swap), wired memory is at a premium and it may help. That still isn’t what I would design from the get-go if trying to construct a tensor-program compiler for memory constrained systems, but is a bit more tractable.
I do think that at some point, if we care about peak memory usage of the compiler for these classes of programs, we need a better solution than just rolling your own with StringAttrs and the like. But I also think that doing that “right” and consistently would likely be pretty invasive to constant handling code that exists today. I’d want to design such a thing explicitly and then set it right once.
What if DenseElementsAttr storage were modified to allocate constants beyond a certain size threshold on the heap instead of the bump-pointer allocator (neither uniqued nor ref-counted), with attributes carrying a bit saying whether they live on the heap? When an operation is freed, attributes with that bit set would be explicitly destroyed in the op’s dtor. In addition, the comparison operators would have to be modified to not use pointer equality. Can this work?
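As a standalone sketch of that proposal (hypothetical names, not actual MLIR internals), the storage could look roughly like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Stand-in for the context's zone allocator: blocks are only freed when the
// whole arena (context) is destroyed.
struct Arena {
  std::vector<std::unique_ptr<char[]>> blocks;
  char *allocate(size_t n) {
    blocks.push_back(std::unique_ptr<char[]>(new char[n]));
    return blocks.back().get();
  }
};

constexpr size_t kHeapThreshold = 1024; // bytes; made-up cutoff

struct ElementsStorage {
  char *data;
  size_t size;
  bool onHeap; // set when the bytes were heap-allocated, not arena-allocated

  static ElementsStorage create(const char *bytes, size_t n, Arena &arena) {
    bool heap = n >= kHeapThreshold;
    char *buf = heap ? new char[n] : arena.allocate(n);
    std::memcpy(buf, bytes, n);
    return {buf, n, heap};
  }

  // In the proposed scheme, this would be called from the owning op's dtor.
  void destroy() {
    if (onHeap)
      delete[] data;
    data = nullptr;
  }

  // Heap-backed storage is not uniqued, so equality must compare contents
  // rather than pointers.
  bool operator==(const ElementsStorage &rhs) const {
    return size == rhs.size && std::memcmp(data, rhs.data, size) == 0;
  }
};
```

Two equal large constants would then be distinct allocations that compare equal by content, and each can be freed as soon as its op is erased.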
I could have sworn we had a clone-into-another-context method, but it seems I’m misremembering. I sort of consider these a manual GC sweep.
Chris’s suggestion and/or using something like an OpaqueAttr would probably be the simplest. The downside is you’d probably end up needing to do additional work for constant folding… Well, actually I don’t think you would be able to get constant folders to work as you expect and not have duplicates, as Stella alludes to too. Conceptually, one could record the pre-folding bump allocator state, get the generated ElementsAttr, convert it back to an “external” constant, and roll the allocators back; I don’t think the BumpPtrAllocator actually allows that, so the allocation strategy would need to be changed too. Hmm, if we could run constant folding in a different context than the context of the ops, then that could work (we could just destroy the MLIRContext used for folding).
So it would not be a trivial change (though there are potentially some cases where it is simple).
But if you could forego general constant folding, then the “external reference” approach would work.
This would mean that a (large) constant value reused by different functions is guaranteed to be duplicated, to enable simple destruction. That works for workloads where constants are not duplicated and/or are mostly replaced in place. But it would work with the existing folders.
Which would mean that whether a function is cheap or expensive becomes unknown - e.g., CSE’s runtime may or may not now depend on the size of the attributes of the ops. That would make writing optimizations more difficult.
I think we need to elaborate on this one more… this is the closest to how the system was intended to work. As long as we can implement the same kinds of things we do today for constant folding, we can change the mechanism. Not sure what it would look like, but it might be worth considering.
Some more comments inlined, mostly for threads I’ve gone down in my head before and mostly felt weren’t right.
A variant I’ve played with in my head is to have a different arena (could just be the heap) for large attribute storage like this. Then instead of tying it to operator destruction, have some kind of GC that could be triggered, saying “only retain arena allocated entities referenced from these modules”: then folks who have memory pressure/care could trigger some compaction by promising that they are only keeping certain root operations live.
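A minimal standalone sketch of that idea (all names hypothetical; MLIR has no such API today): the context keeps a registry of large allocations, callers declare which root "modules" are still live, and everything unreachable from them is swept.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// A large attribute payload living in a dedicated arena (here, the heap).
struct LargeBlob {
  std::vector<char> bytes;
};

// Stand-in for a root operation: it knows which blobs its ops still use.
struct Module {
  std::vector<LargeBlob *> referenced;
};

class LargeAttrArena {
public:
  LargeBlob *allocate(size_t n) {
    auto blob = std::make_unique<LargeBlob>();
    blob->bytes.resize(n);
    LargeBlob *ptr = blob.get();
    pool[ptr] = std::move(blob);
    return ptr;
  }

  // "Only retain arena-allocated entities referenced from these modules":
  // mark everything reachable from the roots, then sweep the rest.
  void collect(const std::vector<Module *> &roots) {
    std::unordered_set<LargeBlob *> live;
    for (Module *m : roots)
      for (LargeBlob *b : m->referenced)
        live.insert(b);
    for (auto it = pool.begin(); it != pool.end();) {
      if (live.count(it->first))
        ++it;
      else
        it = pool.erase(it); // frees the blob immediately
    }
  }

  size_t size() const { return pool.size(); }

private:
  std::unordered_map<LargeBlob *, std::unique_ptr<LargeBlob>> pool;
};
```

Folks under memory pressure could call `collect` at pass-pipeline boundaries; everyone else pays nothing.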
Or we do something where we require these new LargeElementsAttrs to only be carried by special constant operations with the behavior you describe. As Jacques says, without some kind of sharing, I suspect we’re going to have more copies floating around.
I imagine that both approaches could be made to work after a fashion, but they are pretty big departures from how allocation works today and feel fairly at odds with it.
We might be able to come up with a narrower contract, but generalized constant folding of large constants is pretty critical. There may be a better way to go about it, though – perhaps with some better lifetime support?
Another variant I considered was being able to checkpoint the current context and roll allocations back to a previous state. This, or the separate context, may help cut down on some of the temporaries induced by constant folding, but it would still leave you with the large allocations of state0 and stateN, with a lot of dead/wasted memory.
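A sketch of what such a checkpoint/rollback allocator could look like (hypothetical; llvm::BumpPtrAllocator does not support this today, as noted up-thread):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// A block-list allocator that can record its current state and later drop
// everything allocated since, e.g. the temporaries produced by a round of
// constant folding. Names are made up for illustration.
class CheckpointableArena {
public:
  using Checkpoint = size_t; // number of blocks currently allocated

  char *allocate(size_t n) {
    blocks.push_back(std::unique_ptr<char[]>(new char[n]));
    return blocks.back().get();
  }

  Checkpoint checkpoint() const { return blocks.size(); }

  // Frees every allocation made after the checkpoint. Anything allocated
  // before it (state0) and anything re-allocated afterwards (stateN) still
  // sticks around, which is the weakness noted above.
  void rollback(Checkpoint cp) { blocks.resize(cp); }

  size_t numAllocations() const { return blocks.size(); }

private:
  std::vector<std::unique_ptr<char[]>> blocks;
};
```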
We also have OpaqueElementsAttr (different from OpaqueAttr). That always felt to me like, if generalized a bit, might be a path to a solution (and could allow things like compacting storage for non-live elements while still providing uniquing support for the Attribute handle).
+1 to this: I see this as a good way to avoid “refcounting everything” and being selective about what can be compacted or not. I discussed this with River multiple times in the past actually, but we’ve been slacking off on this because everything is in a single arena in the context, and designing something generic that fits the current system isn’t gonna be easy.
I was planning to look into this in the context of TensorFlow in the next quarter or so though.
(This seems like a pretty important feature in general for online/JIT systems that want to re-use a partially initialized context from run to run (i.e. load a bunch of modules of builtins/transformations/etc and be able to re-use that, throwing away the state from each actual compilation). Lmk if you ever want to chat about this more)
We also faced the same issue in our project, and I believe I’ve already asked a similar question some time ago.
We overcame this using “lazy constant folding”. Instead of creating a new DenseElementsAttr object, we use a special constant-like operation plus a set of attributes that store the original constant data and the transformations to apply to it. The transformations are applied on the fly (via a special transform-like range) when the data is accessed through the operation API. The implementation is quite bound to our internal use cases; we are working on generalizing it right now.
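A rough standalone illustration of the idea (hypothetical names; the real implementation is internal): the original data is kept at rest, pending transformations are recorded, and elements are computed only when read, so no intermediate DenseElementsAttr-style copy is ever materialized.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

class LazyConstant {
public:
  explicit LazyConstant(std::vector<float> data) : original(std::move(data)) {}

  // "Folding" records a transformation instead of materializing a new
  // constant buffer.
  void addTransform(std::function<float(float)> fn) {
    transforms.push_back(std::move(fn));
  }

  // All pending transforms are applied on the fly when an element is
  // accessed via the op API.
  float at(size_t i) const {
    float v = original[i];
    for (const auto &fn : transforms)
      v = fn(v);
    return v;
  }

  size_t size() const { return original.size(); }

private:
  std::vector<float> original;                          // data at rest
  std::vector<std::function<float(float)>> transforms;  // pending folds
};
```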
I believe Stella already answered this quite eloquently; echoing similar thoughts.
We do write all our constants back to files at the end, with the final product, but doing this during compilation has two disadvantages in my mind, as everyone else pointed out. First, constant folding gets tricky, and we use it a lot for things like folding batchNorm into convolutions. Second, even if we did constant folding with files, that is a lot of disk accesses and would be a performance hit.
There was a clone-with-context method, but it used to ignore the context passed in; I found that out the hard way. I don’t see it anymore; perhaps it was removed because it wasn’t using the context. I have just created my own tiny little helper to do this manual GC.
Thanks, that seems interesting; however, I believe in our case we want to aggressively constant fold in the IR, with other passes in the pipeline reaping the benefits of the fold as soon as possible.
I think this can work for our case and we are happy to adopt it. We will trigger the GC aggressively and free up memory. We won’t need to serialize/deserialize to destroy contexts, and the memory stays allocated only as long as we know it is needed.
If we wanted to be super-aggressive we could invoke it after every constant-fold invocation and really keep memory low, at the cost of performance. That is overkill and we probably shouldn’t do it, but it is cool that this option would let us: we get to balance perf and memory at our own discretion.
But this sounds like a problem with the design of some specific dialect: it doesn’t make sense to “constant fold” weights. Quantization and other preprocessing tools should explicitly manage large weight data, not manipulate them through constant folding. Constant folding (e.g. through fold methods) is supposed to be always profitable, but folding “+1” into a large weight matrix would double the size of your weights: independent of how MLIR stores it, this is not profitable for the download size of a model!
OTOH, there are very small tensors that make sense to expose to the optimizer (e.g. zero-d ones) so having the ability to model them as typical MLIR attributes makes sense.
We need to have a predictable and stable set of semantics for these. Changing behavior depending on the size will only mask certain problems and make it harder to design and maintain the compiler IMO.
With that viewpoint, every dialect operating on such tensor constants has a problem (including math and tensor), and it has been like this since the beginning. I agree with you (and have often wished we instead had a facility like the one @vinograd47 mentions up-thread) that these transformations being automatic, and not driven by a cost model, is a problem waiting to happen. I would rather have machinery that builds a small graph of foldable expression trees and lets choices be made: there are plenty of cases where the right optimization is not to greedily fold but to partially materialize, or to materialize once at init (or other things that trade off duplication against runtime cost).
With that said, for the kind of tensor programs that typically get processed through the frontends, simple constant folding into (constant) weights is often a benefit and I think folks rely on it pretty heavily. It is also pretty fragile, though, and I wish it hadn’t been built that way (these decisions had already been put in place when I joined the project – this has annoyed me for a while but not quite at the level of doing anything about it).
In IREE, we root such constants differently, with explicit decisions to make them immutable and sink them into various regions. I believe that still relies on the default (ElementsAttr based) constant folding machinery at some level but at least introduces a decision point. We’re not very sophisticated on any of that yet, but it does at least prevent the duplication bloat with its simple heuristic to sink weights as constants or not.
There are a lot of such cases where it is beneficial to aggressively fold weight constants. However, as you point out, only a very small number of cases should probably be considered universally beneficial.
I see the folding hooks in MLIR as similar to the canonicalization patterns: they are supposed to be locally decidable, universally beneficial, and generate simpler IR.
In the context of “folding” the main assumption is that any materialized constant must be cheap. Any kind of constant folding that can create a “non-cheap” constant does not belong here IMO.
The problem for something like the tensor dialect operations, is that the concept of “cheap” is not universal… We worked around this in TensorFlow with a crude heuristic: tensorflow/constant_fold.cc at master · tensorflow/tensorflow · GitHub
```cpp
// Implements a TF specific policy on when constant folding is allowed.
// Policy: Always allow folding if we do not know the shape of an operand or
// result i.e., one of these values has non-static shape. If we know all the
// shapes, find the total size of the operands and results. Folding of the op is
// allowed if one of the following conditions are met:
// 1. size of results is less than a certain threshold (`kSizeThreshold`), or
// 2. size of results is within a factor (`kSizeFactor`) of size of operands
```
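A hand-written approximation of that policy as code (not the actual TensorFlow implementation; the constant values are made up, and a size of -1 stands in for "shape not statically known"):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int64_t kSizeThreshold = 1024; // made-up value
constexpr int64_t kSizeFactor = 2;       // made-up value

bool shouldFold(const std::vector<int64_t> &operandSizes,
                const std::vector<int64_t> &resultSizes) {
  int64_t operandTotal = 0, resultTotal = 0;
  // Always allow folding if any shape is unknown.
  for (int64_t s : operandSizes) {
    if (s < 0)
      return true;
    operandTotal += s;
  }
  for (int64_t s : resultSizes) {
    if (s < 0)
      return true;
    resultTotal += s;
  }
  // 1. results are small, or 2. results are within a factor of the operands.
  return resultTotal <= kSizeThreshold ||
         resultTotal <= kSizeFactor * operandTotal;
}
```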
This of course does not address the general question of constant folding, in particular when weights are involved. It is only intended to keep canonicalization powerful enough when “small constants” are present.
Note that this is still a bit of a different topic than the question of being able to reclaim memory / GC explicitly from the context.
Can we enumerate some cases where this is “often a benefit”? The only case I’m aware of is folding inference batchnorm into weights/biases. In IREE we’ve had some speculative discussions about folding “pre-packing” of weights in this way, but that seems well beyond a random fold/canonicalization pattern.
To throw in another example of where “random canonicalization/folding” cannot be allowed to run wild: if you have your weights stored as uint8 just for saving download cost (not execution speed), then typically the first thing the program does is dequantize them (expecting that to be fused) and do a float matmul. Without any imposed semantics / cost model, the compiler will always think “oh gee, I can constant fold this dequantization work” and turn the uint8 weights into f32, defeating the entire purpose of storing the weights in uint8 in the first place.
We maintain this distinction in the tf_saved_model dialect, but in all backends I’m aware of we seem to just eagerly inline all the values as std.constant and lose that distinction.
Pretty far off topic at this point, but can you send me offline context of where you see that? (I.e. afaik, we convert those to globals and then have a heuristic further down as to whether to sink them, but not all input dialects make the distinction and things can end up as loose constants easily.)
I’m not going to be able to come up with a comprehensive list. I just remember that when the Lite folks were pushing on this, I made a number of cautions to this effect, but a lot of the optimizations involved getting things into a foldable state and then just letting constant folding handle it. IME, a lot of these “optimize for inference” tools operate in this mode: converting variables to constants, doing greedy constant folding, and then adding various carve-outs to keep it from doing an un-beneficial thing. For such tools it gets worse than that, in that the automatic constant folding usually isn’t just an optimization but exists for correctness: without it, shapes-as-tensors and other things the runtime doesn’t actually have an implementation for are expected to be folded away. Painting with a broad brush, I know; I’ve just seen a lot of these kinds of things and believe it to be baked in pretty deep in such tools.
I am not super up to date on all those dialects, but I don’t think there is a structural problem here; it is just that no one has filled in a missing piece. The trick is that you don’t need to (or want to) use a “constant op” to define large out-of-line tensor data. For example, consider this pseudo MLIR:
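A sketch consistent with that description (op and dialect names hypothetical):

```mlir
// Huge data lives out of line, referenced by name, not as a foldable constant.
%cstHuge = "mydialect.weight"() {ref = "weights.bin"}
    : () -> tensor<1000000xf32>
// Tiny data is a normal, foldable constant.
%cstTiny = constant dense<1.0> : tensor<f32>
%a = "mydialect.addOne"(%cstHuge)
    : (tensor<1000000xf32>) -> tensor<1000000xf32>
%b = "mydialect.addOne"(%cstTiny) : (tensor<f32>) -> tensor<f32>
```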
In this case, folding “addOne” into cstHuge would be a big problem, but it is perfectly fine to fold it into “cstTiny”. However, we don’t want all the fold hooks for all the operations to have cost models. We also don’t want to store “some huge value” in immortal memoized MLIR memory.
The solution here is to… add another op (no surprise, this is MLIR after all)! Your frontend would just generate something like this:
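Something along these lines (dialect and op names hypothetical): a dedicated "large constant" op whose payload is referenced externally rather than stored as a uniqued attribute, so no fold hook will ever rewrite it.

```mlir
// The payload lives behind a handle (e.g. a file or side table), so the op
// is opaque to generic constant folding.
%w = "mydialect.large_constant"() {handle = "weights/layer0"}
    : () -> tensor<512x512xf32>
```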
Trust me, I know, and this is why I’ve never been happy with the reliance on automatic constant folding in these little inference optimizers. Taking that approach tends to produce early gains when looking at the simple programs common in basic ML inference scenarios. But then memory and storage blow up when you carry it forward to more complicated things, and it is really hard to unwind (with a tendency to then introduce various things to selectively block constant folding, etc.). Speaking hypothetically for a friend, of course.
In IREE, we make a distinction internally, and as Mehdi points out, the TF dialect does have a simple heuristic – but there’s a lot of slop across the ecosystem on this topic. It would be good to get one solid frontend example done right so we could at least have something to point to as the correct way to be doing this. Not quite at the top of my list of issues at the moment, but it will be before the end of the year.
Particularly appropriate, as folding is called during canonicalization.
Another case that came up during the TF policy design was that some ops need to be folded to enable converting the model at all. That’s a little bit of a weird requirement, so it is not captured in the policy; it would effectively mean another pass that folds independent of the folding policy. But the fold functions are all written expecting DenseElementsAttrs as input and producing one, so we’d incur the cost.
Except where cstHuge is referenced by only one op and updating the value in memory is profitable (removes an operation, no memory increase); but that’s an edge case.
Agreed, except where one wants to force folding for an unsupported op (as Stella also mentioned). The current folders won’t behave well there, but that is a problem that could be solved differently.
As Mehdi mentioned, the GC one is adjacent and interesting here too.
I think it is worthwhile to explore when and where this happens (I had a report about a 30x memory-overhead case, so that might be a nice test case here). Perhaps some good memory-usage profiles from actual use cases would help (sounds like a nice experiment).