Reducing DWARF emitter memory consumption

Hi all,

We have profiled [1] the memory usage in LLVM when LTO'ing Chromium, and
we've found that one of the top consumers of memory is the DWARF emitter in
lib/CodeGen/AsmPrinter/Dwarf*. I've been reading the DWARF emitter code and
I have a few ideas in mind for how to reduce its memory consumption. One
idea I've had is to restructure the emitter so that (for the most part) it
directly produces the bytes and relocations that need to go into the DWARF
sections without going through other data structures such as DIE and DIEValue.

I understand that the DWARF emitter needs to accommodate incomplete entities
that may be completed elsewhere during tree construction (e.g. abstract origins
for inlined functions, special members for types), so here's a quick high-level
sketch of the data structures that I believe could support this design:

struct DIEBlock {
  SmallVector<char, 1> Data;
  std::vector<InternalReloc> IntRelocs;
  std::vector<ExternalReloc> ExtRelocs;
  DIEBlock *Next;
};

// This would be used to represent things like DW_AT_type references to types
struct InternalReloc {
  size_t Offset; // offset within DIEBlock::Data
  DIEBlock *Target; // the offset within Target is at Data[Offset...Offset+Size]
};

// This would be used to represent things like pointers to .debug_loc/.debug_str or to functions/globals
struct ExternalReloc {
  size_t Offset; // offset within DIEBlock::Data
  MCSymbol *Target; // the offset within Target is at Data[Offset...Offset+Size]
};

struct DwarfBuilder {
  DIEBlock *First;
  DIEBlock *Cur;
  DenseMap<DISubprogram *, DIEBlock *> Subprograms;
  DenseMap<DIType *, DIEBlock *> Types;
  DwarfBuilder() : First(new DIEBlock), Cur(First) {}
  // builder implementation goes here...
};

Normally, the DwarfBuilder will just emit bytes to Cur->Data (possibly with
internal or external relocations recorded in IntRelocs/ExtRelocs), but if it
ever needs to create a "gap" for an incomplete data structure (e.g. at the end
of a subprogram or a struct type), it will create a new DIEBlock New, store it
to Cur->Next, store Cur in a DenseMap keyed by the subprogram/type/etc., and
set Cur to New. To fill a gap later, the DwarfBuilder can pull the DIEBlock
out of the DenseMap and start appending there. Once the IR is fully visited,
the debug info writer will walk the linked list starting at First, calculate
a byte offset for each DIEBlock, apply any internal relocations, and write
Data using the AsmPrinter (e.g. using EmitBytes, or maybe some other new
interface that also supports relocations and avoids copying).
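
In code, the two operations might look roughly like this (a hypothetical
sketch: startGap and emit are made-up helper names, 4-byte DW_FORM_ref4-style
internal references are assumed, and the usual LLVM headers plus <cstring>
are taken as given):

DIEBlock *DwarfBuilder::startGap(DISubprogram *SP) {
  DIEBlock *New = new DIEBlock;  // normal emission continues here
  Cur->Next = New;
  Subprograms[SP] = Cur;         // the tail of Cur is the gap for SP
  DIEBlock *Gap = Cur;
  Cur = New;
  return Gap;
}

void DwarfBuilder::emit(AsmPrinter &AP) {
  // Pass 1: assign each block its offset within the section.
  DenseMap<DIEBlock *, uint64_t> Offsets;
  uint64_t Offset = 0;
  for (DIEBlock *B = First; B; B = B->Next) {
    Offsets[B] = Offset;
    Offset += B->Data.size();
  }
  // Pass 2: patch internal relocations in place, then stream the bytes out.
  for (DIEBlock *B = First; B; B = B->Next) {
    for (const InternalReloc &R : B->IntRelocs) {
      uint32_t Target = Offsets[R.Target]; // ignoring endianness for brevity
      memcpy(&B->Data[R.Offset], &Target, sizeof(Target));
    }
    AP.OutStreamer->EmitBytes(StringRef(B->Data.data(), B->Data.size()));
    // ExtRelocs would be lowered to MC fixups against their MCSymbols here.
  }
}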

Does that sound reasonable? Is there anything I haven't accounted for?

Thanks,

Will look more closely soon - but I'd really try just writing out type units to MC as soon as they're done. It should be relatively non-intrusive (we build type units once; there's no ambiguity about when they're done). For non-fission type units it might be a bit tricky, because the type units still need a relocation for the stmt_list* (I'm trying to find where that's added now… I seem to have lost it), but fission type units should produce entirely static type units that are knowable the moment the type is being emitted, so far as I can tell - including the type hash and everything. You can write the bytes out to the AsmStreamer, etc., and forget about them entirely, except to keep the hash so you know you don't need to emit the type again.

I imagine this would provide most of the memory savings we would need (since types are most of the debug info), and, if not, it would be a good start.

*I think we might know what the stmt_list relocation is up-front, though - if that's the case, we'd be able to be as aggressive as I described for the fission case.
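
Roughly the bookkeeping I have in mind (a hypothetical sketch; computeTypeSignature and emitTypeUnitBytes are made-up names standing in for whatever computes the DWARF type hash and builds the unit):

// Emit each type unit the moment it is finished; afterwards keep nothing
// but its signature, so re-encountering the type is a cheap set lookup.
llvm::DenseSet<uint64_t> EmittedTypeSignatures;

void maybeEmitTypeUnit(const DICompositeType *Ty, AsmPrinter &AP) {
  uint64_t Signature = computeTypeSignature(Ty); // hypothetical: the type hash
  if (!EmittedTypeSignatures.insert(Signature).second)
    return; // already emitted; only the hash was kept around
  emitTypeUnitBytes(Ty, Signature, AP); // hypothetical: build + stream the unit
}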

Does this design work well with the way llvm-dsymutil uses DIEs and DIEValues?

I'm also interested in whether this will be faster than the current one. I spent some time optimizing the teardown of the DIE tree in the summer and it would be nice not to lose that. (Sorry, maybe it's obvious from above, but I've only had a moment to skim your proposal. I'll try to look in more detail over the weekend.)

I'm staring at the profile attached to post #15 at the link you posted; can you confirm that the DWARF emitter accounts for 6.7%+15.6%=22.3% of the total allocated memory?
If I understand the numbers correctly, this does not say anything about how much the DWARF emitter accounts for in *peak memory* usage (could be more, could be nothing...).

Limiting the number of calls to the memory system is always welcome, so whatever the answer to my question is, it does not take away from the value of any improvements you could make here :)

Thanks,

Thanks, I'll look into that. (Though earlier you told me that debug info
for types could be extended while walking the IR, so I wouldn't have thought
that would have worked.)

Peter

> Does this design work well with the way llvm-dsymutil uses DIEs and DIEValues?

I haven't looked too closely at what llvm-dsymutil does, so I can't say
for sure. If it only uses DIE/DIEValue to produce DIEs, then it most likely
should work.

> I'm also interested in whether this will be faster than the current one.
> I spent some time optimizing the teardown of the DIE tree in the summer
> and it would be nice not to lose that. (Sorry, maybe it's obvious from
> above, but I've only had a moment to skim your proposal. I'll try to look
> in more detail over the weekend.)

I think it should be possible to tweak the design to use a bump pointer
allocator like we do now for DIE/DIEValue instead of allocating vectors on
the heap, but I haven't fully thought it through.
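
Something along these lines, perhaps (just a sketch of the idea, not something I've measured):

#include "llvm/Support/Allocator.h"

// Carve DIEBlocks out of a bump allocator so teardown collapses to freeing
// a few slabs. Caveat: DIEBlock's vector members still heap-allocate, so the
// byte/reloc storage itself would also need to move into the allocator for
// this to pay off fully.
llvm::BumpPtrAllocator Alloc;
DIEBlock *B = new (Alloc) DIEBlock(); // note: destructor is never run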

Thanks,

> Thanks, I'll look into that. (Though earlier you told me that debug info
> for types could be extended while walking the IR, so I wouldn't have
> thought that would have worked.)

Yeah, had to think about it more - and as I think about it, I'm moderately
sure type units (which don't include these latent extensions) will be
pretty close to static, with just the stmt_list relocation in non-fission
type units, which /should/ still be knowable up-front.

I think these nodes represent allocations from the DWARF emitter:

DwarfDebug::DwarfDebug 9.5%
DwarfDebug::endFunction 15.6%
DIEValueList::addValue 9.1%
total 34.2%

I believe they are totals, but my reading of the code is that the DWARF
emitter does not deallocate its memory until the end of code generation,
so total ~= peak in this case.

I am not surprised by these figures -- see e.g. DIEValueList::Node, which in
the worst case can use up to 24 bytes to represent a 1-byte DWARF attribute record.
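
To illustrate with a simplified model (this is the rough shape, not the actual LLVM declarations):

#include <cstdint>

// Every attribute value is wrapped in an intrusive list node, so on a
// typical 64-bit host even a 1-byte attribute payload drags along roughly
// 24 bytes of node.
struct ValueModel {
  uint16_t Attribute; // DW_AT_*
  uint16_t Form;      // DW_FORM_*
  uint64_t Payload;   // discriminated union in the real DIEValue
};                    // 16 bytes with padding
struct NodeModel {
  ValueModel V;
  NodeModel *Next;    // +8 bytes -> ~24 bytes per attribute value
};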

Ivan was the person who collected the numbers; he may be able to comment more.

Thanks,

> I believe they are totals, but my reading of the code is that the DWARF
> emitter does not deallocate its memory until the end of code generation,

That's sad :(

> so total ~= peak in this case.

Assuming the peak occurs during CodeGen (which is what I see on my profile), that sounds pretty reasonable!

Thanks for the information (and the work!).

Another question I have: how much worse does split codegen make the situation? Naively there will be a lot of redundancy in the split modules; for ThinLTO, Teresa has to proceed with care to limit the amount of duplication.

Mehdi

This is correct: the numbers are total allocations, and I believe there are not a lot of deallocations there, so this is a good place to optimize.

> Another question I have: how much worse does split codegen make the
> situation? Naively there will be a lot of redundancy in the split modules;
> for ThinLTO, Teresa has to proceed with care to limit the amount of
> duplication.

Hmm? Can you reword this slightly? I’m not sure what you’re asking here.

-eric

The parallel split codegen will take the big LTO module with all the debug info and produce multiple modules.
When splitting into multiple modules, you may have functions from the same DICompileUnit ending up in multiple modules. All the retained types would be pulled in.
(This is assuming you are already taking care of not pulling in the DICompileUnit when no function referencing it is in the split module.)
Then each thread would do redundant work processing this type hierarchy (and other debug info).

For ThinLTO, Teresa is taking care (review waiting here: http://reviews.llvm.org/D16440) to try to import as little as possible, and to turn type definitions into declarations when possible.

> For ThinLTO, Teresa is taking care (review waiting here:
> http://reviews.llvm.org/D16440) to try to import as little as possible,
> and to turn type definitions into declarations when possible.

Right - I don't think we'd ever need to import a definition - just rely on
the fact that we will produce a type definition somewhere in the output.
(This may present problems for LLDB - it's certainly had issues with type
declarations appearing where it would expect a definition, e.g. a type that
inherits from a declaration instead of a definition; not sure if that
problem extends to the case of by-value function parameters.)

So the impact of that cross-module importing should be pretty low for
ThinLTO. But the benefit of any work Peter does should be equally
beneficial to ThinLTO, since it still has to emit the types, build all the
DIEs, etc.

- Dave

I’m not sure you really answered my question, though; I may be misunderstanding what you mean here.

I’m not concerned about ThinLTO; any improvement to the DWARF emitter would be beneficial for any CodeGen. I’ll try to make my question clearer:

There is a “parallel code generator” for LTO that was added by Peter specifically to address Chrome LTO builds. I assume the memory consumption measurement we are talking about used this scheme (it is not mentioned how many threads).

When using the multi-threaded codegen, my concern would be that your 24 threads (random number here…) may be emitting the same DWARF information again and again, which would make the 30% memory usage not surprising. Since we noticed this has a huge impact on ThinLTO, I was pointing out an orthogonal way of addressing the memory concern for Chrome LTO.

> The parallel split codegen will take the big LTO module with all the debug
> info and produce multiple modules. When splitting into multiple modules,
> you may have functions from the same DICompileUnit ending up in multiple
> modules. All the retained types would be pulled in. (This is assuming you
> are already taking care of not pulling in the DICompileUnit when no
> function referencing it is in the split module.) Then each thread would do
> redundant work processing this type hierarchy (and other debug info).

Right. I think, in general, that the code generation profile we’re seeing here is going to be the same no matter the compilation mode, but merely compounded depending on how much debug info there is ultimately in the IR/etc.

-eric

Hi Mehdi,

Update: Peter told me on IRC that he believes the measurement was made with single-threaded codegen. I wonder how much worse the numbers would be with threading enabled :)

I've implemented a change which does this, and looked at the impact on memory
consumption and binary size when running "llc" on Chromium's 50 largest (by
bitcode size) translation units. Bottom line: *huge* savings in total memory
consumption: median 17% compared to before the change, and median 7% compared
to a baseline with type units disabled.

(I'm not yet confident that my patch is correct (some of the section sizes
are different and I'll need to double check what's going on there) but I'll
send it out once I'm confident in it.)

I think we can do better, though. With type units enabled, the size of
.debug_info as a fraction of (.debug_info + .debug_types) is median ~40%,
so I think there's another ~12% that can be saved by avoiding DIE/DIEValue
retention for debug_info, bringing the total to ~30%. I expect numbers with
type units disabled to be in the same ballpark (with type units enabled,
we consume ~25% more space in the object file on .debug_info + .debug_types,
so the proportional savings may be less, but the absolute memory consumption
should be lower). This also roughly lines up with the heap profiler figures
from before.

My conclusion from all this: I think we should do it, and I think it would
especially help in LTO mode with type units disabled: the type units feature
is redundant with LTO deduplication and would therefore add unnecessary bloat
to object files, which would mean increased memory usage (I measured a ~10%
median increase in memory usage comparing the current type units implementation
against type units disabled -- not an entirely fair comparison, but probably
good enough).

I have a plan in mind for doing this incrementally: we will start using the
more efficient data structure at the leaves of the DIE tree, and gradually
expand out to the root. You'll see what that looks like once I have my first
patch ready.
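
To give a rough idea of what I mean by starting at the leaves (a hypothetical sketch; the real patch will differ):

// A leaf value that stores pre-serialized bytes. Finished subtrees can be
// flattened into one of these while their ancestors remain ordinary DIEs,
// letting the new representation grow from the leaves toward the root.
// (Assumes the usual LLVM includes: SmallVector, MCStreamer, AsmPrinter.)
struct DIEPreSerialized {
  SmallVector<char, 8> Bytes;        // already-encoded attribute/child data
  std::vector<ExternalReloc> Relocs; // fixups still pending at emit time
  void emit(AsmPrinter &AP) const {
    AP.OutStreamer->EmitBytes(StringRef(Bytes.data(), Bytes.size()));
    // Relocs would be lowered through MC as usual.
  }
};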

Thanks,

This sounds reasonable and I like the plan as well.

Thanks for looking into this Peter!

-eric

> My conclusion from all this: I think we should do it, and I think it would
> especially help in LTO mode with type units disabled: the type units feature
> is redundant with LTO deduplication and would therefore add unnecessary bloat
> to object files, which would mean increased memory usage (I measured a ~10%
> median increase in memory usage comparing the current type units implementation
> against type units disabled -- not an entirely fair comparison, but probably
> good enough).

Oh, that's a fair point, for sure - you're particularly interested in LTO
where, I agree totally, type units are entirely overhead.

Pity, as I sort of liked that solution - not having to complicate the DIE
hierarchy, etc.

But improving the DIE hierarchy itself has more general benefits, for sure
- outside just type units, and outside just LLVM itself, into tools like
llvm-dsymutil, which is nice.

> I have a plan in mind for doing this incrementally: we will start using the
> more efficient data structure at the leaves of the DIE tree, and gradually
> expand out to the root. You'll see what that looks like once I have my first
> patch ready.

Hmm, rightio - I'm not sure I quite picture why/how it would be incremental
up the tree. If you want to just chat about it or sketch out the general
design, I'd be happy to hear; or we can talk over a patch if that's easier.
(Just want to save you time if working up the patch is going to be a lot of
work and we might end up making larger design changes to it.)

- Dave