Increasing address pool reuse/reducing .o file size in DWARFv5

tl;dr: in DWARFv5, using DW_AT_ranges even when the range is contiguous reduces linked, uncompressed debug_addr size for optimized builds by 93% and reduces total .o file size (with compression and split) by 15%. It does grow .dwo file size a bit. DWARFv5 with no compression and no split shows the net effect if all bytes are equal: the -O3 clang binary grows by 0.4%, the -O0 clang binary shrinks by 0.1%.
Should we enable this strategy by default for DWARFv5, for DWARFv5+Split DWARF, or not by default at all/only under a flag?

So, I’ve brought this up a few times before: DWARFv5 does a pretty good job of reducing relocations (& reducing .o file size with Split DWARF) by allowing many uses of addresses to be expressed as some kind of address+offset. debug_rnglists and debug_loclists allow a “base_address” followed by offset_pairs, which is an improvement over similar functionality in DWARFv4 because the offset pairs can be uleb encoded, so they can be quite compact.
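To put a rough number on "quite compact", here is a small Python sketch of unsigned LEB128 encoding. It is only an illustration, not LLVM code; the 64-bit address size and the sample offsets are assumptions picked for the example.

```python
def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


ADDRESS_SIZE = 8  # assumption: 64-bit target

# A DWARFv4-style range entry stores two address-sized values.
fixed_pair_size = 2 * ADDRESS_SIZE

# A DWARFv5 offset pair stores one opcode byte plus two ULEB128 offsets.
offset_pair_size = 1 + len(uleb128(0x1A)) + len(uleb128(0x1F))

print(fixed_pair_size, offset_pair_size)
```

For small functions both offsets fit in one byte each, so the entry shrinks from 16 bytes to 3 in this sketch.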

But one place where DWARFv5 misses a chance to reduce relocations further is direct addresses from debug_info, such as DW_AT_low_pc.

For a while I’ve wondered if we could use an extension form for addr+offset. I prototyped this without any extension, using exprloc instead. This has slightly higher overhead to express the… expression (it’s 9 bytes in total; it could be as few as 5 with a custom form).

But I had another idea that’s more instantly deployable: Why not use DW_AT_ranges even when the range is contiguous? That way the low_pc that previously couldn’t use an existing address pool entry + offset, could use the rnglist support for base address.

The only unnecessary address pool entries that remain that I’ve found are DW_AT_low_pc for DW_TAG_labels, but there’s only a handful of those in most code. So the “ranges everywhere” strategy gets the addresses for optimized clang down from 4758 (the v4 address pool used 9923 addresses…) to 342, with about 4 “extra” addresses for DW_TAG_labels.

This could also be a bit less costly if DWARFv5 rnglists didn’t use a separate offset table (instead encoding the offsets directly in debug_info, rather than using indexes)

I have patches for both the addr+offset exprloc and for the ranges-always, both with -mllvm flags - do people think they’re both worth committing for experimentation? Neither? Default on in some cases (like Split DWARF)?

Thanks,

- Dave

For the record, I think committing the options so we can get a better idea of how it looks would be fine. We should definitely figure out what seems to work best and leave it there, but in the meantime I think your plan sounds good.

-eric

I think this sounds like a good plan for Linux. I would like to see the numbers for Darwin (= non-split DWARF) to decide whether we should just make that the default. Eric's suggestion of having this committed as an option first seems like a good step in that direction. If it is an advantage across the board we can remove the option and just make this the default behavior.

thanks,
adrian

On some previous occasion that introduced additional indirection
(don't remember the details) my debugger people groused about the
additional performance cost of chasing down data in a different
object-file section. So we (Sony) might be happier with low_pc as
expressions, than with a ranges-always solution.

But hard to say without data, and getting both modes in at least
as a temporary thing sounds like a good plan.
--paulr

Sounds good all round - I’ll commit these two modes, and maybe even the third (given Sony’s interest & possible interest in changing their consumer to handle it) of a custom form to eke out the last few bytes from the more direct addr+offset encoding.

I’ll follow up here with flag names and revision numbers once they’re in.

I don’t totally follow the proposed encoding change & would appreciate a small example.

Is the idea to replace e.g. an ‘AT_low_pc () + relocation for ’ with an ‘AT_low_pc ( + offset)’, s.t. the cost of a relocation for the address is paid down the more it’s used? How do you figure the offset out?

thanks,
vedant

>I don’t totally follow the proposed encoding change & would appreciate a small example.
>
>Is the idea to replace e.g. an ‘AT_low_pc () + relocation for ’ with an ‘AT_low_pc ( + offset)’,

With Split DWARF or with DWARFv5 in LLVM at the moment, all addresses are indirected already. So it’s:

Replace “AT_low_pc ()” with an “AT_low_pc ( + offset)”.

>s.t. the cost of a relocation for the address is paid down the more it’s used?

Right - specifically to reduce the pool of addresses down to, ideally, one address per section/indivisible chunk of machine code (per subsection in MachO, for instance) (whereas currently there are many addresses per section)

>How do you figure the offset out?

Label difference - same as is done for DW_AT_high_pc today in DWARFv4 and DWARFv5 in LLVM. high_pc currently uses the low_pc address as the thing to be relative to; in this proposed situation, we’d use a symbol that’s in the first bit of debug info in the section (or subsection in MachO). So the low_pc of the subprogram/function, for instance, or if there are two functions in the same section with debug info for both, the low_pc of the first of those functions, etc…
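As a rough illustration of that base selection (a Python sketch, not the LLVM implementation; the function names, section names and offsets are made up for the example), each function is expressed as an offset from the first function seen in its section:

```python
from dataclasses import dataclass


@dataclass
class Function:
    name: str
    section: str
    start: int  # offset of the first instruction within its section


def base_relative(functions):
    """Map each function to (base symbol, label difference from that base).

    The base for a section is the first function emitted into it.
    """
    base_for_section = {}
    result = {}
    for fn in functions:
        base = base_for_section.setdefault(fn.section, fn)
        result[fn.name] = (base.name, fn.start - base.start)
    return result


funcs = [
    Function("f2", ".text", 0x00),
    Function("f3", ".text", 0x30),
    Function("f4", ".other", 0x00),
]
layout = base_relative(funcs)
print(layout)
```

Only the base symbols (one per section here) need a relocated address; everything else is a constant label difference known at assembly time.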

I think I get it now, thanks for explaining!

If the label difference in a low_pc attribute is relative to the start of a section, could a linker orderfile pass break the dwarf unless it updates the offset? Ditto, I suppose, for an intra-function offset when something like propeller is used to reorder basic blocks (I’m thinking of At_call_return_pc now).

Apologies if this has been answered elsewhere, I suppose there must be a solution for this for At_high_pc to work.

vedant

>If the label difference in a low_pc attribute is relative to the start of a section, could a linker orderfile pass break the dwarf unless it updates the offset?

Nah - terminologically, ELF sections are indivisible - more akin to MachO subsections. ELF files can have multiple sections with the same name, as is used for comdat sections for inline functions and for -ffunction-sections (roughly equivalent to MachO’s “subsections via symbols”, as I understand it). Alternatively they can use “.text.suffix” naming to give each separate .text section its own name, but the linker strips the suffixes and concatenates all these together into the final linked .text section.
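A toy model of why reordering is harmless for section-relative offsets (a Python sketch with invented section names and sizes; it ignores alignment padding, which is an assumption of the example):

```python
def link(sections, order):
    """Concatenate input sections in `order` and return each one's final address.

    `sections` maps a section name to its size in bytes.
    """
    addresses, cursor = {}, 0
    for name in order:
        addresses[name] = cursor
        cursor += sections[name]
    return addresses


sections = {".text.f2": 0x30, ".text.f3": 0x40, ".text.f4": 0x06}
# Offset of some lexical block inside .text.f3, fixed when the .o was assembled.
BLOCK_OFFSET = 0x16

original = link(sections, [".text.f2", ".text.f3", ".text.f4"])
reordered = link(sections, [".text.f4", ".text.f3", ".text.f2"])

for layout in (original, reordered):
    base = layout[".text.f3"]         # the one relocated address
    block_address = base + BLOCK_OFFSET
    # The base moves with the section; the offset from the base does not.
    assert block_address - base == BLOCK_OFFSET
```

The linker only has to fix up the base address through its relocation; the label difference stays valid because nothing inside the section moved.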

>Ditto, I suppose, for an intra-function offset when something like propeller is used to reorder basic blocks (I’m thinking of At_call_return_pc now).

Yeah - currently the “base address” for each section is determined by the first function with debug info being emitted in that section ( https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/AsmPrinter/DwarfDebug.cpp#L1787 ) - with PROPELLER we’d need to add similar code when function fragments are emitted. I’m planning to check the PROPELLER work-in-progress tree soon and do another sanity pass over the debug info emitted to check this is working as intended. That’s in part because this base address selection, coupled with DWARFv5 and maybe with the changes I’m suggesting in this thread (which I’ll commit under flags “soon”; it might take me a week or two judging by my review/bug/investigation load right now… fingers crossed), might make PROPELLER less expensive in terms of debug info size, or more expensive relative to the significant improvements this provides.

Owing to the way MachO debug info distribution works differently & if I understand correctly doesn’t need relocations in many cases due to DWARF-aware parsing/linking (& if it does use relocations, I’ve no knowledge of when/how and how big they are compared to the ELF relocations I’ve been measuring) it’s quite possible MachO would have different tradeoffs in this space.

I see, so an ELF linker may reorder sections relative to each other, but not the contents of a section. (That matches up with what I’ve read elsewhere - you’d use -ffunction-sections to reorder function symbols, IIRC.)

And in this proposal to increase address pool reuse, label differences in a MachO would be relative to the subsection. In Propeller, is basic block reordering done after a .o is emitted? If so, I suppose I don’t yet see how the proposed scheme is resilient to this reordering. OTOH if block reordering is done just before the label difference is evaluated, then there shouldn’t be any issue.

Thanks for investigating!

A linked .dSYM (analogous to an ELF .dwp, IIUC) doesn’t contain relocations for AT_low_pc or AT_call_return_pc in the simple examples I tried out. We do emit relocations for those attributes in MachO object files (there isn’t something analogous to a .dwo on MachO, the debug info just goes into a different set of sections in the .o). My understanding (based on the definition of macho_relocation_info in the ld64 sources) is that MachO relocations are 8 bytes in size. It looks like ELF rel/rela relocations are 16/24 bytes in size, but I’m not sure why (perhaps they’re more extensible / encode more information).
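For reference, the sizes fall out of the record layouts. The sketch below uses Python's struct module to mirror the 64-bit ELF relocation records and the Mach-O relocation_info record; the field descriptions in the comments are from memory of the public headers and should be treated as an approximation rather than an authoritative definition.

```python
import struct

# Elf64_Rel: r_offset (8 bytes) + r_info (8 bytes, symbol index and type).
ELF64_REL = struct.calcsize("<QQ")

# Elf64_Rela: the same two fields plus an explicit 8-byte signed addend.
ELF64_RELA = struct.calcsize("<QQq")

# Mach-O relocation_info: 4-byte r_address plus 32 bits of packed bit-fields
# (symbol number, pc-relative flag, length, extern flag, type).
MACHO_RELOCATION_INFO = struct.calcsize("<iI")


def relocation_bytes(entries: int, entry_size: int) -> int:
    """Total relocation bytes for `entries` relocated addresses."""
    return entries * entry_size


print(ELF64_REL, ELF64_RELA, MACHO_RELOCATION_INFO)
```

So each address that needs a relocation costs roughly three times as much in a 64-bit ELF rela object as in a Mach-O object, which is one reason the tradeoffs may differ between the two formats.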

Would a vanilla DWARFv4 .dwp (without your patches applied) contain a relocation for each ‘AT_low_pc ()’?

vedant

>I see, so an ELF linker may reorder sections relative to each other, but not the contents of a section. (That matches up with what I’ve read elsewhere - you’d use -ffunction-sections to reorder function symbols, IIRC.)

Right.

>And in this proposal to increase address pool reuse, label differences in a MachO would be relative to the subsection.

Even before my proposal, there are already many cases where rnglists and loclists in DWARFv5 (& location lists in DWARFv4) will use selectively chosen base addresses and symbol differences as often as possible (insofar as I could do that when working/experimenting with ELF).

So without function sections, for instance, rnglists for sub-function ranges already do this (ignoring PROPELLER for now/in this part of the discussion).

Perhaps an example would be helpful. Here’s LLVM’s current behavior with DWARFv5 and ELF, without function sections:

int f1();
void f2() {
  if (int i = f1()) {
    f1();
  }
}
void f3() {
  if (f1()) {
    int i = f1();
  }
}
__attribute__((section(".other"))) void f4() {
}

In this code there are only two ELF sections (“.text” contains the definitions of f2 and f3, “.other” contains the definition of f4) and so we /should/ be able to only have 2 relocations in the debug info.

(I’m exploiting something of a bug/quirk in Clang/LLVM’s debug info that causes, even at -O0, the lexical_block for the ‘if’ to have a hole in it, where the call to f1 is, so it has ranges rather than low/high pc)

In DWARFv4 this example would’ve used 10 relocations. (on the CU ranges, there would be begin/end for the “.text” range covering f2 and f3, and begin/end for the “.other” range covering f4, then the range list for the “if” lexical_block would contain another 2 pairs (4 addresses/relocations), one relocation for f2’s low_pc, one for f3’s ‘if’ lexical_block).

In DWARFv5, we see the following:

0x00000014: [DW_RLE_base_addressx]: 0x0000000000000000
0x00000016: [DW_RLE_offset_pair ]: 0x0000000000000008, 0x0000000000000014
0x00000019: [DW_RLE_offset_pair ]: 0x000000000000001a, 0x000000000000001f
0x0000001c: [DW_RLE_end_of_list ]
0x0000001d: [DW_RLE_startx_length]: 0x0000000000000000, 0x0000000000000036
0x00000020: [DW_RLE_startx_length]: 0x0000000000000002, 0x0000000000000006
0x00000023: [DW_RLE_end_of_list ]

The first range list is for the ‘if’ scope, the second is for the CU. Both are able to efficiently select encodings and base addresses.
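The offsets in that dump can be checked by hand. Here is a Python sketch that encodes the same two lists using the DW_RLE_* opcode values from the DWARFv5 spec; it is a toy encoder for illustration, not how LLVM emits them.

```python
DW_RLE_end_of_list = 0x00
DW_RLE_base_addressx = 0x01
DW_RLE_startx_length = 0x03
DW_RLE_offset_pair = 0x04


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode(entries) -> bytes:
    """Each entry is (opcode, *operands); all operands here are ULEB128."""
    out = bytearray()
    for opcode, *operands in entries:
        out.append(opcode)
        for operand in operands:
            out += uleb128(operand)
    return bytes(out)


if_scope = encode([
    (DW_RLE_base_addressx, 0),          # base = .debug_addr[0]
    (DW_RLE_offset_pair, 0x08, 0x14),
    (DW_RLE_offset_pair, 0x1A, 0x1F),
    (DW_RLE_end_of_list,),
])
cu = encode([
    (DW_RLE_startx_length, 0, 0x36),    # ".text": f2 and f3
    (DW_RLE_startx_length, 2, 0x06),    # ".other": f4
    (DW_RLE_end_of_list,),
])
print(len(if_scope), len(cu))  # 9 and 7 bytes
```

The 9 bytes of the first list match the dump's span from 0x14 to 0x1d, and the 7 bytes of the second match 0x1d through the end_of_list byte at 0x23.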

But the debug_addr has 4 addresses in it - the address at index 1 (not used in the rnglists shown above - we see index 0 and index 2 are used there) is for the low_pc of f3’s subprogram, and the address at index 3 is for the low_pc of f3’s if block/scope.

That’s the address/relocation that would be… addressed by the change I’m proposing. One way to avoid that relocation would be to encode f3’s address range using a rnglist - this is fully backwards compatible, and would produce a rnglist like this:

[DW_RLE_offset_pair ]: 0x0000000000000030, 0x0000000000000036
[DW_RLE_end_of_list ]

Similarly, f3’s if block could use a rangelist like:

[DW_RLE_offset_pair ]: 0x0000000000000046, 0x0000000000000054
[DW_RLE_end_of_list ]

As you can imagine, there are quite a few ranges (especially once you get inlining) that use low/high_pc, and could benefit from the reduction in relocations by using this strategy. Though it isn’t optimal (the range list encoding isn’t intended to be good for this use case) in terms of size cost - hence the possibility of using DWARF expressions for address class attributes, or a custom form that would more directly encode the + .

>In Propeller, is basic block reordering done after a .o is emitted?

Yes.

>If so, I suppose I don’t yet see how the proposed scheme is resilient to this reordering.

With PROPELLER any function that is fragmented into reorderable sections must necessarily use ranges to describe the function’s address range - but, again, choosing base addresses strategically & using relative references whenever possible, would help reduce the cost of PROPELLER’s debug info.

>OTOH if block reordering is done just before the label difference is evaluated, then there shouldn’t be any issue.

>A linked .dSYM (analogous to an ELF .dwp, IIUC) doesn’t contain relocations for AT_low_pc or AT_call_return_pc in the simple examples I tried out. We do emit relocations for those attributes in MachO object files (there isn’t something analogous to a .dwo on MachO, the debug info just goes into a different set of sections in the .o). My understanding (based on the definition of macho_relocation_info in the ld64 sources) is that MachO relocations are 8 bytes in size. It looks like ELF rel/rela relocations are 16/24 bytes in size, but I’m not sure why (perhaps they’re more extensible / encode more information).

OK, *nod* - with the smaller encoding it may be less of a pressing issue for you & the tradeoff may be different.

>Would a vanilla DWARFv4 .dwp (without your patches applied) contain a relocation for each ‘AT_low_pc ()’?

DWP files contain no direct addresses - they are all indirect through the address pool. But, yes, for a DWARFv4 Split DWARF build, low_pcs don’t have an opportunity to reuse a strategically chosen base address - they have to use an addrx form & the debug_addr section would have that specific address with a relocation for it.

Coming back around to this…

https://github.com/llvm/llvm-project/commit/ad18b075fd63935148b460f9c6b4dce130c56b15 Added the “always use ranges” option, currently off-by-default, usable with -gdwarf-5 -mllvm -always-use-ranges-in-v5=Enable (as the name implies, this has no effect on DWARFv4 and below, because there’s no benefit there). I have plans to make this the default behavior for Split DWARF since moving bytes from .o to .dwo is valuable even if it breaks pretty even - enough to justify this even though it’s a wash or maybe a slight cost to linked binary size (compared to unlinked object size).

I did come across a couple of lldb bugs related to using ranges on subprograms (“Ranges everywhere” can use ranges on subprograms where the subprogram is in the same section as another subprogram), sent fixes for them in: https://reviews.llvm.org/D94063 and https://reviews.llvm.org/D94064 - if anyone has a chance to look at those, it’d be most appreciated.

Once those lldb fixes are in, I’ll make the change to enable this feature by default when using Split DWARF unless anyone’s got objections to that.

& in the mean time I’m also working on patches for the other two candidates - novel DWARF expressions and an LLVM extension form.

All 3 options are now implemented & I’ve tidied up a flag name (still an -mllvm flag - I don’t think this should ever be a user-visible flag).

-mllvm -minimize-addr-in-v5=Ranges
Uses debug_rnglists even for contiguous ranges if doing so would avoid adding another entry to .debug_addr. E.g.: a CU with 3 functions, two in the same section. The first function in each section uses low/high, and the CU has a rnglist that can share/reuse the low_pc of those two functions. But a function that is later in a section that already has another function in it would use the low_pc of the first function in the section as its base address, plus an offset pair - avoiding the need for a 3rd debug_addr entry and associated relocation.
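That 3-function case can be counted out with a toy model (a Python sketch of the idea only; the function and section names are invented, and real address pools also hold entries for other attributes):

```python
def addr_pool_entries(functions, ranges_everywhere: bool) -> int:
    """Count .debug_addr entries needed for function start addresses.

    `functions` is a list of (name, section) pairs in emission order.
    """
    pool = []                  # one entry (and relocation) per pooled address
    sections_with_base = set()
    for name, section in functions:
        if ranges_everywhere and section in sections_with_base:
            # Described as base address + offset pair in a rnglist instead.
            continue
        pool.append(name)
        sections_with_base.add(section)
    return len(pool)


cu = [("a", ".text"), ("b", ".text"), ("c", ".other")]
print(addr_pool_entries(cu, ranges_everywhere=False))  # 3
print(addr_pool_entries(cu, ranges_everywhere=True))   # 2
```

With the option on, the pool size tracks the number of sections rather than the number of functions.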

-mllvm -minimize-addr-in-v5=Expressions
This uses the exprloc idea - using a non-trivial expression for a DW_AT_low_pc or other address classed attribute. This reduces the overhead compared to the ‘Ranges’ technique, and allows more cases - including DW_TAG_labels and DW_TAG_call_sites.

-mllvm -minimize-addr-in-v5=Form
Similar to Expressions, but using a custom form to make things a bit more compact (has the drawback that consumers who don’t recognize the form can’t parse any of the DWARF because they can’t skip over the attribute due to not knowing its size)
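The drawback is easy to see in a toy consumer (a Python sketch; the standard form codes are from the DWARFv5 spec, while the numeric value used for the LLVM extension form is an assumption for illustration). A reader skips attributes it doesn't care about by knowing each form's size, so an unrecognized form leaves it with no way to find the next attribute:

```python
DW_FORM_data4 = 0x06
DW_FORM_udata = 0x0F
DW_FORM_addrx = 0x1B
DW_FORM_LLVM_addrx_offset = 0x2001  # assumed value, for illustration only


def uleb_len(data: bytes, pos: int) -> int:
    """Number of bytes occupied by the ULEB128 value starting at `pos`."""
    n = 0
    while data[pos + n] & 0x80:
        n += 1
    return n + 1


def skip_attribute(form: int, data: bytes, pos: int) -> int:
    """Return the position just past one attribute value of the given form."""
    if form == DW_FORM_data4:
        return pos + 4
    if form in (DW_FORM_udata, DW_FORM_addrx):
        return pos + uleb_len(data, pos)
    raise ValueError(f"unknown form {form:#x}: cannot determine attribute size")


def skip_die(forms, data: bytes, pos: int = 0) -> int:
    """Skip every attribute of a DIE whose abbreviation lists `forms`."""
    for form in forms:
        pos = skip_attribute(form, data, pos)
    return pos


die = bytes([0x00, 0x09, 0x00, 0x00, 0x00])  # addrx index 0, then data4 = 9
print(skip_die([DW_FORM_addrx, DW_FORM_data4], die))  # 5
```

Handing the same bytes to skip_die with the extension form raises, which is the "can't parse any of the DWARF" failure mode: everything after that attribute becomes unreachable for such a consumer.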

For comparisons, a few different build modes using ‘Ranges’:

I should say all these builds are with compressed debug info enabled (in object files) and type units. the asan build uses compressed debug info in the linked binary and only gmlt.

But the main takeaway is this seems probably (to me) worth turning on for Split DWARF - it does mean the final build assets (exe+dwp) are slightly larger (1.28%), but the benefit in object and executable size seems probably generically worthwhile.

I plan to roll =Ranges out inside Google for cases that use Split DWARF, see if it sticks, and if so, change upstream to enable the feature by default under Split DWARF.

The other two modes generally make things better/reduce the tradeoff cost:

So with the custom form, we can even get to a total savings in both intermediate (.o/.dwo) and linked (exe/dwp) files, so it might even be applicable to non-split DWARF. (Though, again, the tradeoffs will look somewhat different without compression enabled, and building without type units might swing it one way or another a bit, probably not much though.)

I’d love to have the Form version supported in lldb and enabled by default when tuning/targeting lldb, but not sure I have the lldb expertise/time to implement that just yet.

Anyone have thoughts/ideas/interest in collaborating on any of this?

Hi, David, this looks great! I just started to play with this under llc
-minimize-addr-in-v5= and I will study it in the coming days.

>All 3 options are now implemented & I've tidied up a flag name (still an
>-mllvm flag - I don't think this should ever be a user-visible flag).
>
>-mllvm -minimize-addr-in-v5=Ranges
>Uses debug_rnglists even for contiguous ranges if doing so would avoid
>adding another entry to .debug_addr eg: a CU with 3 functions, two in the
>same section. The first function in each section uses low/high, the CU has
>a rnglist, and can share/reuse the low_pc of those two functions. But for a
>function that is later in a section that already has another function in it
>- that one would use the low_pc of the first function in the section as its
>base address, and an offset pair - avoiding the need for a 3rd debug_addr
>entry and associated relocation
>
>-mllvm -minimize-addr-in-v5=Expressions
>This uses the exprloc idea - using a non-trivial expression for a
>DW_AT_low_pc or other address classed attribute. This reduces the overhead
>compared to the 'Ranges' technique, and allows more cases - including
>DW_TAG_labels and DW_TAG_call_sites.

This option emits: DW_OP_addrx 0, DW_OP_const4u 9, DW_OP_plus.

DW_OP_const4u is a bit wasteful. This could be changed to DW_OP_addrx 0,
DW_OP_plus_uconst 9. However, the current implementation requires the size of the
DWARF expression, and we don't know the addend size of DW_OP_plus_uconst.

   .byte size_of_exprloc # This would be dependent on the size of .uleb128
   ...
   .byte 35
   .long .Ltmp1-.Lfunc_begin0
   # it'd be nice if we can use .uleb128 .Ltmp1-.Lfunc_begin0

size_of_exprloc could be changed to a subtraction of two labels.

When .uleb128 is used, we should be careful about assembler convergence.

* GNU as hacked around the problem specifically for .gcc_except_table by inserting an additional .align (bug 4029, "relax_segment can't stabilize .gcc_except_table"). That works for .gcc_except_table but can be a problem for our .uleb128 + .byte scheme.
* LLVM MC's solution is generic.
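For a sense of the bytes involved, here is a Python sketch that builds both exprloc encodings by hand. The opcode values are the standard DWARF ones; the ULEB128-addend variant is written with DW_OP_plus_uconst, the standard opcode that takes a ULEB128 operand. The index 0 and offset 9 are just the values from the example above.

```python
DW_OP_const4u = 0x0C
DW_OP_plus = 0x22
DW_OP_plus_uconst = 0x23
DW_OP_addrx = 0xA1


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def exprloc(expression: bytes) -> bytes:
    """DW_FORM_exprloc: ULEB128 length followed by the expression bytes."""
    return uleb128(len(expression)) + expression


def addr_plus_const4u(index: int, offset: int) -> bytes:
    """DW_OP_addrx <index>, DW_OP_const4u <offset>, DW_OP_plus."""
    return exprloc(
        bytes([DW_OP_addrx]) + uleb128(index)
        + bytes([DW_OP_const4u]) + offset.to_bytes(4, "little")
        + bytes([DW_OP_plus])
    )


def addr_plus_uconst(index: int, offset: int) -> bytes:
    """DW_OP_addrx <index>, DW_OP_plus_uconst <offset>."""
    return exprloc(
        bytes([DW_OP_addrx]) + uleb128(index)
        + bytes([DW_OP_plus_uconst]) + uleb128(offset)
    )


print(len(addr_plus_const4u(0, 9)), len(addr_plus_uconst(0, 9)))  # 9 and 5
```

The fixed-width version is always 9 bytes for a small pool index, while the ULEB128 version's length depends on the offset value, which is exactly why the emitter can't know the exprloc size up front.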

>-mllvm -minimize-addr-in-v5=Form
>  Similar to Expressions, but using a custom form to make things a bit
>more compact (has the drawback that consumers who don't recognize the form
>can't parse any of the DWARF because they can't skip over the attribute due
>to not knowing its size)

This option emits a new form: DW_FORM_LLVM_addrx_offset, which is the composite
of DW_FORM_addrx and DW_FORM_data4. This is superior to Expressions because the
bytes for the exprloc size and the plus operation can be saved.

Similar to Expressions, there is a question whether DW_FORM_udata would be better.
It could save 3 bytes (compare with DW_OP_plus_uconst).
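The 3-byte figure can be reproduced with a little arithmetic (a Python sketch; the "addrx + udata" composite is the hypothetical variant being asked about, not something that exists today):

```python
def uleb128_len(value: int) -> int:
    """Number of bytes needed to encode `value` as unsigned LEB128."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def addrx_data4_size(index: int, offset: int) -> int:
    """Composite of DW_FORM_addrx (ULEB128 index) and DW_FORM_data4."""
    return uleb128_len(index) + 4


def addrx_udata_size(index: int, offset: int) -> int:
    """Hypothetical composite of DW_FORM_addrx and DW_FORM_udata."""
    return uleb128_len(index) + uleb128_len(offset)


saved = addrx_data4_size(0, 9) - addrx_udata_size(0, 9)
print(saved)  # 3
```

The saving shrinks as offsets grow: an offset of 128 or more needs two ULEB128 bytes, and offsets at or above 2^28 would need five, one more than data4.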

>All 3 options are now implemented & I've tidied up a flag name (still an
>-mllvm flag - I don't think this should ever be a user-visible flag).
>
>-mllvm -minimize-addr-in-v5=Ranges
> Uses debug_rnglists even for contiguous ranges if doing so would avoid
>adding another entry to .debug_addr eg: a CU with 3 functions, two in the
>same section. The first function in each section uses low/high, the CU has
>a rnglist, and can share/reuse the low_pc of those two functions. But for a
>function that is later in a section that already has another function in it
>- that one would use the low_pc of the first function in the section as its
>base address, and an offset pair - avoiding the need for a 3rd debug_addr
>entry and associated relocation
>
>-mllvm -minimize-addr-in-v5=Expressions
> This uses the exprloc idea - using a non-trivial expression for a
>DW_AT_low_pc or other address classed attribute. This reduces the overhead
>compared to the 'Ranges' technique, and allows more cases - including
>DW_TAG_labels and DW_TAG_call_sites.

>This option emits: DW_OP_addrx 0, DW_OP_const4u 9, DW_OP_plus.
>
>DW_OP_const4u is a bit wasteful.

In short, this is consistent with how we encode instruction sequence
lengths in other places in LLVM today. (eg: DW_AT_high_pc could be
DW_FORM_udata, but we use DW_FORM_data4).

There's been some argument that using fixed-width forms improves DWARF
parsing performance significantly, but that idea's probably gone out
the window lately with exprloc (well, I guess we used 'blockN' before
that, which is also variable length, even if it might have a fixed
length length field to start with) and addrx forms. (Though we do use
fixed-width strx forms, though that means more abbreviations: a
DW_TAG_subprogram with a low-indexed name would have a different form
for the DW_AT_name than one with a high-indexed name that needed more
bytes to encode.)

>This could be changed to DW_OP_addrx 0,
>DW_OP_plus_uconst 9. However, the current implementation requires the size of the
>DWARF expression, and we don't know the addend size of DW_OP_plus_uconst.

Right - and that requirement for the current implementation is pretty
deeply embedded - we need to know the length of attributes so we know
the length of the DIEs that contain them so we know the offsets of
those DIEs so we can encode those offsets when doing DIE-to-DIE
references, etc. Pavel had a proposal a year or two ago about
potentially moving away from this and using symbolic references, label
differences, etc to do DIE offsets - as it'd make the resulting DWARF
assembly more legible and modifiable, but no one's taken that up as
yet (not sure if Pavel/others tried and hit any fundamental blockers)
and might have some performance tradeoffs, etc.
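A tiny Python sketch of that dependency (the DIE sizes are invented for illustration): every DIE's offset is the running sum of the sizes before it, so one attribute whose size isn't known until assembly time makes every later offset unknown too.

```python
def die_offsets(die_sizes, start: int = 0):
    """Offset of each DIE given the encoded size of every DIE before it."""
    offsets, cursor = [], start
    for size in die_sizes:
        offsets.append(cursor)
        cursor += size
    return offsets


# Second DIE holds a 9-byte fixed-width exprloc attribute and nothing else.
fixed = die_offsets([12, 9, 20])
# Same DIE if the attribute shrank to 5 bytes with a ULEB128 addend.
variable = die_offsets([12, 5, 20])
print(fixed, variable)
```

Any DIE-to-DIE reference pointing at the third DIE would need 21 in one case and 17 in the other, so the emitter has to settle the size before it can write the reference.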

>-mllvm -minimize-addr-in-v5=Form
> Similar to Expressions, but using a custom form to make things a bit
>more compact (has the drawback that consumers who don't recognize the form
>can't parse any of the DWARF because they can't skip over the attribute due
>to not knowing its size)

>This option emits a new form: DW_FORM_LLVM_addrx_offset, which is the composite
>of DW_FORM_addrx and DW_FORM_data4. This is superior to Expressions because the
>bytes for the exprloc size and the plus operation can be saved.
>
>Similar to Expressions, there is a question whether DW_FORM_udata would be better.
>It could save 3 bytes (compare with DW_OP_plus_uconst).

Yep, see the general discussion on that above.

Though there is some question here about what FORM we could/would
actually propose to standardize in DWARF. It looks like GCC uses
address-sized forms for instruction sequence lengths like DW_AT_high_pc
(data4 in 32-bit builds, data8 in 64-bit builds) and LLVM always uses
data4 (I implemented that based on GCC's behavior - guess I didn't look
too closely at the 32/64-bit aspect, or perhaps GCC's behavior changed
since I implemented LLVM's).

GCC always uses addrx (doesn't use any addrxN encodings), like LLVM.
(interestingly GCC also always uses strx, never a strxN encoding)

So at least for LLVM and GCC's current behavior, having a
DW_FORM_addrx_data4 and DW_FORM_addrx_data8 would be consistent. But
do we end up proposing a full matrix of
DW_FORM_addrx{,1,2,3,4}_{udata,data{1,2,4,8,16}} ? That'd be
unfortunate for the DW_FORM space, there aren't any other instances of
such combinatorial explosion in form types.
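To see how big that matrix would get, here is a quick Python enumeration of the hypothetical names (none of these forms exist in the standard; the naming just follows the pattern written above):

```python
from itertools import product

ADDRX_VARIANTS = ["addrx", "addrx1", "addrx2", "addrx3", "addrx4"]
OFFSET_VARIANTS = ["udata", "data1", "data2", "data4", "data8", "data16"]

# Hypothetical form names only; this is the full cross product.
forms = [f"DW_FORM_{a}_{o}" for a, o in product(ADDRX_VARIANTS, OFFSET_VARIANTS)]
print(len(forms))  # 30
```

Thirty new form codes for one feature, versus the two or three (addrx_data4, addrx_data8, addrx_udata) that current producers would plausibly use.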

It'd probably be good to have the hypothetical ideal
DW_FORM_addrx_udata standardized, in addition to at least the
addrx_data4 and addrx_data8 forms, even if no one's going to use it
right away. But it'll probably be an open discussion with the DWARF
committee about what/how they might see this being standardized.

- Dave