[RFC] Segmented Address Space Support in LLDB

We want to support segmented address spaces in LLDB. Currently, all of LLDB’s external API, command line interface, and internals assume that an address in memory can be addressed unambiguously as an addr_t (aka uint64_t). To support a segmented address space we’d need to extend addr_t with a discriminator (an aspace_t) to uniquely identify a location in memory. This RFC outlines what would need to change and how we propose to do that.

Addresses in LLDB

Currently, LLDB has two ways of representing an address:

  • Address object. Mostly represents addresses as Section+offset for a binary image loaded in the Target. An Address in this form can persist across executions, e.g. an address breakpoint in a binary image that loads at a different address every execution. An Address object can represent memory not mapped to a binary image. Heap, stack, jitted items, will all be represented as the uint64_t load address of the object, and cannot persist across multiple executions. You must have the Target object available to get the current load address of an Address object in the current process run. Some parts of lldb do not have a Target available to them, so they require that the Address can be devolved to an addr_t (aka uint64_t) and passed in.
  • The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. from a user on the command line) or when interacting with the inferior (reading/writing memory) for addresses that need not persist across runs. Also used when reading DWARF and in our symbol tables to represent file offset addresses, where the size of an Address object would be objectionable.

Proposal

Address + ProcessAddress

  • The Address object gains a segment discriminator member variable. Everything that creates an Address will need to provide this segment discriminator.
  • A ProcessAddress object which is a uint64_t and a segment discriminator as a replacement for addr_t. ProcessAddress objects would not persist across multiple executions. Similar to how you can create an addr_t from an Address+Target today, you can create a ProcessAddress given an Address+Target. When we pass around addr_ts today, they would be replaced with ProcessAddress, with the exception of symbol tables where the added space would be significant, and we do not believe we need segment discriminators today.
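
To make the second bullet concrete, here is a minimal sketch of what a ProcessAddress value type could look like. Everything in it is an assumption for illustration: the class does not exist in LLDB, the width of aspace_t has not been decided, and the accessor names are placeholders.

```cpp
#include <cstdint>

// Hypothetical stand-ins; only addr_t mirrors an existing LLDB typedef.
using addr_t = std::uint64_t;
using aspace_t = std::uint32_t; // assumed width for the discriminator

// A process-specific location: a raw address plus the address space it
// belongs to. Not persistent across executions.
class ProcessAddress {
public:
  // Both parts are required, so a creator cannot forget the discriminator.
  constexpr ProcessAddress(addr_t addr, aspace_t aspace)
      : m_addr(addr), m_aspace(aspace) {}

  constexpr addr_t GetAddress() const { return m_addr; }
  constexpr aspace_t GetAddressSpace() const { return m_aspace; }

  // Two values name the same location only if both parts match.
  friend constexpr bool operator==(const ProcessAddress &lhs,
                                   const ProcessAddress &rhs) {
    return lhs.m_addr == rhs.m_addr && lhs.m_aspace == rhs.m_aspace;
  }
  friend constexpr bool operator!=(const ProcessAddress &lhs,
                                   const ProcessAddress &rhs) {
    return !(lhs == rhs);
  }

private:
  addr_t m_addr;
  aspace_t m_aspace;
};
```

Requiring both constructor arguments is one possible design choice; it trades convenience for making every creation site state its address space explicitly.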

Address Only

Extend the lldb_private::Address class to be the one representation of locations; including file based ones valid before running, file addresses resolved in a process, and process specific addresses (heap/stack/JIT code) that are only valid during a run. That is attractive because it would provide a uniform interface to any “where is something” question you would ask, either about symbols in files, variables in stack frames, etc.

At present, when we resolve a Section+Offset Address to a “load address” we provide a Target to the resolution API. Providing the Target externally makes sense because a Target knows whether the Section is present or not and can unambiguously return a load address. We could continue that approach since the Target always holds only one process, or extend it to allow passing in a Process when resolving non-file backed addresses. But this would make the conversion from addr_t uses to Address uses more difficult, since we will have to push the Target or Process into all the APIs that make use of just an addr_t. Using a single Address class seems less attractive when you have to provide an external entity to make sense of it at all the use sites.

We could improve this situation by including a Process (as a weak pointer) and fill that in on the boundaries where in the current code we go from an Address to a process specific addr_t. That would make the conversion easier, but add complexity. Since Addresses are ubiquitous, you won’t know what any given Address you’ve been handed actually contains. It could even have been resolved for another process than the current one. Making Address usage-dependent in this way reduces the attractiveness of the solution.

Approach

Replacing all the instances of addr_t by hand would be a lot of work. Therefore we propose writing a clang-based tool to automate this menial task. The tool would update function signatures and replace uses of addr_t inside those functions to get the addr_t from the ProcessAddress or Address and return the appropriate object for functions that currently return an addr_t. The goal of this tool is to generate one big NFC patch. This tool need not be perfect; at some point it will be more work to improve the tool than to fix up the remaining code by hand. After this patch LLDB would still not really understand address spaces but it will have everything in place to support them.

Once all the APIs are updated, we can start working on the functional changes. This means actually interpreting the aspace_t values and making sure they don’t get dropped.

Finally, when all this work is done and we’re happy with the approach, we extend the SB API with overloads for the functions that currently take or return addr_t. I want to do this last so we have time to iterate before committing to a stable interface.

Testing

By splitting off the intrusive non-functional changes we are able to rely on the existing tests for coverage. Smaller functional changes can be tested in isolation, either through a unit test or a small GDB remote test. For end-to-end testing we can run the test suite with a modified debugserver that spoofs address spaces.

Thanks,
Jonas

We want to support segmented address spaces in LLDB. Currently, all of LLDB’s external API, command line interface, and internals assume that an address in memory can be addressed unambiguously as an addr_t (aka uint64_t). To support a segmented address space we’d need to extend addr_t with a discriminator (an aspace_t) to uniquely identify a location in memory. This RFC outlines what would need to change and how we propose to do that.

### Addresses in LLDB

Currently, LLDB has two ways of representing an address:

- Address object. Mostly represents addresses as Section+offset for a binary image loaded in the Target. An Address in this form can persist across executions, e.g. an address breakpoint in a binary image that loads at a different address every execution. An Address object can represent memory not mapped to a binary image. Heap, stack, jitted items, will all be represented as the uint64_t load address of the object, and cannot persist across multiple executions. You must have the Target object available to get the current load address of an Address object in the current process run. Some parts of lldb do not have a Target available to them, so they require that the Address can be devolved to an addr_t (aka uint64_t) and passed in.
- The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. from a user on the command line) or when interacting with the inferior (reading/writing memory) for addresses that need not persist across runs. Also used when reading DWARF and in our symbol tables to represent file offset addresses, where the size of an Address object would be objectionable.

Correction: LLDB has 3 kinds of uint64_t addresses:
- "file address", which is always mapped to a section + offset if put into an Address object. This value only makes sense to the lldb_private::Module that contains it. The only way to pass this around is as a lldb_private::Address. You can make queries on a file address using "image lookup --address" before you are debugging, but a single file address can result in multiple matches in multiple modules because each module might contain something at this virtual address. This object might be able to be converted to a "load address" if the section is loaded in your debug target. Since the target contains the section load list, the target is needed when converting between Address and addr_t objects.
- "load address", which is guaranteed to be unique in a process with no segments. It can always be put into a lldb_private::Address object, but that object won't always have a section. If there is no section, it means the memory location maps to stack, heap, or other memory that doesn't reside in an object file section. This object might be able to be converted to a section + offset address if the address matches one of the loaded sections in a target. If this can be converted to an Address object that has a section, then it can persist across debug sessions; otherwise it cannot.
- "host address" which is a pointer to memory in the LLDB process itself. Used for storing expression results and other things. You cannot convert this to/from a "file" or "load" address.

## Proposal

### Address + ProcessAddress

- The Address object gains a segment discriminator member variable. Everything that creates an Address will need to provide this segment discriminator.

So an interesting thing to think about is if lldb_private::Section object should contain a segment identifier? If this is the case, then an Address object can have a Section that has a segment _and_ the Address object itself might have one that was set from the section as well. It would be good to figure out what the rules are for this case and it might lead to the need for an intelligent accessor that always prefers the section's segment if a section is available. The Address object must have one in case we have a pointer to memory in data and there is no section for this (like any heap addresses).

- A ProcessAddress object which is a uint64_t and a segment discriminator as a replacement for addr_t. ProcessAddress objects would not persist across multiple executions. Similar to how you can create an addr_t from an Address+Target today, you can create a ProcessAddress given an Address+Target. When we pass around addr_ts today, they would be replaced with ProcessAddress, with the exception of symbol tables where the added space would be significant, and we do not believe we need segment discriminators today.

Would SegmentedAddress be a more descriptive name here?

A few things I would like to see on ProcessAddress or SegmentedAddress:
- Have a segment definition that says "no segment" like LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT and allow these objects to be constructed with just a lldb::addr_t and the segment gets auto set to LLDB_NO_SEGMENT
- Any code that uses these should test if there is no segment and continue to do what they used to do before
  - like read/write memory in ProcessGDBRemote
  - Anything that dumps one of these objects should dump just like they used to (just a uint64_t hex representation and no other notation)
- Add code that can convert a "load address" into a ProcessAddress or SegmentedAddress that invent the segment notation and have no changes for targets that don't support segmented address spaces
  - 0x1000 should convert to ProcessAddress where the address is 0x1000 and segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT if the process doesn't support segmented addresses
  - 0x1000 would return an error on conversion for processes that do support segmented addresses as the segment must be specified? Or should there be a default segment if we run into this case?
  - Come up with some quick way to represent segmented addresses for an address of 0x1000 in segment 2: ideas:
    - [2]0x1000
    - {2}0x1000
    - 0x1000[2]
    - 0x1000{2}
    - {0x1000, 2}
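
As a rough illustration of the wishlist above, here is a sketch of a type that can be built from a plain address, defaults to "no segment", and dumps exactly like a bare uint64_t unless a segment is present. The value chosen for LLDB_NO_SEGMENT, the segment_t typedef, the Format helper, and the `[2]0x1000` notation are all assumptions picked for the example; none of them exist in LLDB.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

using addr_t = std::uint64_t;     // mirrors lldb::addr_t
using segment_t = std::uint32_t;  // hypothetical segment id type

// Hypothetical sentinel meaning "this address has no segment".
constexpr segment_t LLDB_NO_SEGMENT = UINT32_MAX;

struct SegmentedAddress {
  addr_t addr;
  segment_t segment;

  // Constructible from just an address; the segment defaults to "none".
  explicit SegmentedAddress(addr_t a, segment_t s = LLDB_NO_SEGMENT)
      : addr(a), segment(s) {}

  bool HasSegment() const { return segment != LLDB_NO_SEGMENT; }
};

// Dumps the address. Without a segment the output is the plain hex value,
// identical to what a bare addr_t would print today.
std::string Format(const SegmentedAddress &sa) {
  char buf[64];
  if (!sa.HasSegment())
    std::snprintf(buf, sizeof(buf), "0x%llx",
                  static_cast<unsigned long long>(sa.addr));
  else
    std::snprintf(buf, sizeof(buf), "[%u]0x%llx",
                  static_cast<unsigned>(sa.segment),
                  static_cast<unsigned long long>(sa.addr));
  return std::string(buf);
}
```

With this sketch, `Format(SegmentedAddress(0x1000))` yields `0x1000` and `Format(SegmentedAddress(0x1000, 2))` yields `[2]0x1000`; swapping in any of the other proposed notations only changes the second format string.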

### Address Only

Extend the lldb_private::Address class to be the one representation of locations; including file based ones valid before running, file addresses resolved in a process, and process specific addresses (heap/stack/JIT code) that are only valid during a run. That is attractive because it would provide a uniform interface to any “where is something” question you would ask, either about symbols in files, variables in stack frames, etc.

At present, when we resolve a Section+Offset Address to a “load address” we provide a Target to the resolution API. Providing the Target externally makes sense because a Target knows whether the Section is present or not and can unambiguously return a load address. We could continue that approach since the Target always holds only one process, or extend it to allow passing in a Process when resolving non-file backed addresses. But this would make the conversion from addr_t uses to Address uses more difficult, since we will have to push the Target or Process into all the APIs that make use of just an addr_t. Using a single Address class seems less attractive when you have to provide an external entity to make sense of it at all the use sites.

We could improve this situation by including a Process (as a weak pointer) and fill that in on the boundaries where in the current code we go from an Address to a process specific addr_t. That would make the conversion easier, but add complexity. Since Addresses are ubiquitous, you won’t know what any given Address you’ve been handed actually contains. It could even have been resolved for another process than the current one. Making Address usage-dependent in this way reduces the attractiveness of the solution.

## Approach

Replacing all the instances of addr_t by hand would be a lot of work. Therefore we propose writing a clang-based tool to automate this menial task. The tool would update function signatures and replace uses of addr_t inside those functions to get the addr_t from the ProcessAddress or Address and return the appropriate object for functions that currently return an addr_t. The goal of this tool is to generate one big NFC patch. This tool need not be perfect; at some point it will be more work to improve the tool than to fix up the remaining code by hand. After this patch LLDB would still not really understand address spaces but it will have everything in place to support them.

This won't be NFC really as each location that plays with what used to be addr_t now must check if the segment is invalid before doing what it did before _and_ return an error if the segment is something valid.

It might be better to look at all of the APIs that could end up using a plain "addr_t" and adding new APIs that take a ProcessAddress and call the old API if the segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT, and return an error if the segment is valid. For example in the Process class we have:

virtual size_t Process::DoReadMemory(lldb::addr_t vm_addr, void *buf, size_t size, Status &error) = 0;

We could add a new overload:

virtual size_t Process::DoReadMemory(ProcessAddress proc_addr, void *buf, size_t size, Status &error) {
  // No segment: fall back to the existing addr_t-based implementation.
  if (proc_addr.GetSegment() == LLDB_NO_SEGMENT)
    return DoReadMemory(proc_addr.GetAddress(), buf, size, error);
  // A real segment was given, but this process class does not handle them.
  error.SetErrorString("segmented addresses are not supported on this process");
  return 0;
}

Then we can start modifying the locations that need to support segmented addresses as needed. For instance, if we were to add segmented address support to ProcessGDBRemote, then we would override this function in that class.

I am not sure if slowly adding this functionality is better than replacing this all right away, but we can't just do a global replace without adding functionality or error checking IMHO.

Once all the APIs are updated, we can start working on the functional changes. This means actually interpreting the aspace_t values and making sure they don’t get dropped.

Finally, when all this work is done and we’re happy with the approach, we extend the SB API with overloads for the functions that currently take or return addr_t. I want to do this last so we have time to iterate before committing to a stable interface.

This might be one reason for doing the approach suggested above where we add new internal APIs that take a ProcessAddress and cut over to using them. As it would mean all of the current APIs in the lldb::SB layer would remain in place (they can't be removed) and would still make sense.

## Testing

By splitting off the intrusive non-functional changes we are able to rely on the existing tests for coverage. Smaller functional changes can be tested in isolation, either through a unit test or a small GDB remote test. For end-to-end testing we can run the test suite with a modified debugserver that spoofs address spaces.

That makes sense. ProcessGDBRemote will need to dynamically respond with whether it supports segmented addresses by overriding the DoReadMemory that takes a ProcessAddress and doing the right thing.

Thanks for taking this on. I hope some of the comments above help moving this forward.

Greg

There are a lot of things that are unclear to me about this proposal. The mechanics of representing a segmented address are one thing, but I think that the really interesting part will be the interaction with the rest of lldb. For example:
- What's going to be the source of this address space information? Is it going to be statically baked into lldb (a function of the target architecture?), or dynamically retrieved from the target or platform we're debugging? How would that work?
- How is this going to interact with Object/SymbolFile classes? Are you expecting to use existing object and symbol formats for address space information, or some custom ones? AFAIK, none of the existing formats actually support encoding address space information (though that hasn't stopped people from trying).

Without understanding the bigger picture it's hard for me to say whether the proposed large scale refactoring is a good idea. Nonetheless, I am doubtful of the viability of that approach. Some of my reasons for that are:
- not all addr_ts represent an actual address -- sometimes that is a difference between two addresses, which still uses addr_t, as that's guaranteed to fit.
- related to that, there is a difference (I'd expect) between the operations supported by the two types. addr_t supports all integral operations (though I hope we don't use all of them), but I wouldn't expect to be able to do the same with a SegmentedAddress. For one, I'd expect it wouldn't be possible to add two SegmentedAddresses together (which is possible for addr_t). OTOH, adding a SegmentedAddress and an addr_t would probably be fine? Should subtracting two SegmentedAddresses result in an addr_t? But only if they have matching address spaces (and assert otherwise)?
- I'd also be worried about over-generalizing specialized code which can afford to work with plain addresses, and where the added address space would be a nuisance (or a source of bugs). E.g. ELF has no notion of address space, so I don't think I'd find it helpful to replace all plain integer calculations in elf parsing code with something more complex. (I'm aware that some people are using elf to encode address space information, but this is a pretty nonstandard extension, and it'd take more than type substitution to support anything like that.)
- large scale refactorings are very much not the norm in llvm
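
To illustrate the point about operations, here is one possible shape for the restricted arithmetic, answering the questions above in the "probably fine" direction. It is only a sketch under assumptions: SegmentedAddress, aspace_t, and the choice to assert on mismatched address spaces are hypothetical and not an agreed design.

```cpp
#include <cassert>
#include <cstdint>

using addr_t = std::uint64_t;    // mirrors lldb::addr_t
using aspace_t = std::uint32_t;  // hypothetical address space id

struct SegmentedAddress {
  addr_t addr;
  aspace_t aspace;
};

// address + offset: allowed, the result stays in the same address space.
inline SegmentedAddress operator+(SegmentedAddress a, addr_t offset) {
  return {a.addr + offset, a.aspace};
}

// address - address: yields a plain distance, but only makes sense when
// both operands live in the same address space.
inline addr_t operator-(SegmentedAddress a, SegmentedAddress b) {
  assert(a.aspace == b.aspace && "cannot subtract across address spaces");
  return a.addr - b.addr;
}

// address + address: meaningless, so it is rejected at compile time.
SegmentedAddress operator+(SegmentedAddress, SegmentedAddress) = delete;
```

Code that stores a difference of two addresses would keep using addr_t under this scheme, which is exactly the kind of use a blind type substitution would get wrong.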

We want to support segmented address spaces in LLDB. Currently, all of LLDB’s external API, command line interface, and internals assume that an address in memory can be addressed unambiguously as an addr_t (aka uint64_t). To support a segmented address space we’d need to extend addr_t with a discriminator (an aspace_t) to uniquely identify a location in memory. This RFC outlines what would need to change and how we propose to do that.

### Addresses in LLDB

Currently, LLDB has two ways of representing an address:

- Address object. Mostly represents addresses as Section+offset for a binary image loaded in the Target. An Address in this form can persist across executions, e.g. an address breakpoint in a binary image that loads at a different address every execution. An Address object can represent memory not mapped to a binary image. Heap, stack, jitted items, will all be represented as the uint64_t load address of the object, and cannot persist across multiple executions. You must have the Target object available to get the current load address of an Address object in the current process run. Some parts of lldb do not have a Target available to them, so they require that the Address can be devolved to an addr_t (aka uint64_t) and passed in.
- The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. from a user on the command line) or when interacting with the inferior (reading/writing memory) for addresses that need not persist across runs. Also used when reading DWARF and in our symbol tables to represent file offset addresses, where the size of an Address object would be objectionable.

## Proposal

### Address + ProcessAddress

- The Address object gains a segment discriminator member variable. Everything that creates an Address will need to provide this segment discriminator.
- A ProcessAddress object which is a uint64_t and a segment discriminator as a replacement for addr_t. ProcessAddress objects would not persist across multiple executions. Similar to how you can create an addr_t from an Address+Target today, you can create a ProcessAddress given an Address+Target. When we pass around addr_ts today, they would be replaced with ProcessAddress, with the exception of symbol tables where the added space would be significant, and we do not believe we need segment discriminators today.

I'm strongly in favor of the first approach. The reason for that is that we have a lot of code that can only reasonably deal with one kind of address, and I'd like to be able to express that in the type system. In fact, I think we could have more distinct types even now, but adding address spaces makes that even more important.

I agree with Pavel about the larger picture - we need to know the driver behind address spaces before we can discuss a workable solution.

I've dealt with 2 use cases - Harvard architecture cores, and low level hardware debugging.

A Harvard architecture core has separate instruction and data memories. These often use the same addresses, so to distinguish between them you need address spaces. The Motorola DSP56300 had 1 program and 2 data memories, called p, x and y. p:100, x:100 and y:100 were all separate memories, so "address 100" isn't enough to get what the user needed to see.
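
A toy model may help show why the bare number is not enough. The sketch below keys memory by (space, address), so that p:100, x:100 and y:100 are three independent cells. The MemorySpace enum and ToyMemory class are invented for this example and say nothing about how a real debugger or the DSP56300 stores data.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical model of a core with one program and two data memories.
enum class MemorySpace { P, X, Y };

class ToyMemory {
public:
  void Write(MemorySpace space, std::uint32_t addr, std::uint32_t value) {
    m_words[Key(space, addr)] = value;
  }

  // Returns 0 for locations that were never written.
  std::uint32_t Read(MemorySpace space, std::uint32_t addr) const {
    auto it = m_words.find(Key(space, addr));
    return it == m_words.end() ? 0 : it->second;
  }

private:
  // The memory space is part of the key, so equal numeric addresses in
  // different spaces never alias.
  using Key = std::pair<MemorySpace, std::uint32_t>;
  std::map<Key, std::uint32_t> m_words;
};
```

Writing different values to address 100 in each of the three spaces and reading them back returns three different results, which is the behaviour a single addr_t cannot express.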

For low level hardware debugging (often using JTAG), many devices let you access memories in ways like "virtual using the TLB", or "virtual == physical, through the core", or "physical, through the SoC, not cached". Memory spaces, done right, can give the user the flexibility to pick how to view memory.

Are these the use cases you were envisioning, Jonas?

Hi Greg, Pavel.

I think it's worth saying that this is very early in this project. We know we're going to need the ability to track segments on addresses, but honestly a lot of the finer details aren't clear yet. It's such a fundamental change that we wanted to start a discussion, even though I know it's hard to have detailed discussions still.

In the envisioned environment, there will be a default segment, and most addresses will be in the default segment. DWARF, user input (lldb cmdline), SB API, and clang expressions are going to be the places where segments are specified; Dump methods and ProcessGDBRemote will be the main places where the segments are displayed/used. There will be modifications to the memory read/write gdb RSP packets to include these.

This early in the project, it's hard to tell what will be upstreamed to the llvm.org monorepo, or when. My personal opinion is that we don't actually want to add segment support to llvm.org lldb at this point. We'd be initializing every address object with LLDB_INVALID_SEGMENT or LLDB_DEFAULT_SEGMENT, and then testing that each object is initialized this way? I don't see this actually being useful.

However, changing lldb's target addresses to be strictly handled in terms of objects will allow us to add a segment discriminator ivar to Address and ProcessAddress on our local branch while this is in development, and minimize the places where we're diverging from the llvm.org sources. We'll need to have local modifications at the places where a segment is input (DWARF, cmdline, SB API, compiler type) or output (Dump, ProcessGDBRemote) and, hopefully, the vast majority of lldb can be unmodified.

The proposal was written in terms of what we need to accomplish based on our current understanding for this project, but I think there will be a lot of details figured out as we get more concrete experience of how this all works. And when it's appropriate to upstream to llvm.org, we'll be better prepared to discuss the tradeoffs of the approaches we took in extending Address/ProcessAddress to incorporate a segment.

My hope is that this generic OO'ification of target addresses will not change lldb beyond moving off of addr_t for now.

I included a couple of inlined comments, but I need to address more of yours & Pavel's notes later, I've been dealing with a few crazy things and am way behind on emails but didn't want to wait any longer to send something out.

We want to support segmented address spaces in LLDB. Currently, all of LLDB’s external API, command line interface, and internals assume that an address in memory can be addressed unambiguously as an addr_t (aka uint64_t). To support a segmented address space we’d need to extend addr_t with a discriminator (an aspace_t) to uniquely identify a location in memory. This RFC outlines what would need to change and how we propose to do that.

### Addresses in LLDB

Currently, LLDB has two ways of representing an address:

- Address object. Mostly represents addresses as Section+offset for a binary image loaded in the Target. An Address in this form can persist across executions, e.g. an address breakpoint in a binary image that loads at a different address every execution. An Address object can represent memory not mapped to a binary image. Heap, stack, jitted items, will all be represented as the uint64_t load address of the object, and cannot persist across multiple executions. You must have the Target object available to get the current load address of an Address object in the current process run. Some parts of lldb do not have a Target available to them, so they require that the Address can be devolved to an addr_t (aka uint64_t) and passed in.
- The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. from a user on the command line) or when interacting with the inferior (reading/writing memory) for addresses that need not persist across runs. Also used when reading DWARF and in our symbol tables to represent file offset addresses, where the size of an Address object would be objectionable.

Correction: LLDB has 3 kinds of uint64_t addresses:
- "file address", which is always mapped to a section + offset if put into an Address object. This value only makes sense to the lldb_private::Module that contains it. The only way to pass this around is as a lldb_private::Address. You can make queries on a file address using "image lookup --address" before you are debugging, but a single file address can result in multiple matches in multiple modules because each module might contain something at this virtual address. This object might be able to be converted to a "load address" if the section is loaded in your debug target. Since the target contains the section load list, the target is needed when converting between Address and addr_t objects.
- "load address", which is guaranteed to be unique in a process with no segments. It can always be put into a lldb_private::Address object, but that object won't always have a section. If there is no section, it means the memory location maps to stack, heap, or other memory that doesn't reside in an object file section. This object might be able to be converted to a section + offset address if the address matches one of the loaded sections in a target. If this can be converted to an Address object that has a section, then it can persist across debug sessions; otherwise it cannot.
- "host address" which is a pointer to memory in the LLDB process itself. Used for storing expression results and other things. You cannot convert this to/from a "file" or "load" address.

Yes, good point, host memory is a third type of address that we use. And our symbol tables, for instance, internally represent themselves as uint64_t offsets into the file or section, I forget which, and we're not talking about changing those uint64_t style addresses. On our project, I do not believe the symbol table will give us segment information.

## Proposal

### Address + ProcessAddress

- The Address object gains a segment discriminator member variable. Everything that creates an Address will need to provide this segment discriminator.

So an interesting thing to think about is if lldb_private::Section object should contain a segment identifier? If this is the case, then an Address object can have a Section that has a segment _and_ the Address object itself might have one that was set from the section as well. It would be good to figure out what the rules are for this case and it might lead to the need for an intelligent accessor that always prefers the section's segment if a section is available. The Address object must have one in case we have a pointer to memory in data and there is no section for this (like any heap addresses).

I don't believe a Section in this project will have a segment. We're looking purely at individual variables, primarily from debug information.

- A ProcessAddress object which is a uint64_t and a segment discriminator as a replacement for addr_t. ProcessAddress objects would not persist across multiple executions. Similar to how you can create an addr_t from an Address+Target today, you can create a ProcessAddress given an Address+Target. When we pass around addr_ts today, they would be replaced with ProcessAddress, with the exception of symbol tables where the added space would be significant, and we do not believe we need segment discriminators today.

Would SegmentedAddress be a more descriptive name here?

A few things I would like to see on ProcessAddress or SegmentedAddress:
- Have a segment definition that says "no segment" like LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT and allow these objects to be constructed with just a lldb::addr_t and the segment gets auto set to LLDB_NO_SEGMENT

- Any code that uses these should test if there is no segment and continue to do what they used to do before
  - like read/write memory in ProcessGDBRemote

To be honest, testing this is going to be one of the tricky things I'm not sure how we'll do. We will have a default segment that addresses will use unless overridden, but how we spot places that *incorrectly* failed to initialize the segment of an Address/ProcessAddress is something we're going to need to figure out.

- Anything that dumps one of these objects should dump just like they used to (just a uint64_t hex representation and no other notation)
- Add code that can convert a "load address" into a ProcessAddress or SegmentedAddress that invent the segment notation and have no changes for targets that don't support segmented address spaces
  - 0x1000 should convert to ProcessAddress where the address is 0x1000 and segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT if the process doesn't support segmented addresses

  - 0x1000 would return an error on conversion for processes that do support segmented addresses as the segment must be specified? Or should there be a default segment if we run into this case?
  - Come up with some quick way to represent segmented addresses for an address of 0x1000 in segment 2: ideas:
    - [2]0x1000
    - {2}0x1000
    - 0x1000[2]
    - 0x1000{2}
    - {0x1000, 2}

To be honest, we haven't thought about the UI side of this very much yet. I think there will be ABI or ArchSpec style information that maps segment numbers to human-understandable names. It's ABI style enumerated numbers - the DWARF will include a number that is passed down to the remote gdb stub.

Hi Greg, Pavel.

I think it’s worth saying that this is very early in this project. We know we’re going to need the ability to track segments on addresses, but honestly a lot of the finer details aren’t clear yet. It’s such a fundamental change that we wanted to start a discussion, even though I know it’s hard to have detailed discussions still.

In the envisioned environment, there will be a default segment, and most addresses will be in the default segment. DWARF, user input (lldb cmdline), SB API, and clang expressions are going to be the places where segments are specified — Dump methods and ProcessGDBRemote will be the main place where the segments are displayed/used. There will be modifications to the memory read/write gdb RSP packets to include these.

This early in the project, it’s hard to tell what will be upstreamed to the llvm.org monorepo, or when. My personal opinion is that we don’t actually want to add segment support to llvm.org lldb at this point. We’d be initializing every address object with LLDB_INVALID_SEGMENT or LLDB_DEFAULT_SEGMENT, and then testing that each object is initialized this way? I don’t see this actually being useful.

However, changing lldb’s target addresses to be strictly handled in terms of objects will allow us to add a segment discriminator ivar to Address and ProcessAddress on our local branch while this is in development, and minimize the places where we’re diverging from the llvm.org sources. We’ll need to have local modifications at the places where a segment is input (DWARF, cmdline, SB API, compiler type) or output (Dump, ProcessGDBRemote) and, hopefully, the vast majority of lldb can be unmodified.

The proposal was written in terms of what we need to accomplish based on our current understanding for this project, but I think there will be a lot of details figured out as we get more concrete experience of how this all works. And when it’s appropriate to upstream to llvm.org, we’ll be better prepared to discuss the tradeoffs of the approaches we took in extending Address/ProcessAddress to incorporate a segment.

My hope is that this generic OO’ification of target addresses will not change lldb beyond moving off of addr_t for now.

I included a couple of inlined comments, but I need to address more of yours & Pavel’s notes later. I’ve been dealing with a few crazy things and am way behind on emails, but didn’t want to wait any longer to send something out.

No worries! I would vote to upstream as much as possible as soon as possible to avoid differences and merging issues for you guys. I would really like to see LLDB have support for segmented address spaces. Many comments I made were just my thinking out loud and trying to ease the changes in with as little disruption as possible.

We want to support segmented address spaces in LLDB. Currently, all of LLDB’s external API, command line interface, and internals assume that an address in memory can be addressed unambiguously as an addr_t (aka uint64_t). To support a segmented address space we’d need to extend addr_t with a discriminator (an aspace_t) to uniquely identify a location in memory. This RFC outlines what would need to change and how we propose to do that.

Addresses in LLDB

Currently, LLDB has two ways of representing an address:

  • Address object. Mostly represents addresses as Section+offset for a binary image loaded in the Target. An Address in this form can persist across executions, e.g. an address breakpoint in a binary image that loads at a different address every execution. An Address object can represent memory not mapped to a binary image. Heap, stack, jitted items, will all be represented as the uint64_t load address of the object, and cannot persist across multiple executions. You must have the Target object available to get the current load address of an Address object in the current process run. Some parts of lldb do not have a Target available to them, so they require that the Address can be devolved to an addr_t (aka uint64_t) and passed in.
  • The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. from a user on the command line) or when interacting with the inferior (reading/writing memory) for addresses that need not persist across runs. Also used when reading DWARF and in our symbol tables to represent file offset addresses, where the size of an Address object would be objectionable.

Correction: LLDB has 3 kinds of uint64_t addresses:

  • “file address” which are always mapped to a section + offset if put into an Address object. This value only makes sense to the lldb_private::Module that contains it. The only way to pass this around is as a lldb_private::Address. You can make queries on a file address using “image lookup --address” before you are debugging, but a single file address can result in multiple matches in multiple modules because each module might contain something at this virtual address. This object might be able to be converted to a “load address” if the section is loaded in your debug target. Since the target contains the section load list, the target is needed when converting between Address and addr_t objects.
  • “load address” which is guaranteed to be unique in a process with no segments. It can always be put into a lldb_private::Address object, but that object won’t always have a section. If there is no section, it means the memory location maps to stack, heap, or other memory that doesn’t reside in an object file section. This object might be able to be converted to a section + offset address if the address matches one of the loaded sections in a target. If this can be converted to an Address object that has a section, then it can persist across debug sessions, otherwise, not.
  • “host address” which is a pointer to memory in the LLDB process itself. Used for storing expression results and other things. You cannot convert this to/from a “file” or “load” address.

Yes, good point, host memory is a third type of address that we use. And our symbol tables, for instance, internally represent themselves as uint64_t offsets into the file or section, I forget which, and we’re not talking about changing those uint64_t style addresses. On our project, I do not believe the symbol table will give us segment information.

You should be able to classify symbols from the symbol table to a segment though right?

We could add a special define for host addresses if needed.

Proposal

Address + ProcessAddress

  • The Address object gains a segment discriminator member variable. Everything that creates an Address will need to provide this segment discriminator.

So an interesting thing to think about is if lldb_private::Section object should contain a segment identifier? If this is the case, then an Address object can have a Section that has a segment and the Address object itself might have one that was set from the section as well. It would be good to figure out what the rules are for this case and it might lead to the need for an intelligent accessor that always prefers the section’s segment if a section is available. The Address object must have one in case we have a pointer to memory in data and there is no section for this (like any heap addresses).
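As a rough illustration of the "intelligent accessor" idea, here is a minimal sketch. It assumes a Section could carry a segment id, which is exactly the open question; `SketchSection`, `SketchAddress`, `segment_t` and `kNoSegment` are hypothetical stand-ins, not existing LLDB types.

```cpp
#include <cassert>
#include <cstdint>

using segment_t = std::uint32_t;              // hypothetical discriminator type
constexpr segment_t kNoSegment = UINT32_MAX;  // stand-in for LLDB_NO_SEGMENT

// Simplified stand-in for lldb_private::Section, assuming it had a segment.
struct SketchSection {
  explicit SketchSection(segment_t seg) : segment(seg) {}
  segment_t segment;
};

// Simplified stand-in for lldb_private::Address.
class SketchAddress {
public:
  SketchAddress(const SketchSection *section, std::uint64_t offset,
                segment_t own_segment = kNoSegment)
      : m_section(section), m_offset(offset), m_segment(own_segment) {}

  // Prefer the section's segment when a section is available and has one;
  // otherwise fall back to the segment stored on the address itself
  // (the heap/stack case, where there is no section).
  segment_t GetSegment() const {
    if (m_section && m_section->segment != kNoSegment)
      return m_section->segment;
    return m_segment;
  }

  std::uint64_t GetOffset() const { return m_offset; }

private:
  const SketchSection *m_section;
  std::uint64_t m_offset;
  segment_t m_segment;
};

// Usage example: a section-backed address and a heap address.
inline void ExampleSegmentLookup() {
  SketchSection data(2);
  SketchAddress global_var(&data, 0x10, /*own_segment=*/7);
  assert(global_var.GetSegment() == 2);  // section wins
  SketchAddress heap_ptr(nullptr, 0x1000, /*own_segment=*/7);
  assert(heap_ptr.GetSegment() == 7);    // no section, own segment used
}
```

With a rule like this, the two copies of the segment can never disagree from the caller's point of view, at the cost of making the Address's own segment meaningful only when there is no section.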

I don’t believe a Section in this project will have a segment. We’re looking purely at individual variables, primarily from debug information.

So if you have a global variable, it will have a symbol right? And it will have debug info. Are you saying that only the debug info would have segment info? It seems important to be able to view a global variable without debug info.

  • A ProcessAddress object which is a uint64_t and a segment discriminator as a replacement for addr_t. ProcessAddress objects would not persist across multiple executions. Similar to how you can create an addr_t from an Address+Target today, you can create a ProcessAddress given an Address+Target. When we pass around addr_ts today, they would be replaced with ProcessAddress, with the exception of symbol tables where the added space would be significant, and we do not believe we need segment discriminators today.

Would SegmentedAddress be a more descriptive name here?

A few things I would like to see on ProcessAddress or SegmentedAddress:

  • Have a segment definition that says “no segment” like LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT and allow these objects to be constructed with just an lldb::addr_t and the segment gets auto set to LLDB_NO_SEGMENT
  • Any code that uses these should test if there is no segment and continue to do what they used to do before
      • like read/write memory in ProcessGDBRemote
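The two requests above could look roughly like the following. This is only a sketch under assumptions: the class layout, `segment_t`, `kNoSegment` (standing in for LLDB_NO_SEGMENT) and `ChooseReadPath` are invented for illustration and are not part of LLDB.

```cpp
#include <cassert>
#include <cstdint>

using addr_t = std::uint64_t;
using segment_t = std::uint32_t;              // hypothetical discriminator type
constexpr segment_t kNoSegment = UINT32_MAX;  // stand-in for LLDB_NO_SEGMENT

class ProcessAddress {
public:
  // Deliberately implicit: a bare addr_t converts to "address, no segment",
  // so existing call sites could keep passing plain integers.
  ProcessAddress(addr_t addr, segment_t segment = kNoSegment)
      : m_addr(addr), m_segment(segment) {}

  addr_t GetAddress() const { return m_addr; }
  segment_t GetSegment() const { return m_segment; }
  bool HasSegment() const { return m_segment != kNoSegment; }

private:
  addr_t m_addr;
  segment_t m_segment;
};

// Stand-in for a memory-read entry point such as the one in ProcessGDBRemote:
// with no segment, behave exactly as before.
enum class ReadPath { Legacy, Segmented };

inline ReadPath ChooseReadPath(const ProcessAddress &pa) {
  return pa.HasSegment() ? ReadPath::Segmented : ReadPath::Legacy;
}

// Usage example.
inline void ExampleReadPaths() {
  ProcessAddress legacy = 0x1000;      // segment defaults to kNoSegment
  ProcessAddress in_seg(0x1000, 2);
  assert(ChooseReadPath(legacy) == ReadPath::Legacy);
  assert(ChooseReadPath(in_seg) == ReadPath::Segmented);
}
```

The implicit constructor trades some type safety for a smaller diff. An explicit constructor would make missed conversions visible at compile time instead.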

To be honest, testing this is going to be one of the tricky things I’m not sure how we’ll do. We will have a default segment that addresses will use unless overridden, but how we spot places that incorrectly failed to initialize the segment of an Address/ProcessAddress is something we’re going to need to figure out.

Tests I can think of:

  • read function disassembly from a code segment that would have the same address as something from a data segment
  • read a variable from a data segment that would have the same address as something from a code segment
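To make the shape of those two tests concrete, here is a toy model in which memory is keyed by (segment, address). It is a sketch only: `FakeSegmentedMemory` and the segment numbers 0 (code) and 1 (data) are assumptions for illustration, not how LLDB or a gdb stub represents memory.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Toy inferior memory: the same numeric address may exist in several segments.
class FakeSegmentedMemory {
public:
  void Write(std::uint32_t segment, std::uint64_t addr, std::uint8_t byte) {
    m_bytes[std::make_pair(segment, addr)] = byte;
  }

  // Returns false if nothing is mapped at (segment, addr).
  bool Read(std::uint32_t segment, std::uint64_t addr,
            std::uint8_t &out) const {
    auto it = m_bytes.find(std::make_pair(segment, addr));
    if (it == m_bytes.end())
      return false;
    out = it->second;
    return true;
  }

private:
  std::map<std::pair<std::uint32_t, std::uint64_t>, std::uint8_t> m_bytes;
};

// The check both tests boil down to: 0x1000 in the code segment and 0x1000
// in the data segment must not alias each other.
inline void CheckAliasedAddressIsDistinct() {
  const std::uint32_t kCode = 0, kData = 1;  // hypothetical segment numbers
  FakeSegmentedMemory mem;
  mem.Write(kCode, 0x1000, 0x90);  // pretend instruction byte
  mem.Write(kData, 0x1000, 0x2a);  // pretend variable value
  std::uint8_t byte = 0;
  assert(mem.Read(kCode, 0x1000, byte) && byte == 0x90);
  assert(mem.Read(kData, 0x1000, byte) && byte == 0x2a);
}
```

A code path that dropped the segment somewhere between the debug info and the memory read would return the same byte for both reads, which is the failure these tests are meant to catch.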

It would be good to figure out where segments are going to come from. I would hope that some sections would be able to be mapped to certain segments so that we can live with a binary that has no debug info and still read say a global variable. I know we can do things right inside of the debug info since DWARF can have segment information.

  • Anything that dumps one of these objects should dump just like they used to (just a uint64_t hex representation and no other notation)
  • Add code that can convert a “load address” into a ProcessAddress or SegmentedAddress that invent the segment notation and have no changes for targets that don’t support segmented address spaces
      • 0x1000 should convert to ProcessAddress where the address is 0x1000 and segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT if the process doesn’t support segmented addresses
      • 0x1000 would return an error on conversion for processes that do support segmented addresses as the segment must be specified? Or should there be a default segment if we run into this case?
  • Come up with some quick way to represent segmented addresses for an address of 0x1000 in segment 2: ideas:
      • [2]0x1000
      • {2}0x1000
      • 0x1000[2]
      • 0x1000{2}
      • {0x1000, 2}
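For the dumping side, a small sketch of one of the candidate notations. The choice of `0x1000[2]` is arbitrary, and `FormatAddress` and `kNoSegment` are hypothetical names. The only point is that an address without a segment prints exactly as it does today.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

constexpr std::uint32_t kNoSegment = UINT32_MAX;  // stand-in for LLDB_NO_SEGMENT

// Formats an address as plain hex when there is no segment, and as
// "0x<addr>[<segment>]" otherwise.
inline std::string FormatAddress(std::uint64_t addr,
                                 std::uint32_t segment = kNoSegment) {
  char buf[48];  // worst case is "0x" + 16 hex digits + "[" + 10 digits + "]"
  int n;
  if (segment == kNoSegment)
    n = std::snprintf(buf, sizeof(buf), "0x%llx",
                      static_cast<unsigned long long>(addr));
  else
    n = std::snprintf(buf, sizeof(buf), "0x%llx[%u]",
                      static_cast<unsigned long long>(addr),
                      static_cast<unsigned>(segment));
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof(buf));
  (void)n;  // silence unused-variable warnings when asserts are compiled out
  return std::string(buf);
}
```

Under these assumptions, `FormatAddress(0x1000)` yields `0x1000` and `FormatAddress(0x1000, 2)` yields `0x1000[2]`.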

To be honest, we haven’t thought about the UI side of this very much yet. I think there will be ABI or ArchSpec style information that maps segment numbers to human-understandable names. It’s ABI style enumerated numbers - the DWARF will include a number that is passed down to the remote gdb stub.

Should be fine to have named segments if needed, but we will need to come up with a way to specify a segment. We could always add new arguments to command line commands if needed.

I think that wrapping addr_t inside a class would be a nice change, even without the subsequent segmentification -- I'm hoping that this would add some type safety to the way we work with addresses (as we have various kinds of addresses that are all just plain ints). I'd like to see a concrete proposal for this class's interface though. (And I still remain mildly sceptical about automating this transition.)
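One possible shape for such a class, shown only to make the type-safety point concrete: a tagged wrapper so that the different kinds of address cannot be mixed up silently. All names here (`TypedAddress`, `FileAddress`, `LoadAddress`, `HostAddress`, `ResolveLoadAddress`) are hypothetical, and the single `slide` value is a stand-in for the target's section load list.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// A thin wrapper around uint64_t whose type records what kind of address
// it holds. The Tag parameter is never instantiated.
template <typename Tag> class TypedAddress {
public:
  explicit constexpr TypedAddress(std::uint64_t value) : m_value(value) {}
  constexpr std::uint64_t GetValue() const { return m_value; }

private:
  std::uint64_t m_value;
};

struct FileTag {};
struct LoadTag {};
struct HostTag {};
using FileAddress = TypedAddress<FileTag>;
using LoadAddress = TypedAddress<LoadTag>;
using HostAddress = TypedAddress<HostTag>;

// The kinds do not convert to each other, and plain integers do not
// convert implicitly either.
static_assert(!std::is_convertible<FileAddress, LoadAddress>::value,
              "file and load addresses must not mix implicitly");
static_assert(!std::is_convertible<std::uint64_t, LoadAddress>::value,
              "plain integers must be wrapped explicitly");

// Converting between kinds has to go through a named function that has
// the information needed to do it.
inline LoadAddress ResolveLoadAddress(FileAddress file_addr,
                                      std::uint64_t slide) {
  assert(file_addr.GetValue() <= UINT64_MAX - slide &&
         "load address would overflow");
  return LoadAddress(file_addr.GetValue() + slide);
}
```

For example, `ResolveLoadAddress(FileAddress(0x1000), 0x100000)` gives a `LoadAddress` of 0x101000, while passing a `HostAddress` to it would fail to compile.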

To be honest, we haven't thought about the UI side of this very much yet. I think there will be ABI or ArchSpec style information that maps segment numbers to human-understandable names.

The details of this are pretty interesting for the Wasm use case, as it does not have a fixed number of segments/address spaces -- every module gets its own address space. I suppose the Wasm ArchSpec could just say it has UINT32_MAX address spaces, and then the dynamic loader would just assign modules into address spaces based on some key.

The interesting aspect here would be that the DWARF does *not* contain address space information here (as it's all in the same address space), so there may need to be a way for it to say "I don't actually know my address space -- I'll go wherever the dynamic loader puts me".
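A sketch of what "the dynamic loader assigns modules into address spaces based on some key" might look like. The `AddressSpaceTable` class and the use of a string as the module key are assumptions for illustration. Nothing here reflects an existing LLDB or Wasm plugin interface.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hands out one address space id per module, in load order. Addresses read
// from a module's DWARF carry no address space. They would pick up whatever
// id the loader assigned to their module.
class AddressSpaceTable {
public:
  std::uint32_t GetOrAssign(const std::string &module_key) {
    auto it = m_ids.find(module_key);
    if (it != m_ids.end())
      return it->second;  // already loaded: keep its address space
    assert(m_next != UINT32_MAX && "address space ids exhausted");
    const std::uint32_t id = m_next++;
    m_ids.emplace(module_key, id);
    return id;
  }

private:
  std::map<std::string, std::uint32_t> m_ids;
  std::uint32_t m_next = 0;
};

// Usage example: two modules get distinct spaces, lookups are stable.
inline void ExampleModuleAssignment() {
  AddressSpaceTable table;
  const std::uint32_t a = table.GetOrAssign("a.wasm");
  const std::uint32_t b = table.GetOrAssign("b.wasm");
  assert(a != b);
  assert(table.GetOrAssign("a.wasm") == a);
}
```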

Still pretty early to determine that, but I'm mentioning this as it is the last use case of someone needing address space support in lldb (even though it's a slightly stranger form of address spaces).

pl

Hi all,

Just for the record, we have successfully implemented the wrapping of addr_t into a class to support multiple address spaces. The info about the address space is stored in the ELF file, so we get the info from the ELF parser and then pass it to the rest of the system. The CLI/MI interface has been extended as well, so the user can select which address space they want for memory printing. Similarly, we patched expression evaluation, disassembler, etc.

If the address wrap is part of the upstream version, it will be awesome :-)...

Best regards.

That's really interesting, I'm excited to hear that this is feasible and has been done before. Is this code available publicly and/or is this something you'd be willing to upstream (with our help)?

The code is not available in a public repository, but we're more than happy to work on/help with the new proposal and share our findings or code snippets.