RFC: a new vectorized memory read packet

This is an RFC describing a new type of gdb-remote protocol packet, which
enables a debugger to request multiple memory reads from the inferior at once.

The main motivation is performance: if a debugger knows it will need to read
memory from multiple addresses, it can reduce the number of packets
communicated with the remote server by requesting all reads at once,
effectively reducing the number of packets communicated from 2 * num_reads to
two (the request and the reply). In high-bandwidth but high-latency networks,
this is a major speed-up.

We’ve found that, when debugging Swift programs using Swift’s concurrency
runtime, LLDB has to read up to two memory addresses per OS thread on every
stop. As such, on highly concurrent systems, latency for these reads can create
a visible slowdown in the debugging experience. This is particularly concerning
when doing step operations, where the debugger may privately stop the inferior
multiple times before a public stop is reached.

The new packet

Name: MultiMemRead
Request syntax: MultiMemRead:ranges:<base_addr1>,<num_bytes1>;<base_addr2>,<num_bytes2>;...;
Reply syntax: <num_bytes_read1>,<bytes1>;<num_bytes_read2>,<bytes2>;...;

Each range supplies a base address to read from and a number of bytes to
read. For each such range, the server replies with the number of bytes it
was able to read, followed by the bytes read as binary data. Escaping of
special characters (# $ } *) is handled as in the x packet.

There are two kinds of errors that must be handled by the packet:

  1. An error is returned for the entire packet; for example, the server does not
    support this packet.
  2. An error is returned for specific memory addresses; for example, if the
    server was unable to read memory from the second address provided.

The first kind can be represented by returning an E packet.

The second kind can be represented by returning zero for the number of bytes
read, for each address whose read operation failed.

Example

send: MultiMemRead:0x100200,4;0x200200,2;0x400000,4;
receive: 4,<binary encoding of abcd1000>;0,;2,<binary encoding of eeff>;

The four bytes read from memory address 0x100200 were 0xabcd1000. No bytes
were read from 0x200200. Only two out of the four bytes requested from
0x400000 were read, and they were 0xeeff.

Question: Should the response be given as hexadecimal characters?

This seems wasteful, but are there benefits?

Question: Does the packet need options?

Would there be any use for an options = <something> component to the request?

Question: Should the packet allow for more expressive errors per address?

Instead of returning “zero bytes were read”, should the response contain an E
code in place of the <num_bytes>,<bytes> component of the reply?

LLDB API changes

When the inferior process is stopped and LLDB has to perform a memory read, the
existing Process::ReadMemory virtual method is responsible for checking a
per-stop memory cache before calling the abstract Process::DoReadMemory
method. If this address has not been read before, a large chunk of memory
around the address is read at once and saved in the cache, with the expectation
that nearby addresses may be requested by the debugger in this same stop.

Because caching is done in 512-byte chunks, routing MultiMemRead requests
through the caching layer would make the reply packets exceedingly large. As such,
this RFC proposes bypassing the caching layer for such packets in an initial
implementation, leaving the door open for improvements in the future.

As a reminder, the signature of Process::ReadMemory is:

virtual size_t ReadMemory(
  lldb::addr_t vm_addr,
  void *buf,
  size_t size,
  Status &error);

The method is virtual because Process implementations can bypass memory
caching by overriding this method.

To support a MultiMemRead packet implementation, a new ReadMemoryVectorized
method is proposed, which would not query the memory cache:

virtual Expected<SmallVector<size_t>> ReadMemoryVectorized(
  ArrayRef<lldb::addr_t> vm_addrs,
  ArrayRef<size_t> read_sizes,
  MutableArrayRef<uint8_t> buf);

Where:

  • The vm_addrs and read_sizes arguments map one-to-one to the
    base_addr and num_bytes components of MultiMemRead.
  • The length of buffer buf must be at least the sum of the sizes in
    read_sizes.
  • The return value is a sequence of lengths, corresponding to how many bytes
    were read per requested address. If the overall request failed, an error is
    returned instead.

This API reflects the previous design choice of not allowing per-address
errors, instead relying on the number of bytes read to convey failures at a
per-address level. In a way, this has similar semantics to the existing API for
ReadMemory, where programmers are expected to check the Status out-param
and only then check the number of bytes read. In the new API, programmers are
required to check the error before checking the size; this is enforced by
Expected semantics.

A default implementation of ReadMemoryVectorized is provided, which
constructs individual memory read requests without using the new packet kind.
If any request fails and produces an Error object (a Status), the default
implementation can either return an Error for the Expected component of the
API – effectively failing all reads – or it can drop the individual
Error, and set the number of bytes read for the failing address to zero,
while still allowing for the other addresses to be read successfully. The
latter seems more desirable, but feedback is requested.

Concrete Process implementations should override this method if they may
support MultiMemRead. For example, a GDBRemote implementation could look like
this:

Expected<SmallVector<size_t>> ProcessGDBRemote::ReadMemoryVectorized(...) {
  if (!m_gdb_comm.GetMultiMemReadSupported())  // a new query
    return Process::ReadMemoryVectorized(...);
  // use the new packet
}

The implementation should be responsible for breaking up requests that may
generate packets considered too large.

Question: a single buffer, or multiple buffers?

In the API for ReadMemoryVectorized, the proposal uses a single output buffer
for all the results: MutableArrayRef<uint8_t>. Should it instead use one
buffer per memory read, e.g., ArrayRef<MutableArrayRef<uint8_t>>?


Is there an advantage to having the input be two parallel arrays rather than an array of little structs with the address & requested size? I think it would be less error-prone on the client side to return the results in a vector of buffers, or even in a vector of size/buffer structs? You could even return a class that holds the return buffer internally, and hands out the elements. That way you would still have an efficient store but users wouldn’t have to hand parse it.

I don’t see the need to complicate the design by sending errors back, after all, if you really wanted to print the particular error you could just ask for that bit of memory on its own and get the error that way. Plus gdb-remote protocol errors are underspecified anyway, and aren’t that useful.

For errors, I think something other than just a 0 would be good. “0,Exx” maybe. But unlike the typical Exx, define what these mean, and let lldb spit out a message saying what happened. (which Jim talked about above, as I was writing).

The options sounds interesting. As someone who comes from an embedded/JTAG background, I was able to do memory reads with options like “virtual, physical, cached, uncached”, and other things like Harvard architecture memory spaces. These things aren’t very useful for a non-embedded user space program, but if you’re debugging a system through JTAG you might want to specify how to read/write memory. We don’t support that now, but there have been proposals over the years.

> For errors, I think something other than just a 0 would be good. “0,Exx” maybe. But unlike the typical Exx, define what these mean, and let lldb spit out a message saying what happened. (which Jim talked about above, as I was writing).

Maybe I’m reading this incorrectly, but I thought Jim was suggesting the opposite of what you suggested. I.e., Jim was saying we don’t want to complicate the design by sending back individual Exx codes, instead letting clients re-request the failing address through an isolated x packet if the client is really interested in the error.

> Is there an advantage to having the input be two parallel arrays rather than an array of little structs with the address & requested size?

Yes, we have many different containers for <base_addr, size> in lldb, so many to choose from. VMRange, AddressRange/AddressRangeListImpl, Range<addr_t,addr_t> (this one is particularly popular) just to name a few. I would not have two separate argument arrays in the internal, private API.

> I don’t see the need to complicate the design by sending errors back, after all, if you really wanted to print the particular error you could just ask for that bit of memory on its own and get the error that way. Plus gdb-remote protocol errors are underspecified anyway, and aren’t that useful.

Agreed: either there is an error response to the entire packet (no inferior process is connected, or the input packet was ill-formed, e.g. an odd number of arguments in the addr+size list), or there are bytes read.

Memory read packets can always result in reading fewer bytes than requested. If you read off the end of a mapped page of memory, you may not be able to read all bytes requested. It is an accepted response that fewer bytes than requested are read, and it’s unlikely debugserver will be able to categorize the failure in more detail than that.

A few suggestions on the packet design (I think this proposal is a little extra complicated because we’re discussing two unrelated things, the internal private API and the gdb remote serial protocol packet).

I would not prefix the addresses with “0x”. I dislike the fact that gdb remote protocol has implicit base values for numbers, and the extra characters don’t really matter, but for consistency I would have it be base16 without a 0x prefix.

> Question: Should the response be given as hexadecimal characters?

You mean asciihex return values here. I would not have this as an option. It was prevalent in the original gdb remote serial protocol, as seen by our friends m/M packets, but it is disfavored now that we are a bit more concerned about performance and we can assume 8-bit clean communications, e.g. see the x/X packet.

> Reply syntax: <num_bytes_read1>,<bytes1>;<num_bytes_read2>,<bytes2>;...;

I would personally list the bytes successfully read (specify the base for this number in the packet description - base16 or base10) as a list, followed by the bytes of all the read memory. Like

<num_bytes_read1>,<num_bytes_read2>;<bytes1><bytes2>

primarily for ease of understanding if a human is reading the packet and wants to know which read requests got all their bytes. Just my opinion on this one; there’s no real difference or benefit beyond aesthetics.

Would it be helpful to support dependent reads? Meaning, the result of one memory read is the input of a subsequent memory read.

MultiMemRead:0x100200,4,\1,4,...

Stealing regex back-reference syntax, the \1 is a reference to the result of the first memory read (0x100200). I’m not suggesting this specific syntax, I’m only posing the idea of supporting dependent reads.

I thought about this, but the challenge is that doing only back references would be of limited use. For example, in the use cases that I have in mind, I would need some kind of “offset and then dereference” operation. So then we get to the question of: should we bake some expression evaluation capability into the packet? And this can get complicated fairly quickly…

I don’t think the current specification closes the door for that possibility – we could allow for non-address expressions in the future at the cost of an extra “are you able to do this” packet – but don’t think we have enough use cases to motivate that right now.

This was the inevitable next question, but I didn’t want to get ahead of myself. It sounds like it’s somewhat desirable, but the question of complexity vs. value isn’t concrete enough yet.

Are you saying it would be a new packet in the future? ie not MultiMemRead?

> Are you saying it would be a new packet in the future? ie not MultiMemRead?

No, it would be the same packet. But at the start of the session, the debugger asks the server what sorts of capabilities the server has. If we enhance MultiMemRead, we could have a query for “do you know how to do expressions in MultiMemRead”.

Thank you all for the feedback!

(I should also mention that Jason helped a lot coming up with the ideas both here and in the original proposal)

The first private API proposal was mostly trying to match existing style, but
I’m glad to hear there is support for something a bit more modern. I’ve
included an updated proposal below (the original post is intact) with these
changes:

  • Removed the 0x address prefixes.
  • Specified base-16 for all addresses and numbers of bytes.
  • Reordered reply specification so that all numbers of bytes read come first.
  • Included an “options” tag.
  • Made the private API nicer to use.
    • Decided to use a Range<addr_t, size_t>.
    • AddressRange is designed to be used with Sections, which is not
      applicable here.
    • VMRange seems to be functionally the same as a Range.

Updated packet specification:

Name: MultiMemRead
Request syntax: MultiMemRead:options:<options>;ranges:<base_addr1>,<num_bytes1>;<base_addr2>,<num_bytes2>;...;
Reply syntax: <num_bytes_read1>,<num_bytes_read2>,...;<bytes1><bytes2>...;

Where:

  • <options> is an arbitrary string of options. The options:<options> field
    is optional; if present, it may have an empty <options> component.
  • <base_addr>, <num_bytes> and <num_bytes_read> are given in base-16.
  • An error code is also an acceptable reply.

Example:

send: MultiMemRead:ranges:100a00,4;200200,a0;400000,4;
receive: 4,0,2;<binary encoding of abcd1000><binary encoding of eeff>;

In the example above, the first read produced abcd1000, the read of a0 bytes from address 200200 failed and the third read produced two bytes – eeff – out of the four requested.

Updated private API specification:

struct VectorizedReadResult {
  MutableArrayRef<uint8_t> getResult(unsigned index) const;
  unsigned getNumResults() const;
};

virtual Expected<VectorizedReadResult> ReadMemoryVectorized(
  ArrayRef<Range<addr_t, size_t>> vm_addrs,
  MutableArrayRef<uint8_t> buf);

After chatting with Dave, we could also make this packet more general in the following way. Instead of:

ranges:<base_addr1>,<num_bytes1>;...

We could have an optional list of offsets:

ranges:<base_addr1>,[<offset>,]*<num_bytes1>;...

Each offset would imply an extra pointer-sized memory read.

For example:

ranges:10000,2,-d,100;20000,200

Would return two values:

  1. 0x100 bytes of memory read from address -d + *(2 + *10000)
  2. 0x200 bytes of memory read from address 20000

(the * operator here means “pointer-sized dereference”)

This would allow users to combine multiple memory reads into a single operation.

What do you think?

I don’t understand what any of that means. Are you trying to combine operations of (1) read a pointer-sized value from memory, (2) add an offset to that, (3) read n bytes from addr+offset? This seems like a very specific capability being hardcoded into a packet design for a single use. I’m open to it if we need it for a performance reason, but do we? I also am not super into having debugserver know how big a pointer is implicitly; for instance, with arm64_32 pointers will be 32 bits even though the process is 64-bit. debugserver can tell whether the process is using the arm64_32 ABI (probably by inspecting the binary), but it’s an assumption we’ve never hardcoded into the gdb remote protocol before.

Yes, the intent was to combine multiple reads into a single packet. For example, if you know you’re going to read 10 addresses from memory, add offsets to each of them, and then use the result to read another 10 addresses (this is actually what lldb does for Swift).

That said, if debugserver doesn’t know what the pointer size is, or if it is not a common assumption we make, I’m not too attached to that generalization and would be happy to drop it. There are some other implementation details that would make such a generalization hard to adopt for the use case I had in mind.

Is this something we need to do once, or every time at private stops? The goal is to fetch everything we need for swift task tracking in a single packet at private stops. If we need to do an extra read once per thread, that’s not a lot of packets, right, it’s just going to be done once per thread. The perf problems come from things done once per thread per private stop.

I feel like we’re getting close to “send a DWARF expression down to debugserver to evaluate” territory, but limiting it to specific operations hardcoded in the packet design.

> Is this something we need to do once, or every time at private stops?

Every time. The offsets idea was just a mechanism to bring the number of packets down to 1 MultiMemRead instead of 2.

> I feel like we’re getting close to “send a DWARF expression down to debugserver to evaluate”

Yes, every time I have a conversation about this, it always converges to “this is the DWARF expression stack”.

In order to avoid getting too sidetracked, let’s focus on the amended proposal (RFC: a new vectorized memory read packet - #12 by Felipe) and discard the offset idea; it would only save a packet anyway, and I think there are too many downsides to it.

I apologize if I was unclear. Jim and I were both talking about Exx codes not meaning anything specific in the protocol. I meant to say that we should standardize what they mean in this instance (if we include errors).