AArch64 watchpoints - reported address outside watched range; adopting MASK style watchpoints

There’s a longstanding problem with AArch64 watchpoints (possibly on other targets too, but I see it with this target in particular): you watch 4 bytes, say 0x100c - 0x100f, and something does a 16-byte STP write to 0x1000. The FAR register has the value 0x1000, the start of the write, and lldb doesn’t correctly associate the watchpoint hit with our watchpoint at 0x100c; it won’t disable the watchpoint, instruction step, re-enable the watchpoint, and report the changed value (for a ‘write’ watchpoint). For this, I’d need to add some target-specific WatchpointAddressVagueness, where a watchpoint exception address within some byte range of a known watchpoint is attributed to that watchpoint. I can’t remember if you can write an entire 128 byte neon register to memory in a single instruction, but that would probably be the correct size on this target.
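
Roughly what I have in mind, as a minimal sketch (WatchpointAddressVagueness / WatchpointMatches are hypothetical names, not existing lldb API, and the 128 value is just for illustration):

  #include <cstdint>

  struct WatchpointRange {
    uint64_t addr; // user-requested start address
    uint64_t size; // user-requested length in bytes
  };

  // Target-specific constant: the farthest a single store can begin before a
  // watched range and still touch it.
  constexpr uint64_t kWatchpointAddressVagueness = 128;

  // Attribute an exception address (FAR) to this watchpoint if it lands inside
  // the watched range, or within the vagueness window just before it.
  bool WatchpointMatches(uint64_t far, const WatchpointRange &wp) {
    uint64_t lo = wp.addr >= kWatchpointAddressVagueness
                      ? wp.addr - kWatchpointAddressVagueness
                      : 0;
    return far >= lo && far < wp.addr + wp.size;
  }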

Second related topic, I’m switching debugserver from using Byte Address Select (BAS) watchpoints on AArch64 (which can watch any bytes within an aligned doubleword) to using MASK watchpoints - which can watch power of 2 regions of memory from 8 bytes to 2GB - and now I’ve got the problem that an exception within that power of 2 region may be touching my watched region or not. e.g. I watch 8 bytes 0x100c - 0x1013. This requires an 8 byte watchpoint at 0x1008 and an 8 byte watchpoint at 0x1010. If something writes to/accesses 0x1008, my mask watchpoint will be hit and now lldb needs to (1) understand that this is associated with this watchpoint, and (2) decide whether to notify the user about it or not.

(An aside: I have to use two watchpoints for this unaligned region, or I have to use a single 32-byte watchpoint at 0x1000. You can come up with example unaligned buffers that quickly require quite large mask watchpoints to cover with one watchpoint, e.g. 24 bytes at 0x10f0 would need a 512 byte watchpoint at 0x1000 if I did the bits correctly just now. Or I can do it with a 16-byte watchpoint at 0x10f0 and an 8 byte watchpoint at 0x1100.)
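
For concreteness, here is a rough sketch of the alignment math involved – not the actual debugserver code, and SmallestMaskRegion is a made-up helper name:

  #include <cstdint>
  #include <utility>

  // Smallest power of 2 sized, naturally aligned region containing the
  // requested (addr, size); AArch64 MASK watchpoints start at 8 bytes.
  std::pair<uint64_t, uint64_t> SmallestMaskRegion(uint64_t addr,
                                                   uint64_t size) {
    uint64_t region = 8;
    while (true) {
      uint64_t base = addr & ~(region - 1); // align down to the region size
      if (addr + size <= base + region)
        return {base, region}; // the requested bytes fit entirely inside
      region <<= 1;            // otherwise double the region and retry
    }
  }

  // SmallestMaskRegion(0x100c, 8)  -> {0x1000, 32}
  // SmallestMaskRegion(0x10f0, 24) -> {0x1000, 512}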

At first, I thought “well, a write watchpoint which doesn’t change the watched value is a private stop that we don’t tell the user about, right?”, but that is not correct: lldb shows every write to that memory even if the value in memory is unchanged. If the behavior were “only report a public stop when the watched memory value has changed”, then I could silently sweep up the “watchpoint hits” which are actually accesses to the region I’m mask watching but don’t touch the bytes the user asked to watch. Or possibly, if I know the target’s maximum watchpoint vagueness value from paragraph 1, and the FAR exception address is within the region covered by the mask watchpoint but further from my watched byte range than the watchpoint vagueness, I can continue silently.

I’ve got the mask watchpoints working in debugserver and am trying to figure out how best to cleanly handle these follow-on issues from using this mechanism on AArch64, and wanted to see if there are strong opinions. I think we have:

  1. I want a target-specific watchpoint exception address vagueness value to be available: the maximum number of bytes before a watched memory region at which an access can begin and still touch the watched region, and therefore should be associated with that watchpoint.

  2. A setting for whether a write watchpoint should silently continue if the watched memory is unmodified? I’m probably going to default this to “silently continue” on Darwin when we’re using mask watchpoints. It doesn’t help read watchpoints, but I think those are a lot less common. I’m sure there are people/use cases where they want to stop on any write, whether it wrote the same value or not (and I’d prefer to preserve that behavior, but with mask watchpoints I’m struggling to find any way to distinguish a modification of the actual watched bytes from an access elsewhere in the mask address range).

  3. I wonder if debugserver shouldn’t report the start address & length of memory that is actually being watched. Z2-4 only return “OK”, “”, or “Exx” according to the gdb remote serial protocol docs; I wonder about adding something after the ‘OK’ (a hypothetical exchange is sketched after this list). Or possibly a feature request that lldb could make at the beginning of the debug session, asking debugserver to report this in watchpoint set results. Knowing the actual memory region that had to be watched to cover the user’s request would help lldb associate the exception address with the watchpoint.
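
Something like the following is what I have in mind for the Z reply – purely hypothetical syntax, nothing the protocol defines today:

  send:    Z2,100c,8
  receive: OK;actual:1000,20

where “actual:1000,20” would tell lldb that the stub had to watch 0x20 bytes starting at 0x1000 in order to cover the requested 8 bytes at 0x100c.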

I haven’t implemented any of the above yet, but I wanted to reach out and see if anyone has opinions or thoughts about these things. The first one – a large write can touch a watched memory region, but report an exception address before the watched region which lldb doesn’t correctly associate with the watchpoint – is a long-standing issue; usually someone is watching a field of an object, they memset the object contents to zero, and the memset implementation does it as a series of large writes. (e.g. Discourse popped up a “similar issues” window and it shows one of these: 51927 – lldb misses AArch64 stp hardware watchpoint on certain hardware (Neoverse N1).)

Despite this post being about all the drawbacks and complications of mask watchpoints, I am very interested in taking advantage of this hardware feature to allow users to watch larger objects; I think it will improve the feature for lldb.

I think @clayborg @DavidSpickett @labath @jingham might have feedback on these issues, tagging so they see it.

Hello Jason,

Unfortunately, I’m more familiar with this problem than I’d like. Several years ago, I was fixing a bug which sent the Linux kernel into a tailspin if it encountered a watchpoint hit like this. The way I remember it, there is no single guaranteed value of the FAR register in this case. I believe the spec basically allows returning any address which was touched by the instruction – and in fact I have observed different behavior on different chips.

Also unfortunately, the 128-byte (I’m not sure whether you mean byte or bit) “vagueness” is not enough. An instruction like LD4 can access 64 bytes (512 bits) in a single go and DC ZVA can “write” to an entire page at once.

It’s very low-tech, but what I did in my kernel patch was to attribute the watchpoint hit to the closest watchpoint. I’m not exactly proud of that, but it seems to have worked out ok – since we know that the watchpoint was indeed hit, using the closest address was close enough.
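
In pseudo-C++ the heuristic is basically this (a from-memory sketch, not the actual kernel or lldb code; the Watchpoint struct is made up):

  #include <cstdint>
  #include <vector>

  struct Watchpoint {
    uint64_t addr;
    uint64_t size;
  };

  // Pick the watchpoint whose watched range is nearest to the reported
  // exception address; the distance is zero when FAR falls inside a range.
  const Watchpoint *ClosestWatchpoint(uint64_t far,
                                      const std::vector<Watchpoint> &wps) {
    const Watchpoint *best = nullptr;
    uint64_t best_dist = UINT64_MAX;
    for (const auto &wp : wps) {
      uint64_t dist = 0;
      if (far < wp.addr)
        dist = wp.addr - far;
      else if (far >= wp.addr + wp.size)
        dist = far - (wp.addr + wp.size - 1);
      if (dist < best_dist) {
        best_dist = dist;
        best = &wp;
      }
    }
    return best;
  }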

That won’t exactly work in this case because the “logical” watched range is different from the region covered by the watchpoint, but since we’re in LLDB here, we could try to do something smarter and disassemble the offending instruction to see the addresses it really accesses. If you want to handle accesses through DC ZVA, then I think that might be the only way, because the 4k vagueness is just too much.

I think that a setting for watchpoint behavior makes sense. While stopping on writes can be useful, I think the most common use case is for catching the cases where the write actually changes the memory. The more fundamental part of me would want this to be a separate watchpoint type (in addition to read&write – let’s call this “modify”), so you could choose the behavior on a per-watchpoint basis, but maybe that’s a bit too much.

Regarding your third question, I would actually lean towards a different implementation. Instead of the stub trying to bin pack the requested Z range into the hardware watchpoints, I would say that the Z packet should succeed only if the stub was able to cover the requested region exactly (and with a single watchpoint). In line with keeping the stub simple, I would leave the packing problem to the client (which can e.g. ask the user whether it wants a more precise watchpoint (and consume more hardware resources) or a less precise one (and get more false positives)). To enable the client to do that, we could extend the qWatchpointSupportInfo packet to let the stub inform it about the different kinds of watchpoints it can set. (Also, if the stub starts consuming an unpredictable number of watchpoints, then the num field of that packet becomes meaningless.)
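
Something along these lines, say – everything past the existing “num” field is hypothetical syntax that would need to be agreed on:

  send:    qWatchpointSupportInfo:
  receive: num:4;bas:1-8;mask:8-0x80000000;

i.e. four hardware watchpoint registers, BAS-style watchpoints covering 1 to 8 bytes within an aligned doubleword, and MASK-style watchpoints covering power of 2 regions from 8 bytes to 2GB.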

Also, I am somewhat confused. Your post seems to imply that the mask/byte watchpoint type is an all-or-nothing choice. I was under the impression that the hardware allows you to choose this on a per-watchpoint level (so you could keep the byte watchpoints for small regions, but use the mask-based ones for large regions). Am I missing something here?

regards,
Pavel

I think this is “D2.10.5 Determining the memory location that caused a Watchpoint exception” from the ARMARM, particularly “Address recorded for Watchpoint exceptions generated by other instructions”.

So yes, the result can vary between hardware. Between QEMU and a real Neoverse chip there was a difference. IIRC QEMU was the more “accurate” of the two, and the hardware presumably makes sacrifices for speed. Both are compliant.

Last time I tried to figure those rules out, I couldn’t convince myself it was possible to be “correct” any more than choosing the closest watchpoint would be (in fact I think I looked at the kernel for inspiration).

So for a masking watchpoint, it sounds like the vagueness value is needed and relatively easy to calculate.

For the rest of the watchpoints I’m not sure we can find the value. Being conservative, you could go with the “Within a naturally-aligned block of memory that is all of the following” wording from the ARMARM; however, that also says “The size of the block is IMPLEMENTATION DEFINED. There is no architectural means of discovering the size.”

It’s interesting to note that for the memory copy instructions they use a block for which there is a means to read the size. We plan to support those in lldb in the future, so that’s another reason for a vagueness feature.

If you want the vagueness initially just to support these masking watchpoints where we’ve got powers of 2 to deal with, great. It wouldn’t make the rest of the watchpoints any worse.

There are 2 kinds of vagueness going on: your power of 2 rounding, and the reporting of FAR in general (even if FAR were 100% accurate you would still want to record that you had to pad the watched range to a power of 2).

If I watch some unaligned 16 byte range using this MASK watchpoint, it will have some “vagueness” due to being rounded up/aligned down. If I then do a store pair touching some of that vague area, does lldb also need to know that store pairs can be “vague” too?

|------------------------|
    <   watched    >
    |--------------|
  < power of 2 range >
  |------------------|
<stp >
|----|
 ^
 |
 \- Reported watch hit outside of power of 2 range.

Choosing the closest would work here.

But then how do we know the stp touched the watched range, and not just some of the power of 2 range? This is why you want the setting to continue if none of the watched range has in fact changed, correct? Seems logical to me.

This sounds a lot like memory tagging. If you look at ExpandToGranule you see a similar align down/pad up behaviour.

I hoped this would be a good example, but in fact we do this alignment in both lldb and lldb-server: mostly so that lldb can check ahead of time that the final range is tagged. Over in NativeProcess it’s more about allocating enough memory to store the tags we read (and after all that, ptrace also aligns it all).

The GDB protocol doesn’t require you to align the read address or the size requested.

I agree, based on my memory tagging experience. There were enough nice diagnostics on the client side that it made sense to do the alignment there (regardless of what the protocol said).

Would it be possible to do all of the hard work down in debugserver/lldb-server and not return runtime control to the debugger unless the watchpoint should be triggered? One of the main purposes of the GDB remote protocol is to isolate the debugger that uses it from having to know any specifics about a target being debugged or what the OS can or will do in response to hitting a watchpoint. I would like to just set a watchpoint and have it just work with all of the details taken care of for me if possible.

If that isn’t possible, we need to add more data to the qHostInfo or qProcessInfo packets that describes any ways in which the watchpoints need assistance from the debugger when one is triggered. But from reading this post and all the replies, there are so many ways in which things can happen for different OSs and chips that it would make the watchpoint code in LLDB quite a bit more complex and fragile as new features are added for new OSs and chips.

In scenario 1 you talk about a “watchpoint exception address vagueness value to be available”. If lldb has to track this, we might stop many times with a watchpoint exception only to discover we don’t need to stop, and we would need to try to continue, possibly many times, with a thread plan. If we could detect this in the GDB server binary and never bother the debugger, that would be great. When a watchpoint triggers in hardware, do we actually know which watchpoint triggered the stop, or are we left to figure that out manually by inspecting the watchpoint registers?

For scenario 2: “A setting for whether a write watchpoint should silently continue if the watched memory is unmodified?” It depends on what is being written to and what the user wants to know. The memory could be a memory mapped register (not RAM) and writes to a specific address could trigger something to happen. The memory mapped register might also not be readable, so it might not be possible to read the memory to tell whether it has changed. This seems like it should be an option on watchpoints, like “--stop-on-change <true/false>”, which we can default to either true or false. So we can’t always rely on being able to read the memory or on knowing what an access or write to a memory address means.

For scenario 3, where you mention “I wonder if debugserver shouldn’t report the start address & length of memory that is actually being watched. Z2-4 only return “OK”, “”, or “Exx” according to the gdb remote serial protocol docs”: this depends on whether we can encapsulate the watchpoint functionality completely in the GDB server or not. I would vote to do this if possible and not have LLDB know anything about the watchpoint, or the OS or chip specifics of how things were triggered. If the GDB stub can remember “I set this watchpoint on [0x100C-0x1010), but I had to set a hardware watchpoint using some fancy register settings that is actually watching the larger area [0x1000-0x1020), and I should auto-continue if the watchpoint access didn’t cross the [0x100C-0x1010) boundary”, then that is great if we can make it happen. We could also store the memory contents in the GDB remote stub if we want to only stop when the bytes of [0x100C-0x1010) actually change. Note this might not be possible if we need to be able to disassemble code or do some other symbol lookups in order to determine whether the watchpoint should trigger, but it sure would be nice if we could.
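
Roughly the kind of stub-side bookkeeping I am imagining, as a sketch (the StubWatchpoint struct and ReadInferiorMemory helper are made up for illustration, not existing debugserver code; this only shows the “stop on change” half of the decision):

  #include <cstdint>
  #include <utility>
  #include <vector>

  struct StubWatchpoint {
    uint64_t user_addr, user_size;      // what the debugger asked to watch
    uint64_t hw_addr, hw_size;          // what the hardware really watches
    std::vector<uint8_t> last_contents; // user range, captured at last stop
  };

  // Hypothetical helper: read `size` bytes of inferior memory at `addr`.
  std::vector<uint8_t> ReadInferiorMemory(uint64_t addr, uint64_t size);

  // Decide, entirely inside the stub, whether a hardware watchpoint hit
  // should be reported to the debugger or silently continued.
  bool ShouldReportStop(StubWatchpoint &wp) {
    std::vector<uint8_t> now = ReadInferiorMemory(wp.user_addr, wp.user_size);
    bool changed = (now != wp.last_contents);
    wp.last_contents = std::move(now);
    return changed; // unchanged user bytes => auto-continue
  }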

It seems like it will be complex to add all of the logic to LLDB for watchpoints to cover any OS or chip and deal with the many ways things can happen. I believe we already have some complexity that LLDB deals with, like: does a watchpoint actually execute the instruction and then stop, or does it stop before the instruction takes place? We need to know that so we can disable the watchpoint, single step, re-enable it, and then continue. Currently, I believe, LLDB only thinks a watchpoint should stop if it actually caused a read or write access to the affected address range. But these new mask watchpoints will cause us to stop incorrectly because they cover larger areas of memory. If we are watching an area on the stack we could end up stopping a ton more times, so having the GDB remote stub handle all of this inside itself would be beneficial and faster.

I definitely see how implementing this in the GDB stub would be faster by eliminating context switches between LLDB and the stub, but I also think something can be said for keeping the stub simple, as Pavel and David are advocating. But the main reason it sounds like we should do this in LLDB rather than in debugserver is that this is a common problem for all the stubs. I don’t know if MASK watchpoints are supported in lldb-server, but if they are, don’t we have the exact same issue there?