AArch64 watchpoints - reported address outside watched range; adopting MASK style watchpoints

There’s a longstanding problem with AArch64 watchpoints (possibly on other targets too, but I see it with this target in particular). If you watch 4 bytes, say 0x100c - 0x100f, and something does a 16-byte STP write to 0x1000, the FAR register has the value 0x1000, the start of the write, and lldb doesn’t correctly associate the watchpoint hit with our watchpoint at 0x100c; it won’t disable the watchpoint, instruction step, re-enable the watchpoint, and report the changed value (for a ‘write’ watchpoint). To handle this, I’d need to add some target-specific WatchpointAddressVagueness where a watchpoint exception address within some byte range of a known watchpoint is attributed to that watchpoint. I can’t remember if you can write an entire 128 byte neon register to memory in a single instruction, but that would probably be the correct size on this target.

Second, related topic: I’m switching debugserver from using Byte Address Select (BAS) watchpoints on AArch64 (which can watch any bytes within an aligned doubleword) to using MASK watchpoints, which can watch power-of-2 regions of memory from 8 bytes to 2GB. Now I’ve got the problem that an exception within that power-of-2 region may or may not be touching my watched region. E.g. I watch 8 bytes, 0x100c - 0x1013. This requires an 8-byte watchpoint at 0x1008 and an 8-byte watchpoint at 0x1010. If something writes to/accesses 0x1008, my mask watchpoint will be hit and now lldb needs to (1) understand that this is associated with this watchpoint, and (2) decide whether to notify the user about it or not.

(An aside: I have to use two watchpoints for this unaligned region, or I have to use a single 32-byte watchpoint at 0x1000. You can come up with example unaligned buffers that quickly require quite large mask watchpoints to cover with one watchpoint, e.g. 24 bytes at 0x10f0 would need a 512-byte watchpoint at 0x1000 if I did the bits correctly just now. Or I can do it with a 16-byte watchpoint at 0x10f0 and an 8-byte watchpoint at 0x1100.)
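
To make the rounding concrete, here is a rough sketch of the alignment math; this is illustrative only, not debugserver’s implementation, and the names (MaskRegion, cover_with_one_mask_watchpoint) are made up.

    #include <cstdint>
    #include <cstdio>
    #include <utility>

    // Hypothetical helper (not debugserver code): find the smallest power-of-2
    // aligned region, at least 8 bytes, that contains [addr, addr+len), so the
    // request could be covered with one AArch64 MASK watchpoint.
    struct MaskRegion {
      uint64_t base; // power-of-2 aligned start of the hardware-watched region
      uint64_t size; // power-of-2 size, 8 bytes up to 2GB
    };

    static MaskRegion cover_with_one_mask_watchpoint(uint64_t addr, uint64_t len) {
      uint64_t size = 8; // smallest MASK watchpoint
      while (size < (1ULL << 31)) {
        uint64_t base = addr & ~(size - 1); // align the start down to `size`
        if (addr + len <= base + size)      // does the whole request fit?
          return {base, size};
        size <<= 1;                         // otherwise try the next power of 2
      }
      return {addr & ~((1ULL << 31) - 1), 1ULL << 31};
    }

    int main() {
      // 8 bytes at 0x100c -> 32 bytes at 0x1000; 24 bytes at 0x10f0 -> 512 at 0x1000.
      const std::pair<uint64_t, uint64_t> requests[] = {{0x100c, 8}, {0x10f0, 24}};
      for (auto [addr, len] : requests) {
        MaskRegion r = cover_with_one_mask_watchpoint(addr, len);
        std::printf("0x%llx + %llu bytes -> region 0x%llx, %llu bytes\n",
                    (unsigned long long)addr, (unsigned long long)len,
                    (unsigned long long)r.base, (unsigned long long)r.size);
      }
      return 0;
    }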

At first, I thought “well, a write watchpoint which doesn’t change the watched value is a private stop that we don’t tell the user about, right?”, but that is not correct: lldb shows every write to that memory even if the value in memory is unchanged. If the behavior were “only report a public stop when the watched memory value has changed”, then I could sweep up the “watchpoint hits” that are actually accesses to the region I’m mask watching but don’t touch the bytes the user asked to watch. Or possibly, if I know the target’s maximum watchpoint vagueness value from paragraph 1, and the FAR exception address is within the hardware-watched region but further from my watched byte range than the watchpoint vagueness, I can continue silently.
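
As a minimal sketch of that “only stop when the value changed” behavior: assume the debugger keeps a snapshot of the user-requested bytes, then on a hit it could do something like the following (WatchedBytes and read_memory are hypothetical stand-ins, not lldb API).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical "modify"-style check, not lldb code.
    struct WatchedBytes {
      uint64_t addr;                 // the bytes the user asked to watch
      std::vector<uint8_t> snapshot; // contents captured when the wp was (re)enabled
    };

    // Stand-in for however the debugger reads inferior memory.
    bool read_memory(uint64_t addr, void *buf, std::size_t len);

    // Returns true if this stop should be reported to the user.
    bool should_report_hit(WatchedBytes &w) {
      std::vector<uint8_t> now(w.snapshot.size());
      if (!read_memory(w.addr, now.data(), now.size()))
        return true;      // unreadable (e.g. MMIO): be conservative and stop
      if (now == w.snapshot)
        return false;     // value unchanged: treat as a private stop
      w.snapshot = now;   // remember the new value for the next hit
      return true;
    }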

I’ve got the mask watchpoints working in debugserver and am trying to figure out how best to handle these follow-on issues from using this mechanism on AArch64 cleanly, and wanted to see if there are strong opinions. I think we have:

  1. I want a target watchpoint exception address vagueness value to be available: the maximum number of bytes before a watched memory region at which an access can start and still touch my watched region, and so should be associated with the watchpoint. (A rough sketch of how this could be used follows after this list.)

  2. A setting for whether a write watchpoint should silently continue if the watched memory is unmodified? I’m probably going to default this to “silently continue” on Darwin when we’re using mask watchpoints. This doesn’t help read watchpoints, but I think those are a lot less common. I’m sure there are people/use cases where people want to stop on any write, whether it wrote the same value or not (and I’d prefer preserving that behavior, but with mask watchpoints I’m struggling to find any way to disambiguate between a write to the actual watched byte range versus a write elsewhere in the mask address range).

  3. I wonder if debugserver shouldn’t report the start address & length of memory that is actually being watched. Z2-4 only return “OK”, “”, or “Exx” according to the gdb remote serial protocol docs; I wonder about adding something after the ‘OK’. Or possibly this could be a feature negotiated at the beginning of the debug session, where lldb asks debugserver to report it in watchpoint-set results. Knowing the actual memory region that had to be watched to cover the user’s request would help lldb associate the exception address with the watchpoint.
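
Here is the rough sketch promised in item 1, attributing a reported exception address to one of our watchpoints given a vagueness value; the names are hypothetical, not existing lldb code, and when several watchpoints could explain the address it simply picks the closest one.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct WatchedRange {
      uint64_t addr; // bytes the user asked to watch
      uint64_t size;
    };

    // Hypothetical: attribute a watchpoint exception address to one of our
    // watchpoints, allowing the address to fall up to `vagueness` bytes before
    // the watched range (a large store that starts below it).
    std::optional<size_t> attribute_hit(uint64_t far,
                                        const std::vector<WatchedRange> &wps,
                                        uint64_t vagueness) {
      std::optional<size_t> best;
      uint64_t best_distance = ~0ULL;
      for (size_t i = 0; i < wps.size(); ++i) {
        const WatchedRange &w = wps[i];
        uint64_t lo = w.addr > vagueness ? w.addr - vagueness : 0;
        if (far < lo || far >= w.addr + w.size)
          continue; // cannot be explained by this watchpoint
        uint64_t distance = far < w.addr ? w.addr - far : 0;
        if (!best || distance < best_distance) {
          best_distance = distance;
          best = i;
        }
      }
      return best; // nullopt: not explainable by any watchpoint we know about
    }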

I haven’t implemented any of the above yet, but I wanted to reach out and see if anyone has opinions or thoughts about these things. The first one – a large write can touch a watched memory region but report an exception address before the watched region, which lldb doesn’t correctly associate with the watchpoint – is a long standing issue. Usually someone is watching a field of an object, they memset the object contents to zero, and the memset impl does it as a series of large writes. (e.g. Discourse popped up a “similar issues” window showing one of these: 51927 – lldb misses AArch64 stp hardware watchpoint on certain hardware (Neoverse N1).)

Despite this post being about all the drawbacks and complications of mask watchpoints, I am very interested in taking advantage of this hardware feature to allow users to watch larger objects; I think it will improve the feature for lldb.

I think @clayborg @DavidSpickett @labath @jingham might have feedback on these issues, tagging so they see it.

Hello Jason,

unfortunately, I’m more familiar with this problem than I’d like. Several years ago, I was fixing a bug which sent the linux kernel into a tailspin if it encountered a watchpoint hit like this. The way I remember it, there is no single guaranteed value of the FAR register in this case. I believe the spec basically allows returning any address which was touched by the instruction – and in fact I have observed different behavior on different chips.

Also unfortunately, the 128-byte (I’m not sure whether you mean byte or bit) “vagueness” is not enough. An instruction like LD4 can access 64 bytes in a single go and DC ZVA can “write” to an entire page at once.

It’s very low-tech, but what I did in my kernel patch was to attribute the watchpoint hit to the closest watchpoint. I’m not exactly proud of that, but it seems to have worked out ok – since we know that the watchpoint was indeed hit, using the closest address was close enough.

That won’t exactly work in this case because the “logical” watched range is different from the region covered by the watchpoint, but since we’re in LLDB here, we could try to do something smarter, and disassemble the offending instruction to see the addresses that it really accesses. If you want to handle accesses through DC ZVA, then I think that might be the only way, because the 4k vagueness is just too much.

I think that a setting for watchpoint behavior makes sense. While stopping on writes can be useful, I think the most common use case is for catching the cases where the write actually changes the memory. The more fundamental part of me would want this to be a separate watchpoint type (in addition to read&write – let’s call this “modify”), so you could choose the behavior on a per-watchpoint basis, but maybe that’s a bit too much.

Regarding your third question, I would actually lean towards a different implementation. Instead of the stub trying to bin pack the requested Z range into the hardware watchpoints, I would say that the Z packet should succeed only if the stub was able to cover the requested region exactly (and with a single watchpoint). In line with keeping the stub simple, I would leave the packing problem to the client (which can e.g. ask the user whether it wants a more precise watchpoint (and consume more hardware resources) or a less precise one (and get more false positives)). To enable the client to do that, we can extend the qWatchpointSupportInfo packet to let the stub inform it about the different kinds of watchpoints it can set. (Also, if the stub starts consuming an unpredictable number of watchpoints, then the num field of that packet becomes meaningless.)

Also, I am somewhat confused. Your post seems to imply that the mask/byte watchpoint type is an all-or-nothing choice. I was under the impression that the hardware allows you to choose this on a per-watchpoint level (so you could keep the byte watchpoints for small regions, but use the mask-based ones for large regions). Am I missing something here?

regards,
Pavel

I think this is “D2.10.5 Determining the memory location that caused a Watchpoint exception” from the ARMARM, particularly “Address recorded for Watchpoint exceptions generated by other instructions”.

So yes, the result can vary between hardware. Between QEMU and a real Neoverse chip there was a difference. IIRC QEMU was the more “accurate” and the hardware presumably makes sacrifices for speed. Both are compliant.

Last time I tried to figure those rules out, I couldn’t convince myself it was possible to be “correct” any more than choosing the closest watchpoint would be (in fact I think I looked at the kernel for inspiration).

So for a mask watchpoint it sounds like the vagueness value is needed and relatively easy to calculate.

For the rest of the watchpoints I’m not sure we can find the value. Being conservative, you could go with the “Within a naturally-aligned block of memory that is all of the following” wording from the ARMARM, however that also says “The size of the block is IMPLEMENTATION DEFINED. There is no architectural means of discovering the size.”.

Interesting to note that for memcopy instructions they use a block for which there is a means to read the size. We plan to support those in lldb in the future, so that’s another reason for a vagueness feature.

If you want the vagueness initially just to support this mask watchpoint case, where we’ve got powers of 2 to deal with, great. It wouldn’t make the rest of the watchpoints any worse.

There are 2 kinds of vagueness going on: your power of 2 rounding, and the reporting of FAR in general (even if FAR were 100% accurate you would still want to record that you had to pad the watched range to a power of 2).

If I watch some unaligned 16 byte range using this MASK watchpoint, it will have some “vagueness” due to being rounded up/aligned down. If I then store pair touching some of that vague area, does lldb also need to know that store pairs can be “vague” too?

|------------------------|
    <   watched    >
    |--------------|
  < power of 2 range >
  |------------------|
<stp >
|----|
 ^
 |
 \- Reported watch hit outside of power of 2 range.

Choosing the closest would work here.

But then how do we know the stp touched the watched range, and not just some of the power of 2 range? This is why you want the setting to continue if none of the watched range has in fact changed, correct? Seems logical to me.

This sounds a lot like memory tagging. If you look at ExpandToGranule you see a similar align down/pad up behaviour.

I hoped this would be a good example but in fact, we do this alignment in lldb and lldb-server. Mostly so that lldb can check ahead that the final range is tagged. Over in NativeProcess it’s more about allocating enough memory to store the tags we read (and after all that, ptrace also aligns it all).
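
For reference, the align down / pad up behaviour being compared to looks roughly like this (a generic sketch in the spirit of ExpandToGranule, not lldb’s MemoryTagManager code):

    #include <cstdint>
    #include <utility>

    // Expand [addr, addr+len) so that both ends sit on a granule boundary,
    // e.g. the 16-byte MTE tag granule. Assumes granule is a power of 2.
    std::pair<uint64_t, uint64_t> expand_to_granule(uint64_t addr, uint64_t len,
                                                    uint64_t granule = 16) {
      uint64_t start = addr & ~(granule - 1);                     // align down
      uint64_t end = (addr + len + granule - 1) & ~(granule - 1); // round up
      return {start, end - start};
    }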

The GDB protocol doesn’t require you to align the read address or the size requested.

I agree, based on my memory tagging experience. There were enough nice diagnostics on the client side that it made sense to do the alignment there (regardless of what the protocol said).

Would it be possible to do all of the hard work down in debugserver/lldb-server and not return runtime control to the debugger unless the watchpoint should be triggered? One of the main purposes of the GDB remote protocol is to isolate the debugger that uses it from having to know any specifics about a target being debugged or what the OS can or will do in response to hitting a watchpoint. I would like to just set a watchpoint and have it just work with all of the details taken care of for me if possible.

If that isn’t possible we need to add more data to the qHostInfo or qProcessInfo packets that describe any ways in which the watchpoints need assistance from the debugger if a watchpoint is triggered. But from reading this post and all the replies, there are so many various ways in which things can happen for different OSs and chips that it would make watchpoint code in LLDB quite a bit more complex and fragile as new features are added for new OSs and chips.

In scenario 1 you talk about a “watchpoint exception address vagueness value to be available”. If lldb has to track this we might stop many times with a watchpoint exception only to discover we don’t need to stop, and we would need to try and continue, possibly many times, with a thread plan. If we could detect this in the GDB server binary and never bother the debugger, that would be great. When a watchpoint triggers through hardware, do we actually know which watchpoint triggered the stop or are we left to figure that out manually by inspecting the watchpoint registers?

For scenario 2: “A setting for whether a write watchpoint should silently continue if the watched memory is unmodified?”. It depends on what is being written to and what the user wants to know. The memory could be a memory mapped register (not RAM) and writes to a specific address could trigger something to happen. The memory mapped register might also not be readable, so it might not be possible to read the memory to tell if it has changed or not. This seems like it should be an option on watchpoints like “--stop-on-change <true/false>” and we can default it to either true or false. So we can’t always rely on being able to read the memory or knowing what a memory address access or write means.

For scenario 3, where you mention “I wonder if debugserver shouldn’t report the start address & length of memory that is actually being watched. Z2-4 only return “OK”, “”, or “Exx” for their returns according to the gdb remote serial protocol docs”: this depends on whether we can encapsulate the watchpoint functionality into the GDB server completely or not. I would vote to do this if possible and not have LLDB know anything about the watchpoint or the OS or chip specifics of how things were triggered. If the GDB stub can remember “I set this watchpoint on [0x100C-0x1010), but I had to set a hardware watchpoint using some fancy register settings that is actually watching the larger area [0x1000-0x1020), and I should auto continue if the watchpoint access didn’t cross the [0x100C-0x1010) boundary”, then that is great if we can make it happen. We could also store the memory contents in the GDB remote stub if we want to only stop when the bytes of [0x100C-0x1010) actually change. Note this might not be possible if we need to be able to disassemble code or do some other symbol lookups in order to determine if the watchpoint would trigger or not, but it sure would be nice if we could.

It seems like it will be complex to add all of the logic in LLDB for watchpoints to cover any OS or chip and deal with the complex ways things can happen. I believe we already have some complexity that LLDB deals with, like: does a watchpoint actually execute the instruction and then stop, or does it stop before the instruction takes place? We need to know that so we can disable the watchpoint, then single step, then re-enable and continue. Currently, I believe, LLDB only thinks a watchpoint should stop if it actually caused a read or write access to the affected address range. But these new mask watchpoints will cause us to stop incorrectly because they cover larger areas of memory. If we are watching an area on the stack we could end up stopping a ton more times, so having the GDB remote stub handle this all inside of itself would be beneficial and faster.

I definitely see how implementing this in the GDB stub would be faster by eliminating context switches between LLDB and the stub, but I also think something can be said for keeping the stub simple, as Pavel and David are advocating. But the main reason it sounds like we should do this in LLDB rather than in debugserver is that this is a common problem for all the stubs. I don’t know if MASK watchpoints are supported in lldb-server, but if they are, don’t we have the exact same issue there?

Hi all! Coming back to this – I apologize for the delay, but I was unhappy with my proposed ideas, and all of your valuable feedback forced me to think about them more and become less happy. I’ve been rolling this around in my head for the past month and I have, I think, a much clearer idea and also some interesting asides.

My new idea is this: WatchpointLocations.

User watches a ptrsize object unaligned; that’s going to take two physical watchpoints. They’ll get one Watchpoint, and it will have two WatchpointLocations – each watching half-ptrsize. (debugserver does this internally right now; lldb is unaware of it.)

User watches a 4*ptrsize object; we can have 4 ptrsize WatchpointLocations to handle this. If the user tries to set a second ptrsize watchpoint, they will get an error that no watchpoint registers are available, and can do watch list to see that they’ve used 4 watchpoints for that object. Maybe they can disable one of the 4 that they weren’t really interested in. lldb does not do this today; users have to set these up as separate watchpoints manually.

User watches a 96 byte object on AArch64 and a stub that can do MASK watchpoints. That’s a single 128 byte MASK watchpoint, if the object is aligned within a single 128 byte region. read and write type watchpoints cannot tell if your actual 96 bytes were touched, but modify watchpoints (Pavel’s idea, I like it) where lldb checks if the 96 bytes were modified, could hide false positive stops (and writes that don’t actually change the values). I would make modify the default type.

I think lldb should assume that a remote stub can set some number of ptrsize watchpoints, by default. Architectures can override this if watchpoints in general have differing capabilities. (today, lldb will let you set a watchpoint 1, 2, 4, 8 bytes in size and it will send those to the remote stub and hope it is allowed) We can have some overhaul of qWatchpointSupportInfo where an AArch64 stub could tell lldb that it can do both BAS and MASK watchpoints, and some AArch64 specific Architecture thing in lldb would understand what a BAS and MASK watchpoint can do, so if the user watches 2 bytes, BAS. If the user watches 32 bytes, we can do it with one MASK watchpoint, or with 4 BAS watchpoints if that’s the only type supported.

I haven’t thought through the details of where this would be located in lldb etc., but I like this approach. We could say “qWatchpointSupportInfo should have some general way of describing the capabilities of watchpoints on this target”, but I don’t know the universe of what different CPUs can do. E.g. on Intel iirc you can watch 1, 2, 4, 8 bytes in an aligned region, but you can’t do something like an AArch64 BAS watchpoint and watch 3 bytes. If you want to watch 3 bytes on Intel, you have to watch 4 bytes and hide accesses to byte 4 (similar to watching a 96 byte object with a 128 byte AArch64 MASK watchpoint). watch list may help the user understand why stops are happening when their buffer wasn’t actually accessed.
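
To sketch what the split into WatchpointLocations might look like (hypothetical types and function, not existing lldb classes): one MASK region covering the whole request when the stub supports it, otherwise one BAS-sized piece per doubleword the request touches.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct WatchpointLocation {
      uint64_t addr;
      uint64_t size;
    };

    // Hypothetical splitter: one MASK region covering everything if the stub
    // supports it, otherwise one location per aligned doubleword the request
    // touches (each watching only the requested bytes within that doubleword).
    std::vector<WatchpointLocation>
    make_locations(uint64_t addr, uint64_t len, bool stub_supports_mask) {
      std::vector<WatchpointLocation> locs;
      if (stub_supports_mask) {
        uint64_t size = 8; // grow a power-of-2 region until the request fits
        while ((addr & ~(size - 1)) + size < addr + len)
          size <<= 1;
        locs.push_back({addr & ~(size - 1), size});
        return locs;
      }
      uint64_t cur = addr;
      while (cur < addr + len) {
        uint64_t dword_end = (cur & ~7ULL) + 8;
        uint64_t piece_end = std::min(dword_end, addr + len);
        locs.push_back({cur, piece_end - cur}); // a BAS-style piece
        cur = piece_end;
      }
      return locs;
    }

With this, the unaligned ptrsize example above (8 bytes at 0x100c) comes back as the two locations {0x100c, 4} and {0x1010, 4}.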

In browsing lldb’s existing code, I came across an interesting MIPS change for watchpoints, where the low 3 bits of the address that caused the fault are masked off. To match the fault address against the watchpoint list, Jaydeep Patil in 2015 (D11672, “[MIPS] Handle false positives for MIPS hardware watchpoints”) added support to EmulateInstructionMIPS64 to recognize load/store instructions and calculate the memory address by decoding them. (This would be a lot more work on something like AArch64.)

I also found some very interesting text in the Scalable Matrix Extension docs from Arm (Documentation – Arm Developer). In C3.2.1 “Watchpoints”, it describes several caveats to how watchpoints behave when an AArch64 processor is in Streaming SVE mode (a processor state). In this mode, if an instruction accesses a byte within a 16 byte region where a watchpoint is set, a watchpoint fault may be generated. If I watch bytes 0x1000-0x1007 and the processor is in Streaming SVE mode and something writes to 0x1008, I can get a watchpoint fault with 0x1008 reported (or some address within the range of the write; I think David was pointing out that it’s not guaranteed to be the start of the write).

That means when we talk about “take the FAR address and look for the nearest watchpoint”, we may get a FAR address past the watched memory region, within that 16B region. In paragraph Rcxzcy it talks about how in SVE mode or with an SME store, the FAR is within the range of the start of the write to the highest watchpointed address, so accesses which start before the watched region won’t get an address past the watched region.
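
If lldb wanted to classify such a hit, a sketch of the check might look like this (the 16-byte granule comes from that section; the function name and exact policy are made up):

    #include <cstdint>

    // A reported FAR can fall anywhere in a 16-byte granule that merely
    // overlaps the watched range while in Streaming SVE mode, so a hit whose
    // FAR is outside the watched bytes but inside an overlapping granule can
    // only be classified as a possible false positive.
    bool possibly_false_positive(uint64_t far, uint64_t wp_addr, uint64_t wp_len) {
      uint64_t granule = far & ~15ULL; // 16-byte granule containing FAR
      uint64_t wp_end = wp_addr + wp_len;
      bool granule_overlaps_wp = granule < wp_end && wp_addr < granule + 16;
      bool far_inside_wp = far >= wp_addr && far < wp_end;
      return granule_overlaps_wp && !far_inside_wp;
    }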

This section also talks about status fields in ESR_EL1:
“Watchpoint number Valid” (when set, we will be given the number of the watchpoint that triggered the fault - no need to map the address back to a watchpoint. When multiple watchpoints were touched in the memory access, it is undefined which one will be reported, I think).
“Watchpoint might be false-positive” (the 16-byte granule issue)
“FAR not Precise”
“FAR not Valid”

“Watchpoint number Valid” (WPTV) and “FAR not Valid” (FnV) are exclusive; you either get an address that was accessed within the range of the access instruction (maybe outside the actual watched memory range), or you get the watchpoint number (in ESR_ELx.WPT).

I think this WPTV v. FnV behavior difference may be in effect on these processors when they are not in Streaming SVE mode. The 16 byte granule issue is specific to Streaming SVE mode (and we’ll have ESR_EL1 / EDHSR WPF (“watchpoint might be false-positive”) or FnP (“FAR not Precise”) when the fault address is reported in FAR).

Anyway, some more exciting examples of AArch64 watchpoint things we’ll need to handle at some point in the future, adding a bit more complexity to watchpoints. But I think it does help bolster the case for a modify style watchpoint that stops when the memory region we’re watching has actually changed value; even a simple 8-byte BAS watchpoint can get triggered by a nearby access that doesn’t touch the 8 bytes (as opposed to our current issue where a large write starts earlier than an 8-byte watched region, but does write to it, so we get a correct watchpoint fault and the FAR address is reported at the start of the write, before our 8-byte watched region).

What do people think about the idea of WatchpointLocations? Some Architecture specific plugin ability for an Architecture to take a user requested watchpoint (“watch this 24 byte object”) and turn that into concrete watchpoints that a simple stub could implement given that CPU’s capabilities, and possibly based on what types of watchpoints the stub supports implementing.

I like the watchpoint location idea a lot.

Regarding the architecture-specific part, I am wondering if we can come up with some way of abstractly representing the set of possible watchpoints, such that the lldb code can be fully generic. One reason for that is that the set of supported watchpoint types is not purely a function of the CPU architecture. For example, on linux, all watchpoint setup has to go through the kernel, and the kernel enforces additional restrictions:

  • MASK watchpoints are not supported
  • BAS watchpoints are limited to consecutive bytes only (so you can e.g. watch 0x1003 and 0x1004 in a single go, but not 0x1003 and 0x1005)

(In addition to that, we also have a somewhat unusual internal stub, which has some other restrictions, so that the cpu+os combination would not be completely representative as a key.)

What if we used a sequence of triplets (triples?) like this:

  • the region size
  • the region alignment
  • the number of consecutive watched bytes in the region (or * if they don’t have to be consecutive, like in the BAS case)

So a regular x86_64 cpu would have a description like 8/8/8;4/4/4;2/2/2;1/1/1, an AArch64 cpu would have something like 8/8/*;16/16/16;32/32/32;..., and an AArch64 cpu on linux would have 8/8/1;8/8/2;8/8/3;8/8/4;8/8/5;8/8/6;8/8/7;8/8/8.

The trickiest part here would be coming up with a generic algorithm to determine which kind of watchpoint to use for a given combination of user requests. However, I don’t think we need to have something which would produce the optimal solution for every possible combination of inputs. I think we could just do some kind of greedy allocation by default and let people optimize it for the particular combination of watchpoint types that they care about.
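
To make that concrete, here is a hypothetical sketch of parsing such a capability description and doing a greedy pick for one contiguous request; the packet format and all names are invented here, the numeric fields are assumed to be well-formed, and the consecutive-bytes constraint is ignored for brevity.

    #include <cstdint>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical: one entry per supported watchpoint kind, parsed from a
    // description like "8/8/*;16/16/16;32/32/32".
    struct WatchpointKind {
      uint64_t size;        // region size
      uint64_t alignment;   // required alignment of the region (power of 2)
      bool any_bytes;       // "*": any subset of bytes in the region can be watched
      uint64_t consecutive; // otherwise: max run of consecutive watched bytes
    };

    std::vector<WatchpointKind> parse_capabilities(const std::string &desc) {
      std::vector<WatchpointKind> kinds;
      std::istringstream entries(desc);
      std::string entry;
      while (std::getline(entries, entry, ';')) {
        std::istringstream fields(entry);
        std::string size, align, bytes;
        if (!std::getline(fields, size, '/') || !std::getline(fields, align, '/') ||
            !std::getline(fields, bytes, '/'))
          continue; // wrong number of fields (e.g. a trailing "..."): skip it
        WatchpointKind k;
        k.size = std::stoull(size);
        k.alignment = std::stoull(align);
        k.any_bytes = (bytes == "*");
        k.consecutive = k.any_bytes ? k.size : std::stoull(bytes);
        kinds.push_back(k);
      }
      return kinds;
    }

    // Greedy pick for one contiguous request: the smallest kind whose aligned
    // region can contain [addr, addr+len). Ignores the consecutive-bytes field.
    const WatchpointKind *pick_kind(const std::vector<WatchpointKind> &kinds,
                                    uint64_t addr, uint64_t len) {
      const WatchpointKind *best = nullptr;
      for (const auto &k : kinds) {
        uint64_t base = addr & ~(k.alignment - 1);
        if (addr + len > base + k.size)
          continue;
        if (!best || k.size < best->size)
          best = &k;
      }
      return best;
    }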

I think we should keep doing that (sending the user’s watchpoint request to the stub and hoping it is allowed), for compatibility with stubs which don’t support our (new) qWatchpointSupportInfo packet.

Thanks for the feedback Pavel, I’m glad to hear the idea made sense to someone else too. :slight_smile: I was chatting with Jim Ingham about this a bit and he also liked the idea of the gdb remote stub having a way to describe its watchpoint capabilities in a more generic way than an aarch64 stub saying “I do BAS and MASK” and some aarch64 specific bit in lldb knowing what that means when it creates WatchpointLocations. I’m less convinced that this is a good idea, but both of you immediately went there, so it’s probably the right thing to try.

FWIW I don’t think the requirement that watched bytes be contiguous is specific to the linux kernel; the ARM ARM says “If the BAS field is programmed with more than one byte, the bytes that it is programmed with must be contiguous. For watchpoint behavior when its BAS field is programmed with non-contiguous bytes, see Other usage constraints on page D2-4739.” (and D2-4739 doesn’t say anything about this, I don’t understand why it sends us there).
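
For readers who haven’t programmed BAS watchpoints, a toy version of the byte-select computation for a range that fits in one aligned doubleword might look like this (illustrative only, not how debugserver actually programs the registers):

    #include <cstdint>
    #include <optional>

    // DBGWCR<n>.BAS has one bit per byte of the aligned doubleword at
    // DBGWVR<n>; the architecture requires the selected bytes to be contiguous.
    std::optional<uint8_t> bas_for_range(uint64_t addr, uint64_t len) {
      uint64_t dword = addr & ~7ULL; // what would go in DBGWVR
      uint64_t first = addr - dword; // first watched byte within the doubleword
      if (len == 0 || first + len > 8)
        return std::nullopt;         // needs a second watchpoint (or a MASK one)
      // `len` consecutive bits starting at bit `first`; e.g. watching
      // 0x100c-0x100f gives DBGWVR 0x1008 and BAS 0b11110000.
      return (uint8_t)(((1u << len) - 1) << first);
    }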

It’s a lot of work to do it all, but if it comes up a lot we could handle specific instructions. For example store pair can have issues, but I’ve not had anyone complain to me about it, yet. If one day clang starts emitting tons of those, we can deal with it.

It’s not ideal but the architecture doesn’t provide a 100% accurate method either. So I think it’s justified.

Yes, and sometimes the “block size” the reported address can fall within is not easily readable (though you could make a reasonable guess, especially if you happen to be the cpu vendor too :slight_smile: ).

I also like this direction.

I should clarify, I’m not against the remote stub making the decisions about what to place where and when to skip or not. The speed argument is a good one.

As long as from lldb one can understand that the remote is doing that. The watchpoint locations idea sounds like exactly that, informing the user that the watchpoints in the client are essentially “virtual” and may be implemented differently by the stub.

That’s what I meant with the memory tagging example. Yes the remote is going to do all the alignment for you, but users would be very confused if we didn’t show what had been done, and why.
(though in that case, the work is very cheap to do)