AArch64 watchpoints - reported address outside watched range; adopting MASK style watchpoints

Hi all! Coming back to this – I apologize for the delay, but I was unhappy with my proposed ideas, and all of your valuable feedback forced me to think about them more and become less happy. I’ve been rolling this around in my head for the past month and I have, I think, a much clearer idea and also some interesting asides.

My new idea is this: WatchpointLocations.

User watches an unaligned ptrsize object: that’s going to take two physical watchpoints. They’ll get one Watchpoint, and it will have two WatchpointLocations, each watching half-ptrsize. (debugserver does this internally right now; lldb is unaware of it)

User watches a 4*ptrsize object, we can have 4 ptrsize WatchpointLocations to handle this. If the user tries to set a second ptrsize watchpoint, they will get an error that no watchpoint registers are available, and can do watch list to see that they’ve used 4 watchpoints for that object. Maybe they can disable one of the 4 that they weren’t really interested in. lldb does not do this today, users have to do it as separate watchpoints manually.
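To make the two cases above concrete, here is a minimal sketch of how one user-level watchpoint could be broken into WatchpointLocations. The function name, the `(address, size)` tuple representation, and the 8-byte `PTR_SIZE` are all made up for illustration; nothing like this exists in lldb today.

```python
from typing import List, Tuple

PTR_SIZE = 8  # assumption: 64-bit target


def split_into_locations(addr: int, size: int,
                         ptr_size: int = PTR_SIZE) -> List[Tuple[int, int]]:
    """Split [addr, addr+size) into pieces that never cross a
    ptr_size-aligned boundary. Each piece is one hypothetical
    WatchpointLocation, i.e. one physical watchpoint register."""
    locations = []
    end = addr + size
    while addr < end:
        # End of the aligned ptr_size slot that contains addr.
        slot_end = (addr // ptr_size + 1) * ptr_size
        piece_end = min(slot_end, end)
        locations.append((addr, piece_end - addr))
        addr = piece_end
    return locations
```

An unaligned 8-byte watch at 0x1004 comes out as two 4-byte locations, and an aligned 32-byte watch at 0x1000 comes out as four 8-byte locations, which would use up all four registers on a typical 4-register target.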

User watches a 96 byte object on AArch64 with a stub that can do MASK watchpoints. That’s a single 128 byte MASK watchpoint, if the object fits within a single aligned 128 byte region. read and write type watchpoints cannot tell if your actual 96 bytes were touched, but modify watchpoints (Pavel’s idea, I like it), where lldb checks if the 96 bytes were modified, could hide false positive stops (and also writes that don’t actually change the values). I would make modify the default type.
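A rough sketch of the "which MASK region covers this object" calculation, assuming a MASK watchpoint covers a power-of-2 sized, naturally aligned region with an 8 byte minimum. `mask_region` is a hypothetical helper name, not an lldb function.

```python
from typing import Tuple


def mask_region(addr: int, size: int, min_size: int = 8) -> Tuple[int, int]:
    """Return (base, region_size) of the smallest power-of-2 sized,
    naturally aligned region that contains [addr, addr+size)."""
    region = min_size
    while True:
        base = addr & ~(region - 1)  # align addr down to the region size
        if base + region >= addr + size:
            return base, region
        region *= 2
```

A 96 byte object at 0x1000 fits in the 128 byte region at 0x1000. The same object at 0x1040 straddles a 128 byte boundary, so the smallest single MASK region is 256 bytes at 0x1000, and the number of bytes we are watching that the user did not ask for goes up accordingly.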

I think lldb should assume that a remote stub can set some number of ptrsize watchpoints, by default. Architectures can override this if watchpoints in general have differing capabilities. (today, lldb will let you set a watchpoint 1, 2, 4, 8 bytes in size and it will send those to the remote stub and hope it is allowed) We can have some overhaul of qWatchpointSupportInfo where an AArch64 stub could tell lldb that it can do both BAS and MASK watchpoints, and some AArch64 specific Architecture thing in lldb would understand what a BAS and MASK watchpoint can do, so if the user watches 2 bytes, BAS. If the user watches 32 bytes, we can do it with one MASK watchpoint, or with 4 BAS watchpoints if that’s the only type supported.
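Something like the following is what I picture for the AArch64-specific piece. This is only a sketch: `plan_watchpoints`, the `supports_mask` flag, and the `("BAS"|"MASK", address, size)` tuples are invented here, and it assumes a BAS watchpoint can select any bytes within one aligned 8 byte doubleword.

```python
from typing import List, Tuple

DWORD = 8  # assumption: a BAS watchpoint covers bytes within one aligned doubleword


def plan_watchpoints(addr: int, size: int,
                     supports_mask: bool) -> List[Tuple[str, int, int]]:
    """Turn a user request into a list of (kind, address, size) entries."""
    end = addr + size
    # Fits inside a single aligned doubleword: one BAS watchpoint is enough.
    if addr // DWORD == (end - 1) // DWORD:
        return [("BAS", addr, size)]
    if supports_mask:
        # Smallest power-of-2 aligned region covering the whole request.
        region = DWORD
        while (addr & ~(region - 1)) + region < end:
            region *= 2
        return [("MASK", addr & ~(region - 1), region)]
    # No MASK support: fall back to one BAS watchpoint per doubleword touched.
    plan = []
    while addr < end:
        piece_end = min((addr // DWORD + 1) * DWORD, end)
        plan.append(("BAS", addr, piece_end - addr))
        addr = piece_end
    return plan
```

Whether the resulting plan fits in the number of registers the stub reports is a separate check that would happen after this.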

I haven’t thought through the details of where this would be located in lldb etc but I like this approach. We could say “qWatchpointSupportInfo should have some general way of describing the capabilities of watchpoints on this target”, but I don’t know the universe of what different CPUs can do. e.g. on Intel iirc you can watch 1, 2, 4, 8 bytes in an aligned region, but you can’t do like AArch64 BAS watchpoint and watch 3 bytes. If you want to watch 3 bytes on Intel, you have to watch 4 bytes and hide accesses to byte 4 (similar to watching a 96 byte object with a 128 byte AArch64 MASK watchpoint). watch list may help the user to understand why stops are happening when their buffer wasn’t actually accessed.

In browsing lldb’s existing code, I came across an interesting MIPS change for watchpoints, where the low 3 bits of the address that caused the fault are masked. To match the fault address against the watchpoint list, Jaydeep Patil in 2015 (D11672, “[MIPS] Handle false positives for MIPS hardware watchpoints”) added support to EmulateInstructionMIPS64 to recognize load/store instructions and calculate the memory address by decoding them. (this would be a lot more work on something like AArch64)

I also found some very interesting text in the Scalable Matrix Extension docs from ARM, on the Arm Developer documentation site. In C3.2.1 “Watchpoints”, it describes several caveats to how watchpoints behave when an AArch64 processor is in Streaming SVE mode (a processor state). In this mode, if an instruction accesses a byte within a 16 byte region where a watchpoint is set, a watchpoint fault may be generated. If I watch bytes 0x1000-0x1007 and the processor is in Streaming SVE mode and something writes to 0x1008, I can get a watchpoint fault and have 0x1008 reported as the fault address (or some address within the range of the write; I think David was pointing out that it’s not guaranteed to be the start of the write).

That means when we talk about “take the FAR address and look for the nearest watchpoint”, we may get a FAR address past the watched memory region, within that 16B region. In paragraph Rcxzcy it talks about how in SVE mode or with an SME store, the FAR is within the range of the start of the write to the highest watchpointed address, so accesses which start before the watched region won’t get an address past the watched region.
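For reference, the "nearest watchpoint" lookup I have in mind is roughly this. It is a sketch only; the `(address, size)` tuples and function names are hypothetical, and it does not try to model the 16 byte granule rule beyond picking the closest watched range.

```python
from typing import List, Optional, Tuple


def distance_to_range(far: int, addr: int, size: int) -> int:
    """Distance in bytes from far to the watched range, 0 if inside it."""
    if far < addr:
        return addr - far
    last = addr + size - 1
    if far > last:
        return far - last
    return 0


def nearest_watchpoint(far: int,
                       watchpoints: List[Tuple[int, int]]
                       ) -> Optional[Tuple[int, int]]:
    """Pick the watchpoint whose range is closest to the reported FAR."""
    return min(watchpoints,
               key=lambda wp: distance_to_range(far, wp[0], wp[1]),
               default=None)
```

With watchpoints at 0x1000-0x1007 and 0x2000-0x2007, a FAR of 0x1008 is one byte past the first range, so the first watchpoint is chosen even though the address is not inside it.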

This section also talks about status fields in ESR_EL1:
“Watchpoint number Valid” (when set, we will be given the watchpoint number that triggered the fault - no need to address map. When there are multiple watchpoints that were touched in the memory access, it is undefined which one will be reported, I think).
“Watchpoint might be false-positive” (the 16-byte granule issue)
“FAR not Precise”
“FAR not Valid”

“Watchpoint number Valid” (WPTV) and “FAR not Valid” (FnV) are exclusive; you either get an address that was accessed within the range of the access instruction (maybe outside the actual watched memory range), or you get the watchpoint number (in ESR_ELx.WPT).

I think this WPTV v. FnV behavior difference may be in effect on these processors when they are not in Streaming SVE mode. The 16 byte granule issue is specific to Streaming SVE mode (and we’ll have ESR_EL1 / EDHSR WPF (“watchpoint might be false-positive”) or FnP (“FAR not Precise” when the fault address is reported in FAR)).
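If a stub passed these fields along, the decision in lldb could look something like the sketch below. The `WatchpointSyndrome` class is a made-up stand-in for already-decoded ESR fields (I am deliberately not encoding bit positions here), and the watchpoint list is again just `(address, size)` tuples indexed by hardware watchpoint number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WatchpointSyndrome:
    """Hypothetical, already-decoded view of the watchpoint fault status."""
    wptv: bool  # "Watchpoint number Valid"
    wpt: int    # watchpoint number, meaningful only when wptv is set
    fnv: bool   # "FAR not Valid"
    far: int    # fault address, meaningful only when fnv is clear


def resolve_hit(syn: WatchpointSyndrome,
                watchpoints: List[Tuple[int, int]]) -> Optional[int]:
    """Return the index of the watchpoint that was hit, or None if unknown."""
    if syn.wptv:
        # The hardware told us which watchpoint fired; no address mapping.
        return syn.wpt if 0 <= syn.wpt < len(watchpoints) else None
    if syn.fnv:
        # Neither a watchpoint number nor a usable address.
        return None
    # Fall back to matching the FAR against the watched ranges.
    for index, (addr, size) in enumerate(watchpoints):
        if addr <= syn.far < addr + size:
            return index
    return None
```

The exact-range match in the last step is the part that would need to become a nearest-watchpoint search to cope with a FAR that lands outside the watched bytes.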

Anyway, some more exciting examples of AArch64 watchpoint things we’ll need to handle at some point in the future, adding a bit more complexity to watchpoints. But I think it does help bolster the case for a modify style watchpoint that stops when the memory region we’re watching has actually changed value; even a simple 8-byte BAS watchpoint can get triggered by a nearby access that doesn’t touch the 8 bytes. (as opposed to our current issue where a large write starts earlier than an 8-byte watched region, but does write to it, so we get a correct watchpoint fault and the FAR address is reported at the start of the write, before our 8-byte watched region)
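The modify check itself is simple. A minimal sketch, assuming lldb keeps a snapshot of the watched bytes and has some `read_memory(addr, size)` callback available (both the class and the callback are hypothetical names):

```python
from typing import Callable


class ModifyWatch:
    """Stops only when the watched bytes differ from the last snapshot."""

    def __init__(self, addr: int, size: int,
                 read_memory: Callable[[int, int], bytes]):
        self.addr = addr
        self.size = size
        self.read_memory = read_memory
        self.snapshot = read_memory(addr, size)

    def should_stop(self) -> bool:
        """Call after a hardware watchpoint fault; True if the value changed."""
        current = self.read_memory(self.addr, self.size)
        changed = current != self.snapshot
        self.snapshot = current  # the next fault compares against this value
        return changed
```

With a 96 byte object watched through a 128 byte MASK watchpoint, a write to byte 100 faults in hardware but `should_stop()` returns False, so lldb could silently resume. A write that stores the same value that was already there is hidden the same way.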

What do people think about the idea of WatchpointLocations? Some Architecture specific plugin ability for an Architecture to take a user requested watchpoint (“watch this 24 byte object”) and turn that into concrete watchpoints that a simple stub could implement given that CPU’s capabilities, and possibly based on what types of watchpoints the stub supports implementing.