[RFC] Memory annotations for llvm-exegesis

Authors: Aiden Grossman (agrossman154 at yahoo.com), Ondrej Sykora (ondrasej at google.com).

Motivation

llvm-exegesis can run entire snippets of code, including marking specific registers as live-in or setting them to specific values through the LLVM-EXEGESIS-LIVEIN and LLVM-EXEGESIS-DEFREG annotations, respectively. However, there is no similar support for benchmarking snippets that require specific regions of memory to be initialized. There is the memory scratch space, but it allocates a single block of memory at an arbitrary address and passes it through a specific register (RDI), which does not give a high degree of control. Many basic blocks require memory to be set up in an arbitrary fashion before they can run without crashing, and when benchmarking an arbitrary snippet we rarely have much control over the memory regions it accesses: an address could be the result of a longer computation rather than being used directly, it could be RIP-relative, or it could even be static. Memory annotations would significantly expand the scope of the snippets that can be executed and benchmarked with llvm-exegesis.

Planned Annotations

We plan on implementing three new annotations, LLVM-EXEGESIS-MEM-DEF, LLVM-EXEGESIS-MEM-MAP, and LLVM-EXEGESIS-SNIPPET-ADDRESS.

  • LLVM-EXEGESIS-MEM-DEF will allow defining a page of memory initialized with a specific value, using similar logic to LLVM-EXEGESIS-DEFREG. The specified value is an arbitrary number of bytes that is repeated over the block of memory if it is smaller than the block. The syntax would be LLVM-EXEGESIS-MEM-DEF <name of page for future reference> <size> <value>.
  • LLVM-EXEGESIS-MEM-MAP <name> <address-or-register> will allow mapping a previously created page into the execution context at a specific address. The name refers to one of the previous memory definitions. The second argument is either a specific address to map the page to, or a register in which an arbitrary address pointing to that memory will be placed. The annotation can appear multiple times for the same memory definition, mapping it to multiple virtual addresses; if no data cache misses are desired while benchmarking, all of the required virtual addresses can thus be backed by a single page. This makes sure everything fits easily in the L1D cache on processors with a physically tagged data cache (e.g., recent Intel x86 microarchitectures).
  • LLVM-EXEGESIS-SNIPPET-ADDRESS will allow for setting the address of the code snippet to be executed. This can be useful for code using relative addressing with potential memory accesses happening relative to the address of the snippet.
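As a concrete sketch of how these annotations could look in a snippet file (the name, size, value, and address below are hypothetical, and the final accepted syntax is whatever gets implemented), a snippet that loads from a fixed address might be annotated as:

```asm
# LLVM-EXEGESIS-MEM-DEF page1 4096 0012345600123456
# LLVM-EXEGESIS-MEM-MAP page1 1048576
movq $1048576, %rax
movq (%rax), %rdx
```

Here page1 is a 4096-byte definition filled with the repeated value and mapped at address 1048576, so the load in the snippet hits initialized memory instead of faulting.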

Implementation Details

  • This work also introduces several new failure modes (and makes some existing ones significantly more common). We plan on catching all of these errors and then propagating them to where they can be handled using LLVM’s error types. For example, when we fail to allocate memory, fail to map memory, or detect a segmentation fault in the child process, we will raise an error and exit with an appropriate error code.
  • Currently, llvm-exegesis does support memory operands through the scratch space passed through a platform/architecture dependent register. We will maintain this functionality using the new memory infrastructure, having default annotations that will match the current behavior.
  • In order to properly support mapping pages at specific addresses, we need to ensure those addresses are available within the virtual memory space. We plan on achieving this by moving the execution of snippets into a child process. This increases the complexity of llvm-exegesis, but makes this work possible and also allows for better error handling, as a crash in the snippet is isolated to the child process. We plan on using the ptrace system call to control the child process. This does not change the portability of llvm-exegesis: it is already specific to Linux due to its reliance on the perf subsystem, and we are only using a few additional Linux system calls, so this adds no new dependencies or platform restrictions.
  • We plan on building an abstraction similar to CrashRecoveryContext for working with child processes so that we can handle errors more effectively when they happen and provide better crash info. Within this abstraction, we will also be able to add portability to other platforms that don’t use ptrace, such as macOS and Windows, in the future.
  • All of the implementation for the memory annotations will happen within the existing BenchmarkRunner infrastructure, abstracting away the memory setup so that all BenchmarkRunners can benefit from the same setup.
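The crash-isolation benefit of running snippets in a child process can be sketched in a few lines. This is an illustrative Python sketch of the fork-and-wait pattern (the real implementation is C++ and additionally uses ptrace to control the child); the function names are hypothetical:

```python
import os
import signal

def run_snippet_in_child(snippet):
    """Run a possibly-crashing snippet in a forked child; the parent survives."""
    pid = os.fork()
    if pid == 0:
        snippet()      # may crash, but only takes down the child
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return f"child killed by signal {os.WTERMSIG(status)}"
    return f"child exited with code {os.WEXITSTATUS(status)}"

def crashing_snippet():
    # Stand-in for a snippet performing a bad memory access.
    os.kill(os.getpid(), signal.SIGSEGV)

result = run_snippet_in_child(crashing_snippet)
print(result)
```

The parent can turn the child's exit status or fatal signal into an llvm::Error instead of crashing the whole tool.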

Implementation Plans

We plan on performing the implementation of this in phases, separating different changes logically into individual patches.

  1. Change llvm-exegesis to execute snippets in a subprocess - We will work on this first, as it provides a benefit to llvm-exegesis independent of the memory annotations: exceptions during the execution of a snippet can be handled appropriately rather than crashing the whole tool.
  2. Add memory annotations, starting off with the LLVM-EXEGESIS-MEM-DEF annotation and the LLVM-EXEGESIS-MEM-MAP annotation. Afterwards, we will work on adding the LLVM-EXEGESIS-SNIPPET-ADDRESS annotation.

CC: @ondrasej @legrosbuffle @gchatelet

Thanks for working on this, Aiden!

Having annotations to allocate memory will make the tool a lot more useful when working with arbitrary basic blocks, and it would also make it easier to support instructions accessing memory (e.g., instructions that access a fixed address).

So IIUC you plan on creating a child process and allocating a “sufficiently large” part of virtual memory so you can select a particular address? Won’t you run into ulimit issues?

I don’t quite understand why you need to craft the address itself. Can’t you use semantics akin to mmap, where you allocate named pages and map them to several virtual addresses? Then you assign the OS-issued address to a specific register (LLVM-EXEGESIS-MEM-MAP <name> <register>).

In the same vein, how do you plan to craft the address of the snippet (i.e. LLVM-EXEGESIS-SNIPPET-ADDRESS)? It seems to me that you’re always going to be limited in which addresses you can select, so maybe it’s not the right approach.

Apart from this specific point, I acknowledge that this is a valuable extension to llvm-exegesis.

So IIUC you plan on creating a child process and allocating a “sufficiently large” part of virtual memory so you can select a particular address? Won’t you run into ulimit issues?

We’re not planning on creating a single large block of memory. We’re planning on using mmap within the child process to map only the sections at the addresses requested by LLVM-EXEGESIS-MEM-MAP, so we shouldn’t run into any ulimit issues.
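Mapping only the requested pages looks roughly like the following Python/ctypes sketch (Linux x86-64; the constants and the example address are assumptions for illustration, not llvm-exegesis code). Each MEM-MAP annotation turns into one small mmap call with the requested address as the hint:

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20   # x86 Linux values
MAP_FAILED = 2**64 - 1                    # (void *)-1 seen through c_void_p
PAGE_SIZE = 4096

requested = 0x2_0000_0000                 # hypothetical address from a MEM-MAP annotation
addr = libc.mmap(requested, PAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
if addr == MAP_FAILED:
    print("mmap failed")
elif addr == requested:
    print("hint honored:", hex(addr))
else:
    print("hint ignored, kernel chose", hex(addr))
```

Only one page is committed per mapping, so address-space limits are not an issue; whether the hint is honored is discussed further below in the thread.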

I don’t quite understand why you need to craft the address itself. Can’t you use semantics akin to mmap, where you allocate named pages and map them to several virtual addresses? Then you assign the OS-issued address to a specific register (LLVM-EXEGESIS-MEM-MAP <name> <register>).

Just passing the address of a block created by mmap doesn’t give us the flexibility we want. For example, some snippets use multiple levels of indirection, which means we couldn’t just pass addresses from mmap into a register and expect the snippet to run properly.

In the same vein, how do you plan to craft the address of the snippet (i.e. LLVM-EXEGESIS-SNIPPET-ADDRESS)? It seems to me that you’re always going to be limited in which addresses you can select, so maybe it’s not the right approach.

To implement this, I’m planning on memcpying some setup code and the test code to the correct offset (and then unmapping the old code after a jump). This might require an assembly setup/harness so that we have direct control over how everything is placed in memory - something I would prefer to avoid if possible, but it will probably be necessary, since C/C++ doesn’t give the necessary control as far as I’m aware.

We shouldn’t be limited in addresses. Within the child process, we’re planning on using munmap to clear everything other than the requested memory mappings, snippet code, and a harness. This should leave almost the entire address space open.

I still don’t understand how you can select the address. AFAIU mmap returns an address allocated by the OS but you don’t get to choose it.

Could you provide a proof of concept - maybe in the form of a github repo - so I can better understand what you have in mind?

mmap takes an address as its first argument, which is given to the OS as a hint on where to place the mapping. The mapping ends up there if nothing is already mapped at that address (which is why we need a subprocess in which we can munmap the rest of the memory); the kernel also rounds the hint onto a page boundary. Normally this argument is just set to NULL, which gives the kernel full control over the placement.

I don’t have a full implementation available currently, but I should have one soon and will publish it here once I have it complete.

OSes are free to ignore the hint (for non-MAP_FIXED) even if nothing is mapped there. You should not rely on it being honoured.

Thanks for bringing this up! I believe Linux does try to honor the hint if there is no other mapping there, but you’re definitely correct in saying that we shouldn’t rely on this. We’re planning on passing the MAP_FIXED_NOREPLACE (or just MAP_FIXED depending upon what platform compatibility is needed) flag to mmap to ensure that the hint does get honored and handling all failure cases as appropriate.
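The useful property of MAP_FIXED_NOREPLACE (Linux >= 4.17) is that it turns the hint into an all-or-nothing request: the mapping either lands exactly at the requested address or the call fails (with EEXIST when the range is already occupied). A small Python/ctypes sketch of that invariant, with an arbitrarily chosen example address:

```python
import ctypes
import errno

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
MAP_FIXED_NOREPLACE = 0x100000    # x86 Linux value; ignored by kernels < 4.17
MAP_FAILED = 2**64 - 1

requested = 0x3_0000_0000         # arbitrary illustrative address
addr = libc.mmap(requested, 4096, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0)
# Either we got exactly the address we asked for, or nothing at all:
if addr == MAP_FAILED:
    print("occupied, errno =", errno.errorcode.get(ctypes.get_errno()))
else:
    print("mapped at", hex(addr))
```

The failure case is exactly what the error-handling plan above needs to report back for an unsatisfiable LLVM-EXEGESIS-MEM-MAP annotation.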

@boomanaiden154-1 I should have RTFM, I completely forgot about the hint (I usually use NULL). That said, as @jrtc27 mentioned there is no guarantee that the hint will be honored. Let’s bring up a POC and check how often it fails in reality. Depending on the outcome we may need to change the design.

I envision something like requesting XX random addresses and verifying that they all end up being honored.

All good. I can’t imagine that it’s common to set the addr argument when calling mmap. I wrote up a test script/data processing script that munmaps all the memory except for the program itself and then loops through the entire address space in 2MiB increments (I also did some testing with 4096-byte increments), recording the return value of mmap whenever it isn’t negative (i.e., whenever no error occurred). The only gaps I’m seeing in the address space are around the addresses where the program is loaded (where mmap called with MAP_FIXED_NOREPLACE will fail).

So it seems like this technique works reasonably well.
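A much smaller version of that probing experiment fits in a few lines. This Python/ctypes sketch tries a handful of 2MiB-spaced addresses in an arbitrarily chosen high region and counts how many are honored under MAP_FIXED_NOREPLACE (the base address and iteration count are illustrative, not the actual experiment):

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

PROT_RW = 0x1 | 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
MAP_FIXED_NOREPLACE = 0x100000    # x86 Linux value
STEP = 2 * 1024 * 1024            # 2 MiB increments, as in the experiment

hits = 0
base = 0x4_0000_0000              # illustrative starting region
for i in range(16):
    requested = base + i * STEP
    addr = libc.mmap(requested, 4096, PROT_RW,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0)
    if addr == requested:
        hits += 1
        libc.munmap(addr, 4096)   # release it again before probing the next
print(f"{hits}/16 addresses honored")
```

In an otherwise empty region, essentially every probe succeeds, which matches the observation that the only gaps are where the program itself is loaded.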

Some additional implementation details/plans now that the implementation is fleshed out, with just cleanup and minor fixes left to do. We’re planning on going through the following steps in the process setup:

  1. Set up auxiliary memory that the child process can use to keep track of certain types of data. Currently, we plan on using this auxiliary memory to pass in register values for LLVM-EXEGESIS-DEFREG annotations, as many registers get clobbered by the new setup code at the beginning of the snippet execution. We also use it to pass the file descriptors of shared memory sections that we later mmap into the child process.
  2. Create shared memory mappings based on the LLVM-EXEGESIS-MEM-DEF annotations, and fill them with the requested value.

After this setup, we call fork() to create the child process. There’s one minor detail to handle before the subprocess starts executing: we create the performance counter in the parent process, but tied to the child process’s PID. The file descriptor for the performance counter is then passed to the child process, where it can make the appropriate ioctl calls to start/stop the counter. This lets us avoid the complexity of passing perf counter values from the child back to the parent, particularly in the LBR case.

Then the child process begins executing:

  1. We unregister the default signal handlers.
  2. Then we grab the file descriptor for the perf counter as mentioned above.
  3. mmap a region of memory for the assembled snippet/setup code.
  4. Prepare the auxiliary memory by opening all of the shared memory sections in the child process and putting the file descriptors into the auxiliary memory.
  5. Then we call into the assembled setup code/snippet. We exit from there.

Within the snippet, we do a couple of things:

  1. Move argument registers passed in from the original function call that invoked the snippet into callee-saved registers that we don’t touch until we need the values.
  2. Unmap the address space below the snippet.
  3. Unmap the address space above the snippet.
  4. Map the auxiliary memory at a static address (currently one page below the ceiling of the virtual user-mode address space).
  5. Create the memory mappings specified by the annotations.
  6. Execute the snippet code.
  7. Call exit from within the snippet.

After the snippet exits, the main process checks that the child exited successfully (with appropriate error handling if it didn’t), and the normal flow continues.

We’ve found so far that the best abstraction for a lot of the platform-dependent system calls is a new FunctionExecutorImpl, rather than the original plan of an abstraction similar to CrashRecoveryContext.

We’ve implemented all of the target dependent code emission (for calling system calls within the generated snippet/assembly setup code) within the relevant ExegesisTarget implementations (currently only x86). I’m planning on only adding the new setup code if necessary (i.e., memory annotations are specified) so that we can gradually bring up other architectures with this new technique. On architectures not supporting memory annotations, the current scratch space technique will be used.

I’ve also posted Clarification on platform support for llvm-exegesis to figure out what platforms we need to support so we don’t unintentionally break anyone else’s workflows (mostly guiding the decision on whether or not we need to maintain an in-process FunctionExecutorImpl that can be used on other platforms).

I’m hoping to be able to start submitting patches for review soon. Probably towards the end of this week or next week.

Patches for (most of) this RFC are now up on Phabricator:

  1. D151019 [llvm-exegesis] Refactor FunctionExecutorImpl and create factory
  2. D151020 [llvm-exegesis] Add ability to assign perf counters to specific PID
  3. D151021 [llvm-exegesis] Introduce Subprocess Executor Mode
  4. D151022 [llvm-exegesis] Introduce SubprocessMemory Utility Class
  5. D151023 [llvm-exegesis] Add Target Memory Utility Functions
  6. D151024 [llvm-exegesis] Add memory annotation parsing
  7. D151025 [llvm-exegesis] Add support for using memory annotations