[ESI (ostensibly)] On memory services/streams

Memory accesses – being a very common and generally poorly optimized (IMO) construct – bear special consideration in ESI (I think) and more generally. I’m writing down my thoughts on this to spur discussion about this topic and re-frame the problem. Also, I promised @mortbopet a while back I’d write my thoughts on the topic. The recent post about a stream dialect prompted me to write this up. (And I’m procrastinating on some paper reviews.)

These are just initial thoughts. They refer to things in ESI which are not implemented or even fully developed ideas.

General/random accesses

If a kernel needs actually needs it, one can imagine providing an ESI Service (#2811) to support arbitrary memory request-based accesses. I won’t spell this out, but it would involve read and write ports which accept arbitrary addresses or perhaps a restricted subset. It could be backed by a cache which could – in turn – be customized to the application.

Alternatively, an Any typed ESI MMIO region could be used. MMIO regions will probably lower to ESI services once we start working on them, so same difference.

This is the standard way of doing things on a SoC.

Structured memory streams

The usual request/response mechanism adds latency and wastes resources in some (most, I suspect) cases. The full generality of memory isn’t necessary for some (most, again I suspect) accelerators! If one can figure out a memory access pattern for an accelerator, one can model that instead and perhaps synthesize hardware with an implicit request (address) structure. The accelerator – in the ideal case – wouldn’t need to even have knowledge of addresses!

Take the simple case of a data stream. The backing memory must be loaded with the data stream, so we need a write stream port. It would probably have the ESI list data type (not supported yet) to support arbitrary-length data as a single message or an array data type (with data windows). Since we also need to read it, it’d need at least one read port, which would also be the list/array type.

If we only need to write-once and read-once, this could be modeled as a simple ESI connection with a memory-backed buffer – the memory is implicit, so there’s no need to use an ESI memory stream.

Since we want to use the data stored in the memory multiple times, we would also need a “start stream” request port. Importantly, this port need not be driven by the stream reader. Another ‘controller’ module could be orchestrating multiple memories, ‘pushing’ data to the functional units, so all the data streams are ready when the functional units need the data (rather than sending a request and waiting for the response).

Taking that idea one step further, if a consumer knows that it consumes data at a particular stride (within a message), that could be specified when the read port is created. (Allowing the potential opportunity to discard some data on write if it’s never going to be read.) This ‘stride’ could be modeled as only a certain field in a struct assuming the message is a list/array of structs. (This could be modeled as an ESI ‘blind’ which could be applied to any ESI channel.)

One could expand this “start stream” request to include a “slot” tag to fetch a particular ESI message which was stored in that memory. Having memories structured as “message buffers” has the additional benefit of allowing memory layout abstraction – the implementation gets to choose (though it would probably have to “advertise” its capabilities). If an implementation wants to dynamically manage memory in on-chip caches to off-chip memory, it can do that. If a memory implementation wants to statically partition on-chip memory and only support a certain number of messages of certain max sizes, it is free to do so. If it wants to stripe the message by struct field to optimize physical placement proximity to the consuming units which only need said “strides”, it could do that.

Data dependent loads

Some algorithms fundamentally need the ability to fetch a piece of data based on some other in-memory data. In other words, it needs an address generator with access to memory. These “pointer-chasing” workloads (e.g., sparse graph processing) could still benefit (I’d argue without much evidence) from computing the address separately from the data processing. If we model the “address generator” separately, the memory implementation could choose where it should be placed.

For instance, usually these address generators are latency sensitive so it might make sense to put them in parallel to an on-chip cache (assuming off-chip memory) so as to avoid additional cache latency. Or have a special low-latency addresses-only cache. Or instantiate a bunch in parallel for workloads with little temporal locality but massive parallelism.

Either way, by modeling the address generation separately from the data processing logic (and creating a data stream into the data processing logic), we give the compiler more flexibility to optimize the design.

This is obvious in retrospect, but the talk on Amythyst in HotChips today had affine memory blocks. The affine dialect is an obvious model here: have a memory service which operates on a set of affine loads and stores. Just send a message to it with some base pointers, some bounds and it starts feeding all the loads and servicing all the writes! How to model RAW deps is the question…

Just wanted to get this down while I was thinking about it.