Memory Subsystem Representation

For a while now we (Cray) have had some very primitive cache structure
information encoded into our version of LLVM. Given the more complex
memory structures introduced by Bulldozer and various accelerators, it's
time to do this Right (tm).

So I'm looking for some feedback on a proposed design.

The goal of this work is to provide Passes with useful information such
as cache sizes, resource sharing arrangements, etc. so that they may do
transformations to improve memory system performance.

Here's what I'm thinking this might look like:

- Add two new structures to the TargetMachine class: TargetMemoryInfo
  and TargetExecutionEngineInfo.

- TargetMemoryInfo will initially contain cache hierarchy information.
  It will contain a list of CacheLevelInfo objects, each of which will
  specify at least the total size of the cache at that level. It may
  also include other useful bits like associativity, inclusivity, etc.

- TargetMemoryInfo could be extended with information about various
  "special" memory regions such as local, shared, etc. memory typical on
  accelerators. This should tie into the address space mechanism
  somehow.

- TargetExecutionEngineInfo (probably needs a better name) will contain a
  list of ExecutionResourceInfo objects, such as threads, cores,
  modules, sockets, etc. For example, for a Bulldozer-based system, we
  would have a set of cores contained in a module, a set of modules
  contained in a socket and so on.

- Each ExecutionResourceInfo object would contain a name to identify the
  grouping ("thread," "core," etc.) along with information about the
  number of execution resources it contains. For example, a "core"
  object might specify that it contains two "threads."

- ExecutionResourceInfo objects would also contain links to
  CacheLevelInfo objects to model how the various levels of cache are
  shared. For example, on a Bulldozer system the "core" object would
  have a link to the L1 CacheLevelInfo object, indicating that L1 is
  private to a "core." A "module" object would have a link to the L2
  CacheLevelInfo object, indicating that it is private to a "module" but
  shared by "cores" within the "module" and so on.

I don't particularly like the names TargetExecutionEngineInfo and
ExecutionResourceInfo but couldn't come up with anything better. Any
ideas?

Does this seem like a reasonable approach?

                              -Dave

The names and the exact information stored don't seem like they really
need review; it's easy to change later. Just two questions:

1. What is the expected use? Are we talking about loop optimizations here?
2. IR-level passes don't have access to a TargetMachine; is that okay?

-Eli

Hi,

The names and the exact information stored don't seem like they really
need review; it's easy to change later. Just two questions:

1. What is the expected use? Are we talking about loop optimizations here?
2. IR-level passes don't have access to a TargetMachine; is that okay?

I think this can be implemented as an immutable pass, just like TargetData.

best regards
ether

Eli Friedman <eli.friedman@gmail.com> writes:

1. What is the expected use? Are we talking about loop optimizations here?

Initially, anything that is interested in cache configuration would find
this useful. This might include:

- cache blocking
- prefetching
- reuse analysis

I think mostly it would be loop-level stuff (that's where the time is
spent, after all) but I also know of various papers that do IPO
cache-related analysis and transformation.

2. IR-level passes don't have access to a TargetMachine; is that okay?

I thought about that too. I don't know any better place to put it
because this is very (sub)target-specific stuff. I think in the future
we may want to consider a generic interface for IR-level passes to query
some target-specific parameters that are generally useful. Cache
structure would be one.

Our (Cray) current uses are all in Machine-level passes but that's
because most of our analysis and transformation is done outside LLVM.

Mostly I'm concerned about getting the abstraction right. Or at least
reasonable. :-)

                              -Dave

Hi Dave,

Can you describe which passes may benefit from this information? My intuition is that until there are a number of passes which require this information, there are other ways to provide it. One way would be to use Metadata.

Having said that, I do share the feeling that IR-level optimizations often need more target-specific information. For example, vectorizing compilers need to know which instruction sets the target has, etc. To this end, we have implemented a new 'instcombine-like' pass with optimizations that should have gone into 'instcombine' had we had more information about the target.

Nadav

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Can you describe which passes may benefit from this information? My
intuition is that until there are a number of passes which require
this information, there are other ways to provide it. One way would
be to use Metadata.

We have Cray-specific passes that use this information. Some of the
stuff Polly is doing almost certainly would benefit.

Metadata seems a very clunky way to do this. It is so target-specific
that it would render IR files completely target-dependent. These are
rather complex structures we're talking about. Encoding it in metadata
would be inconvenient.

Having said that, I do share the feeling that IR-level optimizations
often need more target-specific information. For example, vectorizing
compilers need to know which instruction sets the target has, etc.

Yep, absolutely.

To this end, we have implemented a new 'instcombine-like' pass which
has optimizations which should have gone into 'instcombine' had we had
more information about the target.

Right. Exposing some target attributes via generic queries (e.g. what's
the max vector length for this scalar type?) has been on my wishlist for
a while now.

                            -Dave

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Can you describe which passes may benefit from this information? My
intuition is that until there are a number of passes which require
this information, there are other ways to provide it. One way would
be to use Metadata.

We have Cray-specific passes that use this information. Some of the
stuff Polly is doing almost certainly would benefit.

This sounds like an interesting addition, but we don't just speculatively add analysis passes to LLVM. Doing so leads to overdesign and lack of purpose. I'd suggest designing and implementing a specific loop optimization pass, hard coding it to a specific target, then generalizing it by adding a way to get target parameters. This is how LSR was built, for example, which drove the address mode selection stuff.

Metadata seems a very clunky way to do this. It is so target-specific
that it would render IR files completely target-dependent. These are
rather complex structures we're talking about. Encoding it in metadata
would be inconvenient.

I agree that metadata isn't the right way to go, but this argument doesn't fly at all with me. The whole point of these passes is to do target-specific optimizations.

The right way to expose this sort of thing is with the new TargetRegistry interfaces. That said, speculatively adding target hooks isn't the right way to go, the client should come first.

-Chris

Chris Lattner <clattner@apple.com> writes:

"Rotem, Nadav" <nadav.rotem@intel.com> writes:

Can you describe which passes may benefit from this information? My
intuition is that until there are a number of passes which require
this information, there are other ways to provide it. One way would
be to use Metadata.

We have Cray-specific passes that use this information. Some of the
stuff Polly is doing almost certainly would benefit.

This sounds like an interesting addition, but we don't just
speculatively add analysis passes to LLVM.

Well, we do have such an analysis pass. I simply can't make it public.
My plan was to provide the information we use today and leave
enhancements to others as they are needed.

Doing so leads to overdesign and lack of purpose. I'd suggest
designing and implementing a specific loop optimization pass, hard
coding it to a specific target, then generalize it by adding a way to
get target parameters. This is how LSR was built, for example, which
drove the address mode selection stuff.

Unfortunately, I don't have the bandwidth to design a whole new loop
pass. If someone has the time and interest I would be very happy to
work with those people.

Metadata seems a very clunky way to do this. It is so target-specific
that it would render IR files completely target-dependent. These are
rather complex structures we're talking about. Encoding it in metadata
would be inconvenient.

I agree that metadata isn't the right way to go, but this argument
doesn't fly at all with me. The whole point of these passes is to do
target-specific optimizations.

Yes, but one may want to take a single IR file and target multiple
machines from it.

The right way to expose this sort of thing is with the new
TargetRegistry interfaces. That said, speculatively adding target
hooks isn't the right way to go, the client should come first.

Ok, I will look at TargetRegistry. And again, it isn't speculative,
it's just not public.

                              -Dave

The point of having a client is so that it will drive the design of this pass. Coupled with peer review on llvm-commits, this ensures that we build things at the right level of abstraction etc. Skipping parts of the process (peer review, driving implementation based on need, etc) doesn't make sense.

-Chris
