Adjusting Load Latencies

Hello,

I am interested in writing an analysis pass that looks at the stride
used for loads in a loop and passes that information down so that it
can be used by the instruction scheduler. The reason is that if the
load stride is greater than the cache line size, then I would expect
the load to always miss the cache, and, as a result, the scheduler
should use a much larger effective latency when scheduling the load and
its dependencies. Cache-miss metadata might also be a good supplemental
option. I can add methods to TLI that can convert the access stride
information into effective latency information, but what is the best
way to annotate the loads so that the information will be available to
the SDNodes?
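
To make the TLI part concrete, the hook I have in mind would look
something like this (everything here is hypothetical; none of these
methods exist today):

  // Hypothetical addition to TargetLowering: map a known access
  // stride to an effective load latency. getCacheLineSize() and
  // getCacheMissPenalty() are assumed target hooks, not existing
  // APIs.
  virtual unsigned getEffectiveLoadLatency(unsigned StrideInBytes,
                                           unsigned DefLatency) const {
    // A stride of a full cache line or more touches a new line on
    // every access, so model each such load as a miss.
    if (StrideInBytes >= getCacheLineSize())
      return DefLatency + getCacheMissPenalty();
    return DefLatency;
  }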

Has anyone tried something like this before?

A related issue is automatically adding prefetching to loops. The
trick here is to accurately estimate the number of cycles the loop
body will take to execute (so that you prefetch the correct amount
ahead). This information is not really available until instruction
scheduling, and so prefetch insertion cannot really be completed until
just before MC generation (the prefetch instructions themselves can be
scheduled, but their constant offset needs to be left free until
then). In addition, estimating the number of cycles also requires
relatively accurate load/store latencies, and this, in turn, requires
cache-miss latencies to be accounted for (which must then account for
the prefetches).
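
The distance computation itself is simple once those numbers exist; a
minimal sketch (the function and parameter names are made up):

  // How many iterations ahead to prefetch: far enough that the line
  // arrives by the time the loop reaches it.
  unsigned computePrefetchDistance(unsigned CacheMissLatency,
                                   unsigned LoopBodyCycles) {
    // ceil(miss latency / cycles per iteration)
    return (CacheMissLatency + LoopBodyCycles - 1) / LoopBodyCycles;
  }

All of the difficulty is hidden in producing an accurate
LoopBodyCycles.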

If anyone has thoughts on these ideas, I would like to hear them.

Thanks again,
Hal

If you annotate loads with their expected latency, the upcoming MachineScheduler will be able to use the information. In the short term (next couple months), you're free to hack the SDScheduler as well.

Although the scheduler can use the information, I don't think it can do much good with it when scheduling for mainstream targets. It would be more interesting when scheduling for an in-order machine without a hardware prefetch unit.

An acyclic instruction scheduler can schedule for L1 and L2 latency at most. But the out-of-order engine should be able to compensate for these latencies.

L2 misses within a high trip count loop will benefit greatly from stride prefetching. But regular strides should already be handled in hardware.

So my suggestions are:

1) If you have an in-order machine, your workload actually fits in L2, and you care deeply about every stall cycle, it may be useful for the scheduler to distinguish between expected L1 vs L2 latency. Try to issue multiple L1 misses in parallel or cover their latency. You can consider offsets from aligned objects in addition to induction variable strides.

2) If you have a machine without hardware prefetching, you really need to insert prefetches. This is a much bigger bang for the buck than scheduling for L1/L2 latency. To cover the latency of an L2 miss, which is what really matters, you need to prefetch many iterations ahead. Rather than trying to predict the number of cycles each iteration takes, you're better off prefetching as many iterations ahead as possible, up to the hardware's limit on outstanding loads. If the loop has a constant trip count, you can probably do something clever. Otherwise I think branch profiling could help by telling you which loops have a very high trip count. I think prefetch insertion is more closely tied to loop unrolling than instruction scheduling.
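
For what it's worth, the mechanical part of the insertion is easy at
the IR level. An untested sketch using the llvm.prefetch intrinsic,
where Builder, Load, M, Dist, and StrideInElems are all assumed to
come from your analysis:

  // Prefetch the address this load will touch 'Dist' iterations
  // ahead of the current one.
  Value *Addr = Load->getPointerOperand();
  Value *Ahead = Builder.CreateGEP(Addr,
      Builder.getInt64(Dist * StrideInElems), "pf.addr");
  Value *Arg = Builder.CreateBitCast(Ahead, Builder.getInt8PtrTy());
  Function *Pf = Intrinsic::getDeclaration(M, Intrinsic::prefetch);
  Builder.CreateCall4(Pf, Arg,
                      Builder.getInt32(0),   // rw: 0 = read
                      Builder.getInt32(3),   // locality: 3 = high
                      Builder.getInt32(1));  // cache type: 1 = data

The real work is choosing Dist and making sure the prefetched address
is safe to compute.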

-Andy


Andy,

Thank you for writing such a detailed response.

> If you annotate loads with their expected latency, the upcoming
> MachineScheduler will be able to use the information. In the short
> term (next couple months), you're free to hack the SDScheduler as
> well.

Alright, sounds good. If I add metadata to the load, can I get to it
through the Value * in the associated MachineMemOperand object?

> Although the scheduler can use the information, I don't think it can
> do much good with it when scheduling for mainstream targets. It
> would be more interesting when scheduling for an in-order machine
> without a hardware prefetch unit.
>
> An acyclic instruction scheduler can schedule for L1 and L2 latency
> at most. But the out-of-order engine should be able to compensate
> for these latencies.
>
> L2 misses within a high trip count loop will benefit greatly from
> stride prefetching. But regular strides should already be handled in
> hardware.

I agree. For the machine with which I'm working, the hardware prefetch
unit only works if you access N consecutive cache lines. Any pattern
that does not do that will need explicit prefetch instructions. Also, I
need to be careful not to prefetch too much because the request buffer
is fairly small (it handles < 10 outstanding requests).

> So my suggestions are:
>
> 1) If you have an in-order machine, your workload actually fits in
> L2, and you care deeply about every stall cycle, it may be useful
> for the scheduler to distinguish between expected L1 vs L2 latency.
> Try to issue multiple L1 misses in parallel or cover their latency.
> You can consider offsets from aligned objects in addition to
> induction variable strides.

Issuing multiple L1 misses in parallel is exactly what I would like to
do. Considering offsets from aligned objects is also a good idea.

> 2) If you have a machine without hardware prefetching, you really
> need to insert prefetches. This is a much bigger bang for the buck
> than scheduling for L1/L2 latency. To cover the latency of an L2
> miss, which is what really matters, you need to prefetch many
> iterations ahead. Rather than trying to predict the number of cycles
> each iteration takes, you're better off prefetching as many
> iterations ahead as possible, up to the hardware's limit on
> outstanding loads. If the loop has a constant trip count, you can
> probably do something clever. Otherwise I think branch profiling
> could help by telling you which loops have a very high trip count. I
> think prefetch insertion is more closely tied to loop unrolling than
> instruction scheduling.

I'm afraid that "as many iterations ahead as possible" may turn out to
be too few if I start at the very next iteration because the request
buffer is small. Nevertheless, it is certainly worth a try.
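
To put made-up numbers on it: with a ~300-cycle miss and a ~30-cycle
loop body, I would want to be about 10 iterations ahead, which already
exceeds what a sub-10-entry request buffer can keep in flight if every
iteration misses.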

Thanks again,
Hal

>> If you annotate loads with their expected latency, the upcoming
>> MachineScheduler will be able to use the information. In the short
>> term (next couple months), you're free to hack the SDScheduler as
>> well.

> Alright, sounds good. If I add metadata to the load, can I get to it
> through the Value * in the associated MachineMemOperand object?

AFAIK, yes. I certainly don't have a problem with that approach. I've
heard there's a preference for lowering information into
self-contained machine code, but referring back to IR makes a lot of
sense to me personally. If we ever want to serialize MIs, I think we
should serialize the IR with them.
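
Reading the annotation back in the scheduler would look something like
this (the metadata kind name is made up):

  // Recover the annotation through the MachineMemOperand's IR Value.
  if (const Value *V = MMO->getValue())
    if (const Instruction *I = dyn_cast<Instruction>(V))
      if (MDNode *MD = I->getMetadata("expected.load.latency"))
        // Override the itinerary's default latency for this load.
        Latency = cast<ConstantInt>(MD->getOperand(0))->getZExtValue();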

> I'm afraid that "as many iterations ahead as possible" may turn out
> to be too few if I start at the very next iteration because the
> request buffer is small. Nevertheless, it is certainly worth a try.

It sounds like it's really important for you to avoid useless
prefetches. This can be tricky. Other than that, I don't see a way
around your problem. Does it help to give your prefetches a head
start? If your loop eats cache lines faster than you can feed it,
eventually it will catch up.

-Andy