Understanding the impact of instruction scheduling for modern desktop CPUs

Hi everyone.

I’m trying to understand the impact of LLVM’s scheduling models on the performance of the generated code. I chose zstd for the benchmark. I can clearly see a 1.7x speedup when adding -O3 -march=znver1. However, I see no significant impact on runtime when butchering the scheduling algorithms. Here’s what I’ve tried so far:

  1. Disabling MachineSchedulerPass and the pre-scheduling pass in CodeGenPassBuilder.h
  2. Butchering X86ScheduleZnver1.td
  3. Passing -mllvm -fast-isel to clang.

The last option didn’t even compile, and the first two did not produce any meaningful difference in the benchmark.
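For reference, the first two experiments can also be approximated from the command line without patching LLVM’s source, via internal cl::opt switches. This is a sketch only: `enable-misched` and `enable-post-misched` are internal LLVM options that may change between versions, and `zstd_bench.c` stands in for whatever you actually compile.

```shell
# Baseline build (what produced the 1.7x speedup):
clang -O3 -march=znver1 -o zstd_bench zstd_bench.c

# Sketch: disable the pre-RA and post-RA machine schedulers via
# LLVM's internal flags instead of editing CodeGenPassBuilder.h.
clang -O3 -march=znver1 \
  -mllvm -enable-misched=false \
  -mllvm -enable-post-misched=false \
  -o zstd_bench_nosched zstd_bench.c
```

Comparing the two binaries' runtimes should isolate the machine scheduler's contribution, assuming the flags are honored by your LLVM version.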

I have two theories on why that failed:

  1. My benchmark is not a good fit for this kind of experiment. If that is the case, does anyone know of workloads where scheduler misbehavior leads to a noticeable performance impact?
  2. I did a bad job of screwing up the scheduler. If that is the case, can anyone point me in the right direction?

You failed to mention a target architecture/implementation::

However:

A lot of work over the last 30 years has gone into making the HW put up with code
scheduled in any reasonable way. {Reservation Stations, Dispatch Stacks, Scoreboards, and
more} all let the HW perform its own scheduling based on actual dependencies, not
just real-or-imagined ones. HW is good at scheduling based on register and memory
dependencies, and even better based on flow dependencies.

You also failed to mention the kinds of data being processed::

General purpose instructions are a lot easier to schedule in HW than SIMD or Vector instructions.

I need to update my previous post::

Memory reference instructions may want to be scheduled along with instructions which
have significant latency in calculation (FDIV) and/or are not fully pipelined.
With delay slots falling out of favor, scheduling into the delay slot has fallen similarly.

Memory reference instructions have {2,3,4,5} cycles of latency and are fully pipelined.
Integer multiply may have a handful of cycles of latency, often fully-pipelined, often not.
Almost all FP calculations have significant latency {3,4,5} cycles, but these are typically fully pipelined.

This leaves us with:
IDIV, FDIV, FSQRT as the not-fully-pipelined high latency instructions.
IMUL, FPnormal, LD as the fully pipelined moderate latency instructions.
And basically everything else in RISC-land has latency = 1 or is not used often enough to cause worry.

Narrow issue machines are more sensitive to instruction scheduling than wide issue machines.

Wide issue machines have the resources needed to perform scheduling for themselves.

@Mitch_Alsup thank you for a detailed reply. I’m running my tests on a 2nd-gen AMD Threadripper CPU, and targeting -march=znver1 since I don’t see a separate target for Zen+.
As for the kinds of data being processed, I’ve been compressing a large random file with zstd. Not the best test for a compression algorithm, but I presume this does not matter when benchmarking CPU scheduling. Zstd does not seem to use many vector instructions on its own, but it does seem to benefit from autovectorization.
So, do I understand correctly that for general-purpose computation it doesn’t really make sense to bother with compiler backend scheduling? That statement probably requires a definition of “general purpose”, but it seems like file compression, code generation, web browsing, and similar tasks don’t use a lot of FP computation anyway.

In general, instruction scheduling is not worthwhile for non-SIMD and non-Vector code (close to the noise level, but it occasionally peeks out).

Over in (SIMD and Vector)-land, code scheduling may peek above the noise floor more often, especially when related to LDing data into calculation and STing data after calculation. But even here the Great Big Out-of-Order processors are doing their best to let you ignore the problem.
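A minimal sketch in C of the kind of LD scheduling meant here (the function and names are illustrative, not from the thread): the next iteration's loads are hoisted above the current multiply-add, so load latency overlaps useful work. This is what a compiler scheduler might do on machine IR, and what an OoO window largely does for you automatically.

```c
/* Illustrative software-pipelined dot product: loads for iteration
 * i+1 are issued before the multiply-add for iteration i, hiding
 * load latency behind the arithmetic. */
double dot_pipelined(const double *a, const double *b, int n) {
    if (n <= 0) return 0.0;
    double sum = 0.0;
    double a0 = a[0], b0 = b[0];      /* prologue: issue first loads early */
    for (int i = 1; i < n; i++) {
        double a1 = a[i], b1 = b[i];  /* next loads overlap current math  */
        sum += a0 * b0;
        a0 = a1;
        b0 = b1;
    }
    sum += a0 * b0;                   /* epilogue: last pending element   */
    return sum;
}
```

On an in-order or narrow-issue core this transformation can matter; on a big OoO core the hardware reorders the loads itself and the source-level shuffle is mostly noise.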

Remember, the modern processors are using execution windows of size=200-odd for x86 and up to 600-odd for Apple M1. There is a lot of scheduling that happens in windows of this size.

@Mitch_Alsup thank you very much for such insightful comments!