Wall time misleading in `--mlir-timing`

The timing reported as “wall time” in `--mlir-timing` can be confusing in the presence of multithreading. When a pass is executed across multiple threads, the per-thread timing statistics are combined by addition to produce “user time” and by taking the maximum to produce “wall time”.

Consider two threads (1 and 2) both running passes A and B over different parts of a module.

  • Thread 1: A taking 2s, B taking 1s, 3s overall
  • Thread 2: A taking 1s, B taking 2s, 3s overall

Once the parallel part of the pass manager is over, the timing statistics of the passes are combined as follows:

  • Pass A: user 2s+1s=3s, wall max(2s,1s)=2s
  • Pass B: user 1s+2s=3s, wall max(1s,2s)=2s
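To make the arithmetic concrete, here is a minimal Python sketch of the merge described above. It is not the actual MLIR implementation; `thread_timings` and `combine` are names made up for illustration, and the elapsed time assumes both threads start at the same moment and run their passes back to back.

```python
# Per-thread pass timings in seconds, taken from the example above.
thread_timings = [
    {"A": 2.0, "B": 1.0},  # thread 1
    {"A": 1.0, "B": 2.0},  # thread 2
]


def combine(thread_timings):
    """Merge per-thread timings: sum gives "user", max gives "wall"."""
    report = {}
    for timings in thread_timings:
        for name, seconds in timings.items():
            user, wall = report.get(name, (0.0, 0.0))
            report[name] = (user + seconds, max(wall, seconds))
    return report


report = combine(thread_timings)

# What a reader gets by adding up the "wall" column of the report.
reported_wall_total = sum(wall for _, wall in report.values())

# What a clock on the wall sees: the slowest thread (assumes the threads
# start together and run their passes without gaps).
elapsed = max(sum(timings.values()) for timings in thread_timings)

for name, (user, wall) in report.items():
    print(f"Pass {name}: user {user}s, wall {wall}s")
print(f"sum of reported wall times: {reported_wall_total}s")
print(f"elapsed time of the parallel region: {elapsed}s")
```

The two totals printed at the end are the ones that disagree: the "wall" column adds up to more than the parallel region actually took.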

If you timed this from the outside, you would see the pass manager terminate after 3s. However, the report would show passes A and B taking 2s of wall time each, hiding the fact that they partly overlapped. If you summed up the wall times in the report, you’d arrive at 4s, which is quite confusing given that `time` tells you it’s 3s.

@clattner suggested that average could be an alternative to maximum here. Another option would be to scale “wall time” so that it sums to the actual total execution time, or to rename it to something other than “wall time” so it doesn’t confuse people (maybe “sequential/parallel” instead of “user/wall”)?
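For comparison, a rough Python sketch of how the two alternatives would behave on the example numbers. The names `per_thread` and `scale_to_total` are hypothetical, and `elapsed` is the 3s measured from the outside; none of this reflects existing MLIR code.

```python
# Per-pass timings in seconds, one entry per thread (example numbers).
per_thread = {"A": [2.0, 1.0], "B": [1.0, 2.0]}
elapsed = 3.0  # assumed: total time of the parallel region seen from outside

# Current behaviour: maximum across threads.
maximum = {name: max(ts) for name, ts in per_thread.items()}

# Alternative 1: average across threads.
average = {name: sum(ts) / len(ts) for name, ts in per_thread.items()}


def scale_to_total(times, total):
    """Alternative 2: rescale per-pass times so that they sum to `total`."""
    factor = total / sum(times.values())
    return {name: seconds * factor for name, seconds in times.items()}


scaled = scale_to_total(maximum, elapsed)

print("maximum:", maximum, "sum =", sum(maximum.values()))
print("average:", average, "sum =", sum(average.values()))
print("scaled: ", scaled, "sum =", sum(scaled.values()))
```

On this example both alternatives give 1.5s per pass and add up to 3s. Note that the average only matches the elapsed time here because both threads are busy for the whole 3s; with idle threads it would come out lower, while the scaled variant sums to the elapsed time by construction.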

Calculating “wall time” as anything other than the equivalent of looking at a clock on the wall before you start and after you end (at whatever scope is being timed) is fraught with peril and should probably not be called “wall time”. So I think renaming would make sense.

One thing that might help is providing other statistics on the parallel part of the execution: total wall clock time, number of threads, amount of time threads spend stalled, etc.
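A sketch of what such a summary might compute, in Python. `parallel_stats` and its field names are hypothetical, and "stalled" is taken here to mean the part of the region's wall clock time during which a thread was not running any pass, which is only one possible definition.

```python
def parallel_stats(thread_busy_seconds, region_wall_seconds):
    """Summarize a parallel region.

    thread_busy_seconds: time each thread spent running passes.
    region_wall_seconds: clock-on-the-wall duration of the whole region.
    """
    # Assumed definition: a thread is stalled whenever it is not running a pass.
    stalled = [region_wall_seconds - busy for busy in thread_busy_seconds]
    return {
        "wall": region_wall_seconds,
        "threads": len(thread_busy_seconds),
        "stalled_per_thread": stalled,
        "stalled_total": sum(stalled),
    }


# The two-thread example: each thread is busy for 2s + 1s = 3s of a 3s region.
stats = parallel_stats([3.0, 3.0], 3.0)
print(stats)
```

In the two-thread example nothing is stalled, so the report would show 3s of wall clock time, 2 threads, and 0s of stall time, which already explains why the per-pass numbers cannot simply be added up.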

I agree that calling it something other than ‘wall time’ is important.