`libm` --- Conformance and timing CI (for GPUs)

As we are moving to build a libm for the GPU, I would like to set up CI for it as well.
I’m interested in correctness and timing results for “all versions” we provide, e.g., the vendor ones and the builtin ones, and maybe later also other “special” ones.

Based on these results we can then set our defaults and advise users through a public webpage.
I’m asking for feedback on this in general, and the following questions in particular:

  • Should we set this up in the LLVM test suite or separately?
  • Should we also run the experiments on our CPUs, potentially contrasting them there against other libm solutions?
  • Where should we collect and publish the results?

I have some ideas for the above but I wanted to avoid biasing people.
Feel free to let me know if you think we should not do this as part of LLVM at all; in that case we might only set it up downstream.

(Tag: @lntue @sivachandra @jhuber6 @AnastasiaStulova @arsenm @Artem-B @bader @antonrydahl @EthanLuisMcDonough)

I’m in favor of having some testing for this living in-tree. As far as I understand, @lntue already maintains a differential testing suite at llvm-project/libc/test/src/math/differential_testing at main · llvm/llvm-project · GitHub. We could potentially port that to the GPU. There are likely some performance counters we could use to replace the use of the C++ standard headers for chrono. @arsenm might have a good idea there.

Beyond that, the only GPU testing I’m familiar with is the OpenCL conformance tests: OpenCL-CTS/test_conformance/math_brute_force at main · KhronosGroup/OpenCL-CTS · GitHub.

If we abstract the chrono part behind a function interface, we could probably implement it with omp_get_wtime and use direct GPU compilation, or the loaders we have, to just reuse the tests pretty much as they are.
Using perf counters would also be cool, especially just to have portable wrappers around them.
For now, time is probably sufficient.

Porting the tests should work, as the UnitTest framework is self-contained and maintained by us. We should be able to bring it to the device entirely. Depending on how many syscalls it really needs, we might go with RPC wrappers or rewrite some functionality for the GPU.

The implementation there is pretty minimal, so we could copy it over to avoid the dependency on OpenMP. However, that’s not implemented for AMDGPU. Apparently ASO has an implementation for it as follows; however, I’m pretty sure the only counter this can map to has a resolution of only 2^20 ticks, so it’s not sufficient for long tests.

```cpp
double getWTime() {
#if __gfx700__ || __gfx701__ || __gfx702__
  // GFX7: only s_memtime is available.
  uint64_t t = __builtin_amdgcn_s_memtime();
#elif __gfx1100__ || __gfx1101__ || __gfx1102__ || __gfx1103__
  // GFX11: fall back to the generic cycle-counter builtin.
  uint64_t t = __builtin_readcyclecounter();
#else
  uint64_t t = __builtin_amdgcn_s_memrealtime();
#endif
  // Convert ticks to seconds, assuming a hardcoded 745 MHz counter frequency.
  return ((double)1.0 / 745000000.0) * t;
}
```
  • Since we are talking about math functions provided by libm in particular, I don’t see any problems maintaining performance testing frameworks/tooling in the libc project. Even better would be if we are able to set up some CI like testing to catch regressions.
  • The libc project promises correctly rounded results for all rounding modes. In other words, its math implementations produce the unique, most accurate result of a floating point operation for every input and in every rounding mode. We already compare the results of the libc project’s implementations against the corresponding MPFR functions to ensure that this is indeed the case. For single precision operations we can practically do this exhaustively; the plan is to set up something practical that gives us mathematical confidence for operations on higher precision numbers as well. For implementations tuned for the GPU, we should be clear about why such comparisons would be useful, especially when GPU implementations consciously trade accuracy for other properties (like reduced branching).
  • For the current CPU focused implementations, we already publish them on the libc project’s website. We can similarly publish GPU focused information.