We’ve recently open-sourced a set of optimized AArch32 functions for floating-point arithmetic, for 32-bit Arm processors without hardware FP. They’re available on GitHub in the fp subdirectory of the Arm Optimized Routines repository. These were the functions used in the libraries of the proprietary Arm Compiler toolchain, which is now at end of life.
We’d like to import these functions into compiler-rt, superseding the generic C implementations (and potentially the small number of handwritten asm ones too), because we expect them to run significantly faster in general. I haven’t benchmarked all of them, but I’ve spot-checked a few: a rough estimate of the overall improvement is between 1.5× and 2×.
(Of course if we find out during the process that one of the functions isn’t faster after all, we’ll skip that one.)
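For anyone who wants to reproduce this kind of spot check, a minimal timing harness might look like the sketch below. It is not the harness behind the estimate above: the names `binop_f32`, `add_f32` and `time_binop` are made up for illustration, and it assumes a hosted environment where `clock()` is available.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical type for the operation being timed. */
typedef float (*binop_f32)(float, float);

/* Example operation. On a soft-float AArch32 target the compiler lowers this
   addition to a runtime library call, so timing it times the library. */
static float add_f32(float a, float b) { return a + b; }

/* Applies `op` to every pair (a[i], b[i]), `iterations` times over, and
   returns the CPU time used in seconds. The last result is written to *sink. */
static double time_binop(binop_f32 op, const float *a, const float *b,
                         size_t n, unsigned iterations, float *sink)
{
    assert(op != NULL && a != NULL && b != NULL && sink != NULL && n > 0);

    /* volatile: every result must be stored, so the loop cannot be removed. */
    volatile float acc = 0.0f;

    clock_t start = clock();
    for (unsigned it = 0; it < iterations; it++)
        for (size_t i = 0; i < n; i++)
            acc = op(a[i], b[i]);
    clock_t end = clock();

    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Linking the same harness against two builds of the runtime library and comparing the returned times gives a rough speed ratio. The input arrays should cover a mix of ordinary and special values, since the fast and slow paths can differ a lot.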
This is a big enough exercise that it seemed worth asking for opinions first.
Is this welcome at all?
Does anyone have objections to importing these functions at all? We’d prefer to get overall ‘concept approval’ for the project before making a lot of PRs (although we do have a sample draft PR linked below).
Possible objections I can think of:
Licensing. The functions in AOR are dual-licensed between MIT and LLVM’s license, on purpose, so they’re license-compatible. AOR code has been imported into LLVM before.
Code size. The functions are fast, but sometimes at the cost of some space, because of aggressive versioning (splitting a function into multiple cases and optimizing each one separately). We never considered the code size excessive in the context of Arm Compiler, but of course people have different opinions. The factor varies between functions, but as a rough average figure, the total size of our implementations of +−×÷ in both single and double precision is about 1.2× the size of the corresponding functions in compiler-rt.
Maintenance cost. Of course a single C implementation is less work to maintain than a C implementation plus an assembler version, especially if changes that you’d get for free in C have to be made by hand in the assembler, like support for codegen-affecting CPU features (e.g. PAC and BTI). We think the added performance is worth it, and since these are leaf functions which don’t store arrays on the stack, not all of those codegen-affecting CPU features will be needed anyway (PAC and BTI in particular).
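To make the code-size point concrete, here is a generic C sketch of what splitting into cases means for a single-precision routine. It is an illustration only, not taken from the AOR sources, which are handwritten assembly and may divide the cases differently. The names `f32_bits`, `f32_is_special` and `f32_choose_path` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "this sketch expects a 32-bit IEEE 754 float");

/* Reinterprets a float as its raw bit pattern without aliasing problems. */
static uint32_t f32_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Nonzero if the 8-bit exponent field is 0 (zero or subnormal) or 0xFF
   (infinity or NaN). Subtracting 1 wraps an exponent of 0 round to
   0xFFFFFFFF, so one unsigned comparison catches both ends of the range. */
static int f32_is_special(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    return (exponent - 1u) >= 0xFEu;
}

enum f32_path { F32_FAST_PATH, F32_SLOW_PATH };

/* Picks the case for a binary operation: the fast path may assume both
   inputs are normal numbers, and everything else goes to the slow path. */
static enum f32_path f32_choose_path(float a, float b)
{
    if (f32_is_special(f32_bits(a)) || f32_is_special(f32_bits(b)))
        return F32_SLOW_PATH;
    return F32_FAST_PATH;
}
```

Each path then gets its own specialized code with no further checks in it. That is where the extra space goes: the common case runs fewer instructions, but the routine as a whole contains more of them.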
Detailed strategy?
Supposing we do import these functions, how would people like it done?
Is it better to put AOR functions unmodified in their own subdirectory, so that it’s easy to import updated versions from AOR later without hand-merging, and then find a way to tie that into the compiler-rt/lib/builtins build system? (We don’t currently have plans to update the AOR versions constantly, but you never know.)
Or is it better to convert the functions into the existing compiler-rt assembly style, leaving little or no trace of their external origin? (For example, using macros like DEFINE_COMPILERRT_FUNCTION and LOCAL_LABEL.)
A draft pull request for one function, using the second approach of converting the code into local style, is already up for review. But I’ll abandon that and do it another way if people prefer.
In case people are concerned about code size, should we add a convenient cmake option to select the AOR routines or the (mostly C) existing ones? Or is it OK to just unconditionally replace the existing code?
Does anyone in particular want to be involved in reviewing the patches?