[RFC] Improved AArch32 FP arithmetic from Arm Optimized Routines

,

We’ve recently open-sourced a set of optimized AArch32 functions for floating-point arithmetic, for 32-bit Arm processors without hardware FP. They’re available on Github in the fp subdirectory of the Arm Optimized Routines repository. These were the functions used in the libraries of the proprietary Arm Compiler toolchain, which is now at end of life.

We’d like to import these functions into compiler-rt, superseding the generic C implementations (and potentially the small number of handwritten asm ones too), because we expect them to run significantly faster in general. I haven’t benchmarked all of them, but I’ve spot-checked a few: a rough estimate of the overall improvement is between 1.5× and 2×.

(Of course if we find out during the process that one of the functions isn’t faster after all, we’ll skip that one.)

This is a big enough exercise that it seemed worth asking for opinions first.

Is this welcome at all?

Does anyone have objections to importing these functions at all? We’d prefer to get overall ‘concept approval’ for the project before making a lot of PRs (although we do have a sample draft PR linked below).

Possible objections I can think of:

Licensing. The functions in AOR are dual-licensed between MIT and LLVM’s license, on purpose, so they’re license-compatible. AOR code has been imported into LLVM before.

Code size. The functions are fast, but sometimes at the cost of some space, because of aggressive versioning (separate into multiple cases and optimize each one separately). We never considered the code size excessive in the context of Arm Compiler, but of course people have different opinions. The factor varies between functions, but as a rough average figure, the total size of our implementations of +−×÷ in both single and double precision is about 1.2× the size of the corresponding functions in compiler-rt.

Maintenance cost. Of course one C implementation is less work to maintain than that and an assembler version, especially if changes need to be made by hand in assembler that you’d get for free in C, like support for codegen-affecting CPU features (e.g. PAC and BTI). We think the added performance is worth it, and since these are leaf functions which don’t store arrays on the stack, not all of those codegen-affecting CPU features will be needed anyway (PAC and BTI in particular).

Detailed strategy?

Supposing we do import these functions, how would people like it done?

Is it better to put AOR functions unmodified in their own subdirectory, so that it’s easy to import updated versions from AOR later without hand-merging, and then find a way to tie that into the compiler-rt/lib/builtins build system? (We don’t currently have plans to update the AOR versions constantly, but you never know.)

Or is it better to convert the functions into the existing compiler-rt assembly style, leaving little or no trace of their external origin? (For example, using macros like DEFINE_COMPILERRT_FUNCTION and LOCAL_LABEL.)

A draft pull request for one function, using the second approach of converting the code into local style, is already up for review. But I’ll abandon that and do it another way if people prefer.

In case people are concerned about code size, should we add a convenient cmake option to select the AOR routines or the (mostly C) existing ones? Or is it OK to just unconditionally replace the existing code?

Does anyone in particular want to be involved in reviewing the patches?

5 Likes

As maintainer of TinyGo (which uses compiler-rt), I’d be happy to see these optimized functions even if it increases code size a little. And I’d prefer if they would integrate with the existing build system (replacing existing functions) instead of being separate and needing special handling for ARM.

1 Like

I’m happy to be a reviewer for the patches. As mentioned in the draft PR, I have been involved in the review in Arm Optimized Routines so I would have a limited amount of comments on the assembly implementation. I can add comments on compiler-rt integration.

From the Maintainers list for compiler-rt llvm-project/compiler-rt/Maintainers.md at main · llvm/llvm-project · GitHub @compnerd is the builtins maintainer. May be worth tagging in the reviews to see if there are any objections.

I’m supportive of this proposal and my preference would be the second approach, that is to match the existing compiler-rt style. We tried the first approach for libc and I don’t think it has worked out well in practice.

The additional benefit of C implementation—in addition to smaller size—is also the possibility of using LTO to inline the implementation of these routines. Whether that would result in any meaningful benefits is not yet clear, but it’s a direction our team is actively exploring, so I’d prefer keeping the C implementations around for the time being.

Thanks for responding!

Yes, I certainly wasn’t planning to delete the C implementations. They’ll still be needed on non-Arm platforms, and some of them even on Arm. (We only have a subset of the functions in Thumb1 assembly, so the C implementations of everything else will still be used in that context.)

I just meant: should I make sure there’s a convenient cmake option to go back to the C implementations even on a platform where the optimized versions are available?

I guessed that was probably what you’d say, but I wanted to check :slight_smile:

It is generally preferable to use C to keep the code portable, but if these improve performance, I think that this is valuable enough to do that. It might take a little white just to go through the code and get it merged, but I would be in favour of getting it integrated so that everyone can benefit from this work without having to integrate it manually. These functions are pretty low level and we do need to be aware of ABI, but I think all of this is purely technical minutiae that we can work through. In general, I think that we want to have a simple option of ..._ENABLE_ASM that can be set to NO to flip back to the C implementations, though we want the default behaviour to use the optimised variant.

In short, no objections, happy to work with you to get this integrated :slight_smile: