Fixed-Point Integer Operations

Working on lowering TOSA’s quantized operations has encountered some common sub-patterns related to executing quantized models, specifically fixed-point integer operations. Some common patterns that appear are saturating operations (op+clamp) and rescaling operations (mul+round+shift).

With the current set of StandardOps we can lower most of TOSA’s cases to i64 operations with multiple instructions. For instance a fixed point rescale would appear as:

// Extend to 64-bit
%in_64 = sexti %in : i32 to i64
%multiply_64 = sexti %multiply : i32 to i64
%shift_64 = sexti %shift : i32 to i64

// Multiply and downshift
%scaled = muli %in_64, %multiply_64 : i64
%result_64 = shift_right_signed %scaled, %shift : i64

// Truncate
%result = trunci %result_64 : i64 to i32

This solution works for the TOSA operations however it results in repeated appearance of the same set and depends on LLVM pattern matching composite operations. E.g. ARM has some CPU intrinsics that handle some of these cases and having specific IR to target these ops is beneficial. Furthermore, as some target platforms don’t support i64 integers but do support these fixed operations natively, having a specific lowering path could be beneficial.

The other problem is testing lowerings that use fixed-point operations is difficult to interpret test correctness as large composite operations often hide semantic meaning of the lowering. Sets of 3-4 operations are often representing over-flow safe cases when the set I/O of the set is 32-bit but the internal behavior needs to be 64-bit to guarantee correctness.

Given fixed-point operations are going to be common for Quantized models and TOSA is likely to not be the only dialect creating fixed-point operations, having a universal method for representing fixed point operations is beneficial. Would a separate fixed-point dialect make sense in this case? How would it be best to represent these types of operations to work well with CPU intrinsics and the general LLVM backend?

1 Like

A very interesting thread, @rsuderman ! Quite some time ago while prototyping TOSA to LinAlg, I recall getting stuck with how exactly to express rescale and other fixed point scaling ops, so I find this idea really interesting.

It appears the goal here is to be able to express op-carried quantization / fixed point scaling operations effectively. In other worse, such a dialect would at least enable the explicit expression of the functional pseudocode of each TOSA operator as listed in the TOSA spec (⚡ TOSA) , and perhaps any other dialect wanting to explicitly express op-carried fixed-point scaling related operations.

In the past @stellaraccident mentioned considering these in the QuantOps dialect, but that may no longer be the right place to express these ops now. Generally I think a dialect that enables the unambiguous expression of TOSA pseudocode sequences here would be beneficial.

How would such a dialect interact with LinAlg and Standard, which recently had conversions from TOSA implemented for them ?

If I were you, I would start exactly as you are – using std ops to represent the computation correctly, if not optimally as a first step. Then (possibly) in parallel, it would be really interesting to properly design a proper fxpmath dialect with primitives that concisely represent some of the high level operations that show up and are designed to lower well to instructions present on real (DSP and other) targets.

There are far better experts than myself hanging around here on the specifics of such an op set, but I’ve often used the gemmlowp fixedpoint.h library as a pretty readable set of operations and mappings to low level ISAs for the kind of math that tends to show up in this area.

So I have a patch that handles everything except the i48 types (some modifications are required but they should still be representable). Overall my method of sign-extending and doing the math in higher bit-width does address the issue however there are a few shortcomings.

  • Having minimum bitwidth represented would be nice. I put everything to 64-bit but its feasible that mixed bitwidth variants should determine the minimum required scale.
  • If lowered directly to LLVM, some platforms that emulate 64-bit operations will execute assuming full 64-bit values on the inputs, making operations like Multiply taking 2x the cycles. I am not certain LLVM will be able to pattern match / optimize these cases.

It depends on the compilation path. I would expect a fixedpoint-to-standard lowering could handle the bit-widening and composite operations to work. This would handle cases where a target CPU does not have specifically mapped intrinsics. As for CPUs that have specific operations, it would likely require some fixed-point-to-XXX-cpu lowering to exploit specific operations.

I’d also be interested in seeing this. There’s been a few C++ standard proposals in recent years and HLS tools have modeled this kind of thing using C++ classes. The main tradeoff here is that there are several common ways to implement this:

  1. compile-time bitwidths and compile-time fixed-point location. This provides alot of static safety, but requires some careful thought about some things, like the precision required for division.
  2. compile-time bitwidths, but runtime variable fixed-point location. This allows precision to vary and can be used for some types of computation where dynamic ranges can vary, e.g. ‘block floating-point’.
  3. run-time bitwidths and fixed-point location. No safety, but enables exploring different precision options without recompiling.

#1 is most useful for HLS and most machine learning applications, but some architectures would prefer #2. #3 is more useful for design-time exploration.

1 Like

Creating an fxpmath dialect and seeding it with some of the critical operations seems like a great direction. We can accompany all the ops with expansions to std arithmetic (in many cases direct expansion to NEON dialect, AVX512 dialect, etc. first is probably better).

This is one of those cases where prematurely going to fully expanded form introduces a “raising” problem later (in this case probably with LLVM ISel) that we can entirely circumvent.

I’m especially worried about the raising in the vectorized case (harder for LLVM ISel, due to vector constants and lane width changes and such), which happens to be the most important for us.

I think fxpmath can have a small set of ElementwiseMappable ops (just like std arithmetic ops) that linalg can then automatically vectorize, and then we can pattern match to NEON/SVE/AVX512/etc. dialects before going to LLVM.

@bjacob is the fixed point / quantized ML arithmetic expert in my neck of the woods. I’ll see if we can work with him to seed the new dialect.

Do you have any links? would be great if there was something we can build off of.


FYI, a couple of years ahead of its time (and I would do it entirely differently today): llvm-project/ at 8189e6ef908e35d90bee4d13972ce9bb6c8d8966 · llvm/llvm-project · GitHub

(I deleted this in the early days when we were cleaning up some of our original prototypes)


GitHub - johnmcfarlane/cnl: A Compositional Numeric Library for C++ has a bunch of links. One of the fundamental design decisions at the C++ level is also how you think of the resulting fixed point types Alot of this affects conversions from other types. They can be:

  1. an interpretation of a bit pattern in an integer, In this case it’s possible to take existing fixed point code written with integers and ‘reinterpret’ bit-patterns as a fixed point number, enabling them to be more easily manipulated.
  2. A more efficient replacement for a ‘float’ So creating a fixed point number from an integer or a floating point number involves a conversion. interoperability is key here. HLS datatypes (e.g. Using Arbitrary Precision Data Types ac_types/ac_int.h at master · hlslibs/ac_types · GitHub) take this approach
  3. A ‘concept-like’ type where fixed point data types are parameterized by a base type and add an ‘exponent’ This is ‘nice’ from a template concept perspective, but is actually hard to define and use in practice in my experience. Unfortunately, most of the c++ standard proposals go down this route because it’s ‘pure’.