Implicit Defs and Uses are ignored by pre-RA schedulers

Hello,

In our Kalray LLVM backend, we have builtins to get and set system registers. One of them is $CS, which has sticky bits enforcing rounding mode or storing masked floating-point exceptions. The equivalent on AArch64 would be FPCR.

In our user code, we would like to preserve the partial ordering between a SET to $CS and a floating-point operation, since the SET might modify the rounding mode. Similarly, we would like to preserve the partial ordering between a GET from $CS and a floating-point operation, since user code might want to examine the floating-point exception bits right after a given floating-point operation.

Another use case we have is the following: we have a coprocessor that is turned on by setting a given bit in a system register, which can be accessed through a builtin. Such a SET instruction must execute before any coprocessor instruction, so the compiler should not break that dependency when reordering instructions.

We have tried to implement this by using implicit Defs and implicit Uses in our instruction definitions, using for example `Defs = [CS] in` and `Uses = [CS]` where relevant in our Target Description files.

I have been running some experiments, examining the scheduling outputs and the dependencies (using VLIWScheduler in pre-RA, PostRASchedulerList in post-RA, and a child of VLIWPacketizerList for bundling).

I have found that the implicit defs and uses are indeed taken into account by the post-RA schedulers. However, they seem to be ignored by the pre-RA schedulers. Also, they do not appear as dependencies in the SelectionDAG.

Looking at what some other backends do: AArch64 does not seem to model anything on FPCR. PowerPC marks MFFS as a scheduling barrier (isSchedulingBoundary) to prevent floating-point instructions from being ordered above it, but isSchedulingBoundary seems to be used only by post-RA schedulers; pre-RA schedulers do not seem to care about it.

The bad consequence for us: our programmers have to wrap the SET instructions (touching system registers) in non-inlined functions to keep the compiler from breaking anything.
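As an illustration of this workaround, here is a minimal C sketch. The names `fake_cs`, `set_cs`, `get_cs`, and `add_with_mode` are hypothetical, and a volatile global stands in for $CS; real code would call the target's builtin inside the wrapper. The `noinline` attribute is the GCC/Clang spelling.

```c
#include <stdint.h>

/* Assumption: a volatile global stands in for the $CS system register. */
static volatile uint64_t fake_cs;

/* Non-inlined wrappers: the register access hides behind an opaque call. */
__attribute__((noinline)) void set_cs(uint64_t value) { fake_cs = value; }
__attribute__((noinline)) uint64_t get_cs(void) { return fake_cs; }

/* The addition is written after the call and is expected to stay there. */
float add_with_mode(uint64_t mode, float a, float b) {
    set_cs(mode);
    return a + b;
}
```

Nothing in the language guarantees that a floating-point operation stays on its side of the call, so this is a stopgap rather than a fix.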

We are looking for advice on how to treat this problem. We have possible leads, like modifying the SelectionDAG to recover these dependencies, or modifying the schedulers to scan the SelectionDAG and enforce the source order when such a dependency is detected (maybe by having a look at how SourceScheduler works), but we have not yet investigated them fully.
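The second lead could look roughly like the toy list scheduler below. It is a sketch in plain C, not LLVM code: `Instr`, `conflicts`, `schedule`, `SAMPLE`, and the register bits are all made up for the example. Each instruction carries its uses and defs (implicit ones included) as bit masks, and an instruction only becomes ready once every earlier instruction it conflicts with has been emitted, so conflicting instructions keep their source order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSTRS 32

/* Hypothetical register bits for this sketch. */
enum { REG_CS = 1u << 0, REG_R0 = 1u << 1, REG_R1 = 1u << 2 };

typedef struct {
    const char *name;
    uint32_t uses;  /* registers read, implicit ones included */
    uint32_t defs;  /* registers written, implicit ones included */
    int priority;   /* higher means the scheduler wants it earlier */
} Instr;

/* True if swapping a and b would break a RAW, WAR or WAW dependence. */
static bool conflicts(const Instr *a, const Instr *b) {
    return (a->defs & (b->uses | b->defs)) || (a->uses & b->defs);
}

/* Greedy list scheduling: fills order[0..n) with source indices.
   Returns false if n exceeds MAX_INSTRS. */
bool schedule(const Instr *code, size_t n, size_t *order) {
    bool done[MAX_INSTRS] = {false};
    if (n > MAX_INSTRS) return false;
    for (size_t slot = 0; slot < n; ++slot) {
        size_t best = n;
        for (size_t i = 0; i < n; ++i) {
            if (done[i]) continue;
            /* Ready only if no earlier, still pending instruction conflicts. */
            bool ready = true;
            for (size_t j = 0; j < i; ++j) {
                if (!done[j] && conflicts(&code[j], &code[i])) {
                    ready = false;
                    break;
                }
            }
            if (ready && (best == n || code[i].priority > code[best].priority))
                best = i;
        }
        done[best] = true;  /* the first pending instruction is always ready */
        order[slot] = best;
    }
    return true;
}

/* Source order: set $CS, an unrelated load, then an FP add that uses $CS. */
const Instr SAMPLE[] = {
    {"set_cs", 0,      REG_CS, 0},
    {"load",   0,      REG_R1, 1},
    {"fadd",   REG_CS, REG_R0, 5},
};
```

With `SAMPLE`, `fadd` has the highest priority but implicitly uses $CS, so it stays after `set_cs` while the independent `load` is hoisted. Clearing the `uses` mask of `fadd` lets it move to the front, which mirrors the pre-RA behavior described above.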

Any such advice would be greatly appreciated.

Also, another related issue: it would seem that the flag -ffp-exception-behavior=strict does not preserve the exception semantics like it says it does. Although the generated IR seems to preserve it, there does not seem to be anything in the LLVM backends enforcing the "strict" floating-point exception behavior.

That last point can be seen in the following piece of code (Compiler Explorer):

    long fpcr;

    int toto(float a, float b, float c, double d, double e) {
        float bc = b + c; // first faddd
        asm("mrs %[result], FPCR" : [result] "=r" (fpcr) : :);
        float abc = a + bc; // second faddd
        float dw = (float) d; // fwidenlwd : should not happen before the second faddd
        float ew = (float) e;
        int dw_ewl = (int) dw + (int) ew;
        int abcl_dw_ewl = (int) abc + dw_ewl;
        return abcl_dw_ewl;
    }

Compiling this piece of code with clang 11.0.0 for ARMv8-a gives the following assembly code:

    toto:
            fadd    s1, s1, s2
            fcvt    s2, d3
            fadd    s0, s1, s0
            fcvt    s3, d4
            fcvtzs  w9, s2
            fcvtzs  w10, s0
            add     w9, w10, w9
            fcvtzs  w10, s3
            add     w0, w9, w10
            adrp    x9, fpcr
            //APP
            mrs     x8, FPCR
            //NO_APP
            str     x8, [x9, :lo12:fpcr]
            ret

Notice that the mrs was moved below all the floating-point instructions, which does not seem to preserve the floating-point exception semantics of the source code.

PS: apologies for the double message, if any; I sent the first one to llvm-dev-bounces by mistake.

Best regards,

Cyril Six
Compiler Engineer • Kalray
csix@kalrayinc.com • https://www.kalrayinc.com/


Did you try `hasSideEffects = 1`?

I’m not familiar with AArch64. On X86, we have separate FPCR and FPSR: the former is used for control (rounding, exception masks) and the latter for status. We model every FP instruction that may raise an exception with `mayRaiseFPException = 1` and an implicit use of FPCR. Note that an instruction reading FPCR is just another use of FPCR, not a def, so there is no need to keep it in source order relative to the FP instructions; only a write to FPCR needs that. I guess the same reasoning applies to AArch64? Maybe you can check the write to FPCR.
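To make the use/def distinction concrete, here is a small C sketch (not LLVM code; `Instr`, `reg_dependence`, and the three sample instructions are hypothetical). It classifies the register dependence between two instructions from their use and def masks: a pair that only reads FPCR carries no dependence, while any pair involving a write does.

```c
#include <stdint.h>

/* Hypothetical register bit for this sketch; not an LLVM identifier. */
enum { REG_FPCR = 1u << 0 };

typedef struct {
    const char *name;
    uint32_t uses;  /* registers read, implicit ones included */
    uint32_t defs;  /* registers written, implicit ones included */
} Instr;

typedef enum { DEP_NONE, DEP_RAW, DEP_WAR, DEP_WAW } DepKind;

/* Register dependence of `later` on `earlier`, checked in RAW, WAR, WAW order. */
DepKind reg_dependence(const Instr *earlier, const Instr *later) {
    if (earlier->defs & later->uses) return DEP_RAW;
    if (earlier->uses & later->defs) return DEP_WAR;
    if (earlier->defs & later->defs) return DEP_WAW;
    return DEP_NONE;  /* two uses of the same register may be reordered */
}

const Instr WRITE_FPCR = {"write_fpcr", 0, REG_FPCR};
const Instr READ_FPCR  = {"read_fpcr", REG_FPCR, 0};
const Instr FP_ADD     = {"fp_add", REG_FPCR, 0};
```

In this model a write followed by `fp_add` yields `DEP_RAW`, while `read_fpcr` and `fp_add` in either order yield `DEP_NONE`.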

Correct. You do need to add the required support to your backend.

The X86, PowerPC, and SystemZ backends have basically complete support. The PowerPC backend has a fix to not reschedule floating-point instructions around function calls if the rounding mode may change. I haven’t heard that the other two have this fix. AArch64 and RISC-V support are both a work in progress so one of the three fully-supported targets is best to examine and emulate.

Also be aware that optimization of strict floating-point is a work in progress, so be prepared for not-so-great performance.

Lastly, there’s currently no way to have machine-specific llvm intrinsics respect “strict” mode. A fix has been proposed, but I don’t think anything has been implemented.

It might have been clang 12 where a warning was introduced that told you that “strict” floating-point doesn’t work for that target and is therefore disabled. I don’t remember exactly which release first had this.

Thanks a lot for the replies,

Did you try `hasSideEffects = 1`?

I’m not familiar with AArch64. On X86, we have separate FPCR and FPSR: the former is used for control (rounding, exception masks) and the latter for status. We model every FP instruction that may raise an exception with `mayRaiseFPException = 1` and an implicit use of FPCR. Note that an instruction reading FPCR is just another use of FPCR, not a def, so there is no need to keep it in source order relative to the FP instructions; only a write to FPCR needs that. I guess the same reasoning applies to AArch64? Maybe you can check the write to FPCR.

Thanks

Phoebe

On our end, hasSideEffects = 1 and mayRaiseFPException = 1 (combined with implicit Defs and Uses of our $CS register) do not seem to be enough to prevent the reordering of floating-point instructions in pre-RA scheduling.

Correct. You do need to add the required support to your backend.

The X86, PowerPC, and SystemZ backends have basically complete support. The PowerPC backend has a fix to not reschedule floating-point instructions around function calls if the rounding mode may change. I haven't heard that the other two have this fix. AArch64 and RISC-V support are both a work in progress so one of the three fully-supported targets is best to examine and emulate.

Also be aware that optimization of strict floating-point is a work in progress, so be prepared for not-so-great performance.

Lastly, there's currently no way to have machine-specific llvm intrinsics respect "strict" mode. A fix has been proposed, but I don't think anything has been implemented.

It might have been clang 12 where a warning was introduced that told you that "strict" floating-point doesn't work for that target and is therefore disabled. I don't remember exactly which release first had this.

--
Kevin P. Neal
SAS/C and SAS/C++ Compiler
Compute Services
SAS Institute, Inc.

Thank you for the answer - it confirms what I have been seeing.

I will take a closer look at these backends, especially PowerPC's fix to not reschedule floating-point instructions above function calls.

Cyril S