[RFC] f32 to bf16 truncation harmonization

I’ve got a patch up (D156362, “[mlir][Arith] Change F32 to BF16 truncation to match __truncsfbf2”) that changes the implementation of arith.truncf in ExpandOps to match __truncsfbf2 in the execution engine. @rsuderman approved this change but warned about potential controversy, so before I land it, I’d like to check for objections.

(I’ll note that LLVM seems to have an even simpler implementation of truncation to bf16 in https://github.com/llvm/llvm-project/blob/1783185790de29b24d3850d33d9a9d586e6bbd39/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp#L3230, but I might be missing context.)
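For readers less familiar with the bit layout: bf16 has the same sign bit and 8-bit exponent as f32 and keeps only the top 7 mantissa bits, so the simplest conceivable conversion keeps the upper 16 bits of the f32 pattern. The sketch below illustrates that idea only. It is not copied from the linked LLVM code or from the patch, and the helper names (`f32FromBits`, `truncF32ToBF16Shift`) are made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: build a float from a raw 32-bit pattern.
static float f32FromBits(uint32_t bits) {
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Hypothetical helper: keep the upper 16 bits of the f32 pattern.
// This rounds toward zero and does no NaN handling, so a NaN whose
// payload lives only in the low 16 bits turns into an infinity.
static uint16_t truncF32ToBF16Shift(float value) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}
```

The NaN caveat in the comment is one reason a software expansion might want more than a bare shift.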

What is the obstacle to supporting different rounding modes on architectures that support them? (I’m trying to understand whether this is a case of “not wanting to clutter the operation with a rounding mode” vs. “no one has yet stepped up to contribute it.”)

If someone added more systematic tracking of the rounding mode to MLIR (or something along those lines), then we could extend the software truncf expansion to account for that rounding mode, at the cost of some clutter.
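To make the “clutter” concrete, here is a rough sketch of what a software truncation parameterized on a rounding mode could look like. Everything in it is an assumption for illustration: the `RoundingMode` enum, the function names, and the choice to quiet NaNs are hypothetical and are not taken from MLIR, the patch, or __truncsfbf2. Only two modes are shown; directed modes toward +/- infinity would need sign-dependent handling on top of this.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical rounding-mode tag; MLIR does not currently carry this
// on arith.truncf as far as this thread is concerned.
enum class RoundingMode { TowardZero, NearestEven };

// Hypothetical helper: build a float from a raw 32-bit pattern.
static float bitsToFloat(uint32_t bits) {
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Sketch of f32 -> bf16 truncation under a given rounding mode.
static uint16_t truncF32ToBF16(float value, RoundingMode mode) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));

  // NaN: exponent all ones and a non-zero mantissa. Set the top
  // mantissa bit so the result stays a (quiet) NaN after the shift.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
    return static_cast<uint16_t>((bits >> 16) | 0x0040u);

  if (mode == RoundingMode::NearestEven) {
    // Add 0x7FFF plus the lowest kept bit, so exact ties round to the
    // value whose lowest kept bit is zero.
    uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
  }
  // TowardZero simply drops the low 16 bits.
  return static_cast<uint16_t>(bits >> 16);
}
```

For example, the pattern 0x3F818000 sits exactly halfway between two bf16 values: `TowardZero` yields 0x3F81, while `NearestEven` yields 0x3F82.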

On the other hand, current code (including LLVM proper) generally uses the “quick” truncate scheme, so I’d like to harmonize on it.