Strange types on x86 vcvtph2ps and vcvtps2ph intrinsics

Hi,

I was looking at the x86 vector intrinsics for converting half
precision floating point numbers and I'm a bit confused as to why
certain types were chosen. I've used the current definitions
successfully, but I'd like to understand why these intrinsics are
typed the way they are.

For reference see ``include/llvm/IR/IntrinsicsX86.td``. Here are the
intrinsics of interest.

let TargetPrefix = "x86" in {  // All intrinsics start with "llvm.x86.".
  def int_x86_vcvtph2ps_128 : GCCBuiltin<"__builtin_ia32_vcvtph2ps">,
              Intrinsic<[llvm_v4f32_ty], [llvm_v8i16_ty], [IntrNoMem]>;
  def int_x86_vcvtph2ps_256 : GCCBuiltin<"__builtin_ia32_vcvtph2ps256">,
              Intrinsic<[llvm_v8f32_ty], [llvm_v8i16_ty], [IntrNoMem]>;
  def int_x86_vcvtps2ph_128 : GCCBuiltin<"__builtin_ia32_vcvtps2ph">,
              Intrinsic<[llvm_v8i16_ty], [llvm_v4f32_ty, llvm_i32_ty],
                        [IntrNoMem]>;
  def int_x86_vcvtps2ph_256 : GCCBuiltin<"__builtin_ia32_vcvtps2ph256">,
              Intrinsic<[llvm_v8i16_ty], [llvm_v8f32_ty, llvm_i32_ty],
                        [IntrNoMem]>;
}

Here's what seems weird to me:

* For the 4 wide intrinsics (``_128`` suffix) some of the types are
wider than they need to be. For example ``int_x86_vcvtph2ps_128``
takes <8 x i16> as an argument but this intrinsic only uses the first
four lanes so why is the argument type not <4 x i16>?
``int_x86_vcvtps2ph_128`` also has the same oddity but on its return
type (returns <8 x i16> but only the first four are relevant).

* The use of ``i16`` types also seems a little strange given that the
more semantically correct ``f16`` type and vectorized forms (e.g.
``llvm_v4f16_ty``) are available. Sure, I can use a bitcast with the
intrinsics to get the type I want in the IR, but why was ``i16``
chosen over ``f16``?

Any ideas?

Thanks,
Dan.

Hi Dan,

> * For the 4 wide intrinsics (``_128`` suffix) some of the types are
> wider than they need to be. For example ``int_x86_vcvtph2ps_128``
> takes <8 x i16> as an argument but this intrinsic only uses the first
> four lanes so why is the argument type not <4 x i16>?
> ``int_x86_vcvtps2ph_128`` also has the same oddity but on its return
> type (returns <8 x i16> but only the first four are relevant).

One reason is that <4 x i16> is too small to be a legal SSE vector
type, so the IR intrinsics, much like the Intel C intrinsics and the
instructions, are defined in terms of the widened <8 x i16> (with
either __m128, or xmm registers).
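
To make the lane layout concrete, here is a small Python model of the 128-bit half-to-single conversion. It is only a sketch of the semantics described above, not LLVM or Intel code: the names ``cvtph2ps_128`` and ``widen_v4i16`` are made up for illustration, and Python's ``struct`` format ``e`` (IEEE 754 binary16) stands in for the hardware conversion.

```python
import struct


def cvtph2ps_128(v8i16):
    """Model of the 128-bit conversion: 8 x i16 in, 4 x float out.

    Only the low four lanes are read; the upper four are ignored.
    (Hypothetical helper, named after the intrinsic for readability.)
    """
    if len(v8i16) != 8:
        raise ValueError("expected 8 lanes of 16 bits")
    return [
        struct.unpack("<e", struct.pack("<H", lane & 0xFFFF))[0]
        for lane in v8i16[:4]
    ]


def widen_v4i16(v4i16):
    """Pad a 4-lane vector to the 8 lanes the 128-bit form expects."""
    if len(v4i16) != 4:
        raise ValueError("expected 4 lanes of 16 bits")
    return list(v4i16) + [0, 0, 0, 0]


halves = [0x3C00, 0x4000, 0xBC00, 0x0000]  # 1.0, 2.0, -1.0, 0.0 as binary16
print(cvtph2ps_128(widen_v4i16(halves)))
```

Whatever sits in the upper four lanes has no effect on the result, which is why a caller holding only four half values can pad with anything (zeros here) before the call.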

> * The use of ``i16`` types also seems a little strange given that the
> more semantically correct ``f16`` type and vectorized forms (e.g.
> ``llvm_v4f16_ty``) are available. Sure, I can use a bitcast with the
> intrinsics to get the type I want in the IR, but why was ``i16``
> chosen over ``f16``?

f16 wasn't, until recently, very well supported. It still has rough
edges on targets without native scalar register classes such as X86.

Instead, these targets use i16, and do the conversion with other
(native) FP types using the dedicated convert.to/from.fp16 intrinsics.
We match that here and use an i16 element type.
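
As a rough illustration of what "use i16 and convert" means, the following Python sketch carries a half value around as a plain 16-bit integer and only turns it into a floating-point number at the conversion points. The function names are invented for this example; they merely mimic the direction of the convert.from.fp16 and convert.to.fp16 intrinsics. Note also that Python floats are doubles, so this models the idea rather than the exact f32 path.

```python
import struct


def half_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit integer as IEEE 754 binary16 and widen it."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def float_to_half_bits(value: float) -> int:
    """Narrow a float to binary16 and return the raw 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


stored = float_to_half_bits(1.0)       # an ordinary integer from here on
print(hex(stored))
print(half_bits_to_float(stored) + 0.5)  # arithmetic happens in a wider type
```

Between the two conversions the value is just an integer, so it can be loaded, stored, and passed around with ordinary 16-bit integer operations.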

Someday, we'll get rid of these intrinsics and use half everywhere,
but we're not there yet!

HTH,
-Ahmed

Hi,

>> Here's what seems weird to me:
>>
>> * For the 4 wide intrinsics (``_128`` suffix) some of the types are
>> wider than they need to be. For example ``int_x86_vcvtph2ps_128``
>> takes <8 x i16> as an argument but this intrinsic only uses the first
>> four lanes so why is the argument type not <4 x i16>?
>> ``int_x86_vcvtps2ph_128`` also has the same oddity but on its return
>> type (returns <8 x i16> but only the first four are relevant).

> One reason is that <4 x i16> is too small to be a legal SSE vector
> type, so the IR intrinsics, much like the Intel C intrinsics and the
> instructions, are defined in terms of the widened <8 x i16> (with
> either __m128, or xmm registers).

Ah I see. Makes sense.

>> * The use of ``i16`` types also seems a little strange given that the
>> more semantically correct ``f16`` type and vectorized forms (e.g.
>> ``llvm_v4f16_ty``) are available. Sure, I can use a bitcast with the
>> intrinsics to get the type I want in the IR, but why was ``i16``
>> chosen over ``f16``?

> f16 wasn't, until recently, very well supported. It still has rough
> edges on targets without native scalar register classes such as X86.

What do you mean by "register classes"? Sorry if this is a dumb question.

> Instead, these targets use i16, and do the conversion with other
> (native) FP types using the dedicated convert.to/from.fp16 intrinsics.
> We match that here and use an i16 element type.

I remember seeing that intrinsic in the language reference, but
unfortunately ``convert.to.fp16`` [1] isn't useful for what I'm
working on because it doesn't specify a rounding mode. fp16 has so
little precision that the rounding mode **really matters**.
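
A quick way to see how much the rounding mode matters is to narrow the same value under two modes and compare. The Python sketch below is an illustration under stated assumptions, not how any LLVM intrinsic is implemented: the helper names are hypothetical, round-to-nearest-even comes from ``struct``'s ``e`` format, and round-toward-zero is derived from it by stepping one bit pattern back whenever the nearest-even result overshot in magnitude. It only handles finite inputs inside the half range.

```python
import struct


def half_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit integer as IEEE 754 binary16 and widen it."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def half_bits_nearest_even(value: float) -> int:
    """Narrow to binary16 with round-to-nearest-even (struct's behaviour)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_toward_zero(value: float) -> int:
    """Narrow to binary16 with round-toward-zero (truncation).

    Assumes a finite input within half range. If nearest-even rounded
    away from zero, the previous bit pattern is the truncated result,
    because binary16 magnitudes are ordered like their low 15 bits.
    """
    bits = half_bits_nearest_even(value)
    if abs(half_bits_to_float(bits)) > abs(value):
        bits -= 1  # one ULP closer to zero
    return bits


x = 2051.0  # halfway between the representable halves 2050.0 and 2052.0
print(half_bits_to_float(half_bits_nearest_even(x)))  # 2052.0
print(half_bits_to_float(half_bits_toward_zero(x)))   # 2050.0
```

Near 2048 neighbouring half values are already 2.0 apart, so the choice of mode changes the stored result by a whole unit. As far as I understand the ISA documentation, the ``i32`` operand of the ``vcvtps2ph`` intrinsics is the instruction's immediate, whose low bits select the rounding behaviour, which is what makes those intrinsics usable where the rounding mode has to be pinned down.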

> Someday, we'll get rid of these intrinsics and use half everywhere,
> but we're not there yet!

Okay.

> HTH,

Very helpful, thanks.

[1] http://llvm.org/docs/LangRef.html#llvm-convert-to-fp16-intrinsic

Thanks,
Dan.

>> f16 wasn't, until recently, very well supported. It still has rough
>> edges on targets without native scalar register classes such as X86.

> What do you mean by "register classes"?

Here's an improvised vague definition: a register class is the set of
registers that can be used interchangeably in some specific context.

So, in this case, on X86 (see lib/Target/X86/X86RegisterInfo.td), we
have the 128-bit uses of xmm registers (part of the VR128 register
class), but also the scalar equivalents (in e.g. "addsd %xmm"):
FR32/FR64.

Since we have no way of copying, storing, etc. the lowest 16 bits of
an xmm register, any f16 scalar will need to be legalized, usually to
f32.

Given that clang, for __fp16, only generates loads, stores, and
conversions (via libcalls), it's simpler and usually more efficient to
represent half values as i16 instead: i16 can be loaded, stored, and
passed around, since we do have i16 register classes (%ax, %r9w, etc.
in GR16).

Now, one can argue that using i16 instructions for f16 is different
from representing __fp16 as i16 in IR. That's legitimate, but until
recently we haven't needed __fp16 for anything other than the above,
so why bother with the additional complexity? Again, this will
change, hopefully soon!

> Sorry if this is a dumb question.

There's no such thing ;)

-Ahmed