How to vectorize a vector type cast?

Since Clang does not seem to allow type casts, such as uchar4 to float4, between vector types, it seems it is necessary to write them as element by element conversions, such as

typedef float float4 __attribute__((ext_vector_type(4)));
typedef unsigned char uchar4 __attribute__((ext_vector_type(4)));

float4 to_float4(uchar4 in)
{
    float4 out = {in.x, in.y, in.z, in.w};
    return out;
}

Running this code through "clang -c -emit-llvm" and then through "opt -O2 -S" produces the following IR:

define <4 x float> @to_float4(i32 %in.coerce) nounwind uwtable readnone {
entry:
  %0 = bitcast i32 %in.coerce to <4 x i8>
  %1 = extractelement <4 x i8> %0, i32 0
  %conv = uitofp i8 %1 to float
  %vecinit = insertelement <4 x float> undef, float %conv, i32 0
  %2 = extractelement <4 x i8> %0, i32 1
  %conv2 = uitofp i8 %2 to float
  %vecinit3 = insertelement <4 x float> %vecinit, float %conv2, i32 1
  %3 = extractelement <4 x i8> %0, i32 2
  %conv4 = uitofp i8 %3 to float
  %vecinit5 = insertelement <4 x float> %vecinit3, float %conv4, i32 2
  %4 = extractelement <4 x i8> %0, i32 3
  %conv6 = uitofp i8 %4 to float
  %vecinit7 = insertelement <4 x float> %vecinit5, float %conv6, i32 3
  ret <4 x float> %vecinit7
}

This performs the cast as a sequence of scalar operations, whereas it could be done with a single vector conversion:

  %1 = uitofp <4 x i8> %0 to <4 x float>
  ret <4 x float> %1

It seemed to me that the recently committed basic block vectorizer might be able to do this kind of optimization, but the current version does not do so.

Is this optimization the kind of thing that the bb-vectorizer is intended to be able to do? And, if so, do you have any suggestions as to how that may be done? Or, if not, can you suggest another possible way to parallelize this kind of code?

Thanks,

Preston

Since Clang does not seem to allow type casts, such as uchar4 to float4,
between vector types, it seems it is necessary to write them as element by
element conversions, such as

typedef float float4 __attribute__((ext_vector_type(4)));
typedef unsigned char uchar4 __attribute__((ext_vector_type(4)));

float4 to_float4(uchar4 in)
{
    float4 out = {in.x, in.y, in.z, in.w};
    return out;
}

I think that's right... we can represent them in IR, but I don't think
clang has a generic way to write them outside OpenCL mode. Granted,
you can use platform-specific intrinsics (_mm_cvttps_epi32 etc.).
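
For reference, here is a minimal sketch of a whole-vector conversion, assuming a compiler that provides __builtin_convertvector (later Clang releases, and GCC 9 and later; it may not exist in the Clang under discussion here). The name to_float4_builtin is made up for illustration, and the typedefs use vector_size so that the snippet builds with both compilers; with Clang the ext_vector_type typedefs above should work as well.

```c
/* Assumption: the compiler provides __builtin_convertvector. */
/* 16 bytes = 4 floats, 4 bytes = 4 unsigned chars. */
typedef float float4 __attribute__((vector_size(16)));
typedef unsigned char uchar4 __attribute__((vector_size(4)));

/* Hypothetical helper: converts all four lanes in one expression
   instead of extracting and converting each element separately. */
float4 to_float4_builtin(uchar4 in)
{
    return __builtin_convertvector(in, float4);
}
```

Since the conversion is written as one vector expression, it can be lowered directly to a vector uitofp rather than relying on a later pass to recombine scalar conversions.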

Running this code through "clang -c -emit-llvm" and then through "opt -O2
-S" produces the following IR:

define <4 x float> @to_float4(i32 %in.coerce) nounwind uwtable readnone {
entry:
  %0 = bitcast i32 %in.coerce to <4 x i8>
  %1 = extractelement <4 x i8> %0, i32 0
  %conv = uitofp i8 %1 to float
  %vecinit = insertelement <4 x float> undef, float %conv, i32 0
  %2 = extractelement <4 x i8> %0, i32 1
  %conv2 = uitofp i8 %2 to float
  %vecinit3 = insertelement <4 x float> %vecinit, float %conv2, i32 1
  %3 = extractelement <4 x i8> %0, i32 2
  %conv4 = uitofp i8 %3 to float
  %vecinit5 = insertelement <4 x float> %vecinit3, float %conv4, i32 2
  %4 = extractelement <4 x i8> %0, i32 3
  %conv6 = uitofp i8 %4 to float
  %vecinit7 = insertelement <4 x float> %vecinit5, float %conv6, i32 3
  ret <4 x float> %vecinit7
}

This performs the cast as a sequence of scalar operations, whereas it could be
done with a single vector conversion:

  %1 = uitofp <4 x i8> %0 to <4 x float>
  ret <4 x float> %1

It seemed to me that the recently committed basic block vectorizer might be
able to do this kind of optimization, but the current version does not do
so.

Yes, that seems reasonable.

-Eli