Hi Nate,
1) Vector shl, lshr, ashr
That seems reasonable.
Thanks.
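For reference, the proposal is just the element-wise analogue of the scalar shift instructions; a sketch of what that might look like (syntax illustrative):

```llvm
; Hypothetical syntax: each lane of %x is shifted by the matching lane
; of the second operand, exactly as scalar shl/lshr/ashr behave.
%a = shl  <4 x i32> %x, %y
%b = lshr <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
%c = ashr <4 x i32> %x, %y
```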
2) Vector trunc, sext, zext, fptrunc and fpext
Again, these are hopefully straightforward. Please let me know
if you expect any issues with vector operations that change element
sizes from the RHS to the LHS, e.g. around legalization.
Is the proposed semantics here that the number of elements stays the same, and only the overall vector width changes?
Yes.
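To make that concrete, a sketch of the intended behavior (the element count is fixed; only the element width, and hence the overall vector width, changes):

```llvm
; <4 x i16> (64 bits total) widens to <4 x i32> (128 bits total).
%w = zext  <4 x i16> %v to <4 x i32>
%n = trunc <4 x i32> %w to <4 x i16>
%d = fpext <2 x float> %g to <2 x double>
```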
3) Vector intrinsics for floor, ceil, round, frac/modf
These are operations that are not trivially specified in terms of
simpler operations. It would be nice to have these as overloaded,
target-independent intrinsics, in the same way as llvm.cos etc. are
supported now.
It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the SelectionDAG.
Right, that's what we were thinking too. Glad to hear that makes sense!
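A sketch of how such an intrinsic could look, following the existing overloaded-intrinsic naming convention (the llvm.floor name and .v4f32 suffix here are illustrative, not an existing definition):

```llvm
declare <4 x float> @llvm.floor.v4f32(<4 x float>)

; Rounds each element of %x toward negative infinity.
%r = call <4 x float> @llvm.floor.v4f32(<4 x float> %x)
```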
4) Vector select
We consider a vector select extremely important for a number of
operations. This would be an extension of select to support an <N x
i1> vector mask to select between elements of <N x T> vectors for some
basic type T. Vector min, max, sign, etc. can be built on top of this
operation.
How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.
Note that I don't mean a "select bits", but rather a "select components" operation. In other words, a straightforward vectorization of the existing "select" IR operation.
You can implement the existing LLVM select instruction with bitwise operations, but it's still convenient to have. For one, as I mentioned, it provides an obvious idiomatic way to express operations like min and max.
Vector selection is a common operation amongst languages that support vectors directly, and vector code will often avoid branches by performing some form of predication instead.
I'm really not that concerned about how it's expressed, but there needs to be a well-understood way to lower something that looks like "vector ?:", vector max, etc in a frontend to something that will actually generate good code in the backend. If the idiom for vector float max is "vfcmp, ashr, bitcast, and, xor, and, or, bitcast" and that generates a single maxps from the x86 backend, great. If the idiom is "vfcmp, select", even better.
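For concreteness, here is the bitwise fallback idiom spelled out (a sketch assuming %mask holds all-ones or all-zeros per lane, e.g. a vfcmp result widened with ashr):

```llvm
%ai = bitcast <4 x float> %a to <4 x i32>
%bi = bitcast <4 x float> %b to <4 x i32>
%t0 = and <4 x i32> %mask, %ai                               ; lanes taken from %a
%nm = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>  ; invert the mask
%t1 = and <4 x i32> %nm, %bi                                 ; lanes taken from %b
%t2 = or  <4 x i32> %t0, %t1                                 ; merge the two halves
%r  = bitcast <4 x i32> %t2 to <4 x float>
```

If the backend can pattern-match that whole sequence (plus the compare) into a single maxps, the idiom works; a "vfcmp, select" form would just be far less fragile to match.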
5) Vector comparisons that return <N x i1>
This is maybe not a must-have, and perhaps more a question of
preference. I understand the current vfcmp/vicmp semantics, returning
a vector of iK (where K matches the bit width of the elements being
compared) with the high bit set or cleared, are there for pragmatic
reasons, and that these instructions exist to aid code emitted that
uses machine-specific intrinsics.
I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, <128 x i1> for SSE, and it's strange when you have to spill and reload it.
I definitely am not thinking of <128 x i1>. The intent is really just to express "here is a vector of values where only one bit matters in each element" and have codegen map that to a representation appropriate to the machine being targeted.
Of course these eventually need to be widened to an appropriately sized integer vector, and that vector may be a mask or a 0/1 value, or whatever. The responsibility for doing so can be placed at pretty much any level of the stack, all the way up to making the user worry about it.
For us, making the user worry about it isn't an option; we have first-class bools in our frontend, support vectors of these, and naturally define comparisons, selection, etc. to be consistent with this. So that leaves it to either the part generating LLVM IR, or the LLVM backend/mid-end. My inclination is that this job is best done in an SSA representation, and with a specific machine in mind. If this is completely infeasible, we'll have to handle it during generation of LLVM IR, which is fine. I'd just rather see it done in LLVM, where it can hopefully benefit others with the same issues.
To me this isn't much different whether we're talking about vectors or scalars; it's about the utility of an "i1" type generally, and about whose responsibility it is to map such a type to hardware.
The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.
OK. My understanding of this was based on an email from Chris to llvm-dev. I had asked how these were used today, especially given the lack of vector shifts. Here's his response:
They can be used with target-specific intrinsics. For example, SSE
provides a broad range of intrinsics to support instructions that LLVM
IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for
more details.
If you have examples on how these are expected to be used today without machine intrinsics, that would really help - e.g. for expressing something like a vector max.
For code that does not use machine intrinsics, I believe it would be
cleaner, simpler, and potentially more efficient, to have a vector
compare that returns <N x i1> instead. For example, in conjunction
with the above-mentioned vector select, this would allow a max to be
expressed simply as a sequence of compare and select.
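Under the proposed semantics, a vector float max would be just (a sketch reusing the scalar fcmp/select mnemonics):

```llvm
; The compare yields a <4 x i1> mask; select picks per element.
%m   = fcmp ogt <4 x float> %a, %b
%max = select <4 x i1> %m, <4 x float> %a, <4 x float> %b
```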
Having gone down this path, I'd have to disagree with you.
OK. Hopefully my comments above will make my reasoning clearer. If not please let me know. You have a lot more experience with this in LLVM than me so pardon my ignorance :). I'm not suggesting that these semantics are absolutely the way to go, but we do need some way to address these issues.
In addition to the above suggestions, I'd also like to hear what
others think about handling vector operations that aren't powers of
two in size, e.g. <3 x float> operations. I gather the status quo is
that only POT sizes are expected to work (although we've found some
bugs for things like <2 x float> that we're submitting). Ideally
things like <3 x float> operands would usually be rounded up to the
size supported by the machine directly. We can try to do this in the
frontend, but it would of course be ideal if these just worked. I'm
curious if anyone else out there has dealt with this already and has
some suggestions.
Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.
Right, we've dealt with this in the past by effectively scalarizing loads and stores (since simply extending a load might go beyond a page boundary, etc.), but extending all register operations. I think this addresses the address calculation and alignment issues?
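A sketch of that scheme for a <3 x float> load (scalarized element loads so we never touch bytes past the third float, then widening to a legal <4 x float> for register operations; syntax illustrative):

```llvm
%p0 = getelementptr float, float* %p, i32 0
%p1 = getelementptr float, float* %p, i32 1
%p2 = getelementptr float, float* %p, i32 2
%e0 = load float, float* %p0
%e1 = load float, float* %p1
%e2 = load float, float* %p2
%v0 = insertelement <4 x float> undef, float %e0, i32 0
%v1 = insertelement <4 x float> %v0, float %e1, i32 1
%v2 = insertelement <4 x float> %v1, float %e2, i32 2
; %v2 now participates in ordinary <4 x float> arithmetic; lane 3 is undef.
```

Stores go the other way: extractelement each of the three lanes and store them individually.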
Thanks for the feedback, I hope to hear more.
Stefanus