Dave,
> Unifying array and vector and generalizing the result would open a lot
> of optimization opportunities.
You would be piling an incomplete optimization on top of a pile of already
incomplete optimizations... Vectorization in Fortran is already a "hard" problem,
requiring alias analysis (always an incomplete and inaccurate (conservative)
analysis) and loop-carried array subscript dependence analysis (which is
equivalent to the Diophantine equation problem in mathematics, which is in
general not solvable, so you end up again with incomplete and inaccurate
(conservative) analysis). Doing this in C (without first class array/matrix
types), and with even more alias analysis issues, makes it more problematic.
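
To make the "conservative" point concrete, here is a minimal sketch of the classic GCD test for subscript dependence. The function name may_depend and the restriction to subscripts of the form c*i + k are my own simplifications for illustration, not how any particular compiler does it. For a[c1*i + k1] written and a[c2*j + k2] read, a dependence needs integers i, j with c1*i - c2*j = k2 - k1, which is only possible when gcd(c1, c2) divides k2 - k1.

```c
#include <stdlib.h>

/* Greatest common divisor of two non-negative integers. */
static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Conservative GCD test (hypothetical helper, for illustration only).
 * Subscripts are c1*i + k1 (write) and c2*j + k2 (read).
 * Returns 0 when a dependence is impossible, 1 when one MAY exist.
 * Loop bounds are ignored, so a result of 1 proves nothing.
 */
int may_depend(int c1, int k1, int c2, int k2)
{
    int g = gcd(abs(c1), abs(c2));
    if (g == 0)
        return k1 == k2;          /* both subscripts are loop-invariant */
    return (k2 - k1) % g == 0;
}
```

may_depend(2, 0, 2, 1) returns 0 (even versus odd elements never meet), but may_depend(2, 0, 4, 2) returns 1 and the compiler has to assume the worst, whether or not the loop bounds would actually allow the overlap.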
The final straw with all these multimedia instruction sets is that they require
large alignment on their "packed array" data types (even Intel which started
out not requiring alignment with MMX (though unaligned data incurred huge
performance penalties), did evolve with SSE to what everyone else requires).
Data alignment can (and should!) be analyzed within the same algorithms that do
alias analysis, and the analysis has the same inherent limitations.
The problem is that in real-world applications it is typical for array data slices
(i.e. sections of arrays that are passed to subroutines to be processed) to be unaligned:
even if the array base address is aligned, the bounds of the section being processed
are in general not aligned.
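
A small sketch of what I mean, assuming 16-byte packed registers (SSE-sized) and 4-byte floats. VEC_BYTES, misalignment and slice_misalignment are names I made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u   /* assumed packed-register width in bytes */

/* Bytes by which p sits past the previous VEC_BYTES boundary; 0 = aligned. */
static unsigned misalignment(const void *p)
{
    return (unsigned)((uintptr_t)p % VEC_BYTES);
}

/* Misalignment of the slice that starts at element `first` of a float array. */
unsigned slice_misalignment(const float *base, size_t first)
{
    return misalignment(base + first);
}
```

With a 16-byte-aligned base and 4-byte floats, only slices starting at element 0, 4, 8, ... come out aligned. A caller passing &a[1] or &a[3] hands the subroutine a pointer that is 4 or 12 bytes off, and the callee generally cannot know that at compile time.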
You end up wanting to clone your algorithm kernel for various incoming
alignments (just like memcpy, memcmp, etc. are often cloned internally for
various relative alignments of the incoming arguments), but with a kernel
accessing N different arrays you end up needing 2**N clones, which in general
is an impractical code-explosion.
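
As a rough sketch of the bookkeeping (the names alignment_case and clone_count are hypothetical, and I am again assuming a 16-byte packed register), the run-time dispatch would compute one aligned/unaligned bit per array and use the resulting mask to pick a clone:

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u   /* assumed packed-register width in bytes */

/*
 * Bit k of the result is set when ptrs[k] is NOT VEC_BYTES-aligned.
 * The mask would select one of the 2**n specialized kernel clones.
 */
unsigned alignment_case(const void *const ptrs[], size_t n)
{
    unsigned mask = 0;
    for (size_t k = 0; k < n; k++)
        if ((uintptr_t)ptrs[k] % VEC_BYTES != 0)
            mask |= 1u << k;
    return mask;
}

/* Clones needed if every aligned/unaligned combination gets its own kernel. */
unsigned long clone_count(unsigned n_arrays)
{
    return 1ul << n_arrays;
}
```

A kernel touching 5 arrays already needs 32 clones under this scheme, and that is only distinguishing aligned from unaligned; specializing on each array's actual offset would multiply the count further.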
The reason I object to the use of "vector" and "simd" when describing these
"packed data" multimedia instruction sets is that in practical reality the traditional
vectorization optimization technology just does not apply. You can always
come up with gee-whiz examples where it does, but you cannot make it work
in the general case.
No matter what fancy data shuffling/permuting/inserting/extracting instructions
get added to MMX/SSE/SSE2, they will still not solve the data alignment problem,
so the instruction sets remain incompatible with "traditional vector machines"
where there was always one-data-item-per-HW-register and there was never
any alignment issue.
best,
Peter Lawrence.