[cc’ing cfe-dev because this may require some interpretation of language law]
My understanding is that the compiler has the freedom to access extra data in C/C++ (not sure about other languages); AFAIK, the LLVM LangRef is silent about this. In C/C++, this is based on the “as-if rule”:
http://en.cppreference.com/w/cpp/language/as_if
So the question is: where should the optimizer draw the line with respect to perf vs. security if it involves operating on unknown data? Are there guidelines that we can use to decide this?
The masked load transform referenced below is not unique in accessing / operating on unknown data. In addition to the related scalar loads → vector load transform that I’ve mentioned earlier in this thread, see for example:
https://llvm.org/bugs/show_bug.cgi?id=20358
(and the security paper and patch review linked there)
Regarding accessing extra data, there are at least some limits as to what can be accessed. You can’t generate extra loads or stores to volatiles. You can’t generate extra stores to atomics, even if the extra stores appear to be the same value as the old value.
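For example (a minimal sketch; the variable names are mine, purely illustrative):

#include <stdatomic.h>

volatile int hw_reg; /* e.g. a memory-mapped device register */
_Atomic int ready;

int poll_twice(void) {
  // Volatile accesses must occur exactly as written: the compiler
  // may not merge these two reads or invent a third.
  return hw_reg + hw_reg;
}

void publish(int *data, int value) {
  *data = value;
  // Another thread may synchronize on 'ready', so the compiler may
  // not insert an extra store here, even one of the same value.
  atomic_store_explicit(&ready, 1, memory_order_release);
}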
As for determining where the perf vs. security line should be drawn, I would argue that most compilers have gone too far on the perf side while optimizing undefined behavior: dead store elimination leaving passwords in memory, integer overflow checks getting optimized out, and NULL checks optimized away. Linus Torvalds was complaining about those just recently on this list, and while I don’t share his tone, I agree with him regarding the harm these optimizations can cause.
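Minimal C sketches of the first two hazards (function names are mine, for illustration):

#include <string.h>

void check_password(void) {
  char pw[64];
  /* ... read and verify the password here ... */

  // Dead store elimination hazard: pw is never read again, so the
  // compiler may legally delete this memset and leave the password
  // bytes sitting in memory.
  memset(pw, 0, sizeof(pw));
}

int add_checked(int a) {
  // Undefined-behavior hazard: signed overflow is undefined, so the
  // compiler may assume a + 100 cannot wrap and delete this check.
  if (a + 100 < a)
    return -1;
  return a + 100;
}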
If I’m understanding correctly, for your specific cases, you are wondering if it is fine to load and operate on a floating point value that the user did not specifically request you to operate on. This could cause (at least) two different problems.
First, it could cause a floating point exception. I think the danger of the floating point exception should rule out loading values the user didn’t request.
Second, loading values the user didn’t specify could enable a timing attack. The timing attack is scary, but I don’t think it is something we can really fix in the general case. As long as individual assembly instructions have impractical-to-predict execution times, we will be at the mercy of the current hardware state. There are timing attacks that can determine TLS keys in a different VM instance based off of how quickly loads in the current process execute. If our worst timing attack problems are floating point denormalization issues, then I think we are in a pretty good state.
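To make the first hazard concrete, a hedged C sketch (my own example, not from the patch under review):

// The source touches only x[0], x[1], and x[3]. If the optimizer
// widened this into a 4-wide vector add that also operates on x[2],
// and x[2] happened to hold a signaling NaN, an FP exception
// (FE_INVALID) could be raised where the original program raised
// nothing.
float sum_three(const float *x) {
  return x[0] + x[1] + x[3];
}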
I agree with Craig. This reference here, https://nebelwelt.net/publications/files/15LangSec.pdf,
talks about the issues related to accessing unintended extra memory and also suggests some
solutions that could be provided in the compiler to give the user some flexibility in choosing between perf and security.
Shahid
Hi Ben -
Thanks for your response. For the sake of argument, let’s narrow the scope of the problem to eliminate some of the variables you have rightfully cited.
Let’s assume we’re not dealing with volatiles, atomics, or FP operands. We’ll even guarantee that the extra loaded value is never used. This is, in fact, the scenario that http://reviews.llvm.org/rL263446 is concerned with.
Related C example:
typedef int v4i32 __attribute__((vector_size(16)));
// Load some almost-consecutive ints as a vector.
v4i32 foo(int *x) {
  int x0 = x[0];
  int x1 = x[1];
  // int x2 = x[2]; // U can’t touch this?
  int x3 = x[3];
  return (v4i32) { x0, x1, 0, x3 };
}
For x86, we notice that we have nearly a v4i32 vector’s worth of loads, so we just turn that into a vector load and mask out the element that’s getting set to zero:
movups (%rdi), %xmm0 ; load 128-bits instead of three 32-bit elements
andps LCPI0_0(%rip), %xmm0 ; put zero bits into the 3rd element of the vector
Should that optimization be disabled by a hypothetical -fextra-secure flag?
I’m by no means an expert on security, and I’m not even going to discuss the implications of that, but surely a load of, say, 4 x i32, with or without a mask, will still pull those elements at least into the cache. Whether they are then loaded into some hardware register inside the CPU is an implementation detail for the given processor, but one could imagine that on some hardware the masking is done by a suitable AND with a constant, so even a masked load is not guaranteed NOT to read that data. The PROCESS can certainly read that data anyway, so my rather naive view of how secure software works suggests that this is a very extreme case of worrying about something that gdb can see, for example.
I’m having a hard time finding any problems here, at least as long as the skipped value is in the middle. I wouldn’t expect the contents of x[2] to affect the timing or power usage of anything. I guess there would be a minor “bad” side effect in that a memory read watchpoint would trigger with the 128-bit load that wouldn’t be there with the 32-bit loads. I think it is semantically very similar to this situation as well…
v4i32 first_call(int *x) { // use all of the array
  int f0 = x[0];
  int f1 = x[1];
  int f2 = x[2];
  int f3 = x[3];
  return (v4i32) { f0, f1, f2, f3 };
}
v4i32 second_call(int *x) { // use some of the array
  int s0 = x[0];
  int s1 = x[1];
  int s2 = 0;
  int s3 = x[3];
  return (v4i32) { s0, s1, s2, s3 };
}
first_call(x);
second_call(x);
The implementation isn’t going to zero out the stack in between those calls, so for a short period of time, the memory location of s2 will contain x[2].
I’m less sure if the gaps are on the edges. I’m worried that you might end up crossing some important address boundary if you look at something earlier or later than what the user requested.
We are careful not to try this optimization where it would extend the range of loaded memory; this is purely for what I call a “load doughnut”. 
Reading past either specified edge would be very bad because it could cause a memory fault / exception where there was none in the original program. That’s definitely not legal.
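A sketch of the case we refuse (the array’s size and placement are hypothetical):

// If 'a' points at a 3-int array that ends exactly at a page
// boundary, widening these three loads into one 4-int vector load
// would touch the next, possibly unmapped, page and could fault
// where the original program could not.
int sum_first_three(const int *a) {
  return a[0] + a[1] + a[2];
}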
> I'm less sure if the gaps are on the edges. I'm worried that you might
> end up crossing some important address boundary if you look at something
> earlier or later than what the user requested.
Oh yes, you certainly can't do it on the edges in most cases. It can
very easily cross a page boundary and segfault your process.
Cheers.
Tim.
Hypothesizing about architectures with larger-than-cacheline vectors, there’s also the possibility of changing externally observable behavior by loading a cacheline-sized donut hole.
I believe that a very conservative set of criteria that nonetheless license this almost all the time is something like
- both ends of vector are used
- vector is smaller than a cacheline
- not FP, or arch doesn’t raise flags on FP loads (most)
- not volatile or atomic
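As a sketch in C (the struct and its fields are invented for illustration; this is not LLVM’s actual API):

#include <stdbool.h>

/* Stand-ins for whatever the optimizer knows about the candidate. */
struct LoadGroup {
  bool first_elt_used, last_elt_used;
  unsigned vector_bytes, cacheline_bytes;
  bool is_fp, fp_loads_raise_flags;
  bool is_volatile, is_atomic;
};

static bool may_form_doughnut_load(const struct LoadGroup *g) {
  return g->first_elt_used && g->last_elt_used   /* both ends used  */
      && g->vector_bytes < g->cacheline_bytes    /* within a line   */
      && (!g->is_fp || !g->fp_loads_raise_flags) /* no FP flags     */
      && !g->is_volatile && !g->is_atomic;       /* ordinary memory */
}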
– Steve
GVN will widen loads and stores up to a power of two. We’ve had numerous problems with this. Our architecture supports byte-granularity memory protection, so often this widening will cause a trap (if you try to access byte 4 of a 3-byte array, then you will get bad behaviour). Even without this, it often causes us to generate very bad code. Two adjacent 32-bit stores are replaced by a single 64-bit store, only now the store is not correctly aligned (the compiler knows that the stores have 4-byte alignment).
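A hedged sketch of that store-widening pattern:

// Two adjacent stores, each 4-byte aligned. Merging them into a
// single 64-bit store leaves that store only 4-byte aligned.
void store_pair(int *p) { /* assume p is 4- but not 8-byte aligned */
  p[0] = 1;
  p[1] = 2;
}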
On MIPS, the store case isn’t too bad if we tell the compiler that the target doesn’t support unaligned accesses, because it then generates a pair of sdl / sdr instructions (which doesn’t actually save us anything over a pair of sw instructions, and is slower on some implementations). If we tell the compiler that we do support unaligned loads and stores (which we’d like to, because they’re sufficiently rare that we get better code in most cases if we trap and emulate the few where the compiler assumed alignment that wasn’t really there), then we take a trap and emulate the access. This was particularly fun in the FreeBSD kernel, where the code that triggered this ‘optimisation’ was the code that handled unaligned load and store exceptions…
For the load case, it’s significantly worse, because doing a couple of loads from within the same cache line is very cheap, while doing a single load and then some masking is more expensive. On a simple pipeline, it costs more simply because you have more instructions (and a load from L1 is very cheap). On a more complex pipeline, you’ve turned a sequence of independent operations into a sequence with a dependency, and so you lose out on ILP. If the two loads span a cache line boundary (as the two stores did in the case that we saw), then you’re now hitting a *really* slow microcode path on a lot of recent processors, whereas the naïve code generation would have produced a small number of micro-ops.
In short, I am very nervous of adding more of these optimisations, because the ones that we do today are not always sensible.
David
> Hypothesizing about architectures with larger-than-cacheline vectors, there’s also the possibility of changing externally observable behavior by loading a cacheline-sized donut hole.
> I believe that a very conservative set of criteria that nonetheless license this almost all the time is something like
> - both ends of vector are used
> - vector is smaller than a cacheline
As illustrated by David’s followup, this should really be “vector is smaller than the granularity of cache and memory protection on the targeted architecture.”
Not if the load is aligned. Which you can easily guarantee if you’re doing “load a big thing and mask/extract”, which means you’re generating the load address for the big thing by clearing the lower bits of the address that would have been used for the small thing.
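For illustration, the address generation amounts to something like this (a sketch, assuming a 16-byte vector):

#include <stdint.h>

// Derive the wide load's base by clearing the low bits of the
// narrow access's address; the resulting 16-byte load is aligned,
// so it cannot straddle a cache line on hardware with 64-byte
// (or larger) lines.
static const void *align_down_16(const void *p) {
  return (const void *)((uintptr_t)p & ~(uintptr_t)15);
}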
Alignment should also be taken into account. Consider the memory locations:
ABCDEFGH
If you have a vector ABCD and a vector EFGH, of which you access C and F, then you might consider creating a vector CDEF. The vector is smaller than a cache line, but still may span a cache line boundary. I’ve not looked at the latest generation, but with earlier AVX implementations an unaligned load or store could be a factor of around 30 slower than an aligned one. You may end up with significantly slower code if you attempt to coalesce these two memory operations.
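In C, the situation looks something like this (buffer alignment assumed for illustration):

// Eight ints A..H, with 'buf' itself 16-byte aligned. C is buf[2]
// and F is buf[5], so a coalesced 16-byte load of C,D,E,F starts
// at byte offset 8: unaligned for a 16-byte access, and possibly
// straddling a cache-line boundary.
int pick_c_and_f(const int *buf) {
  return buf[2] + buf[5];
}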
David
> We are careful not to try this optimization where it would extend the
> range of loaded memory; this is purely for what I call a "load doughnut".
> Reading past either specified edge would be very bad because it could
> cause a memory fault / exception where there was none in the original
> program. That's definitely not legal.
Perhaps this PR would interest you:
PR25474 – Copy construction performs unsolicited reads of volatile derived-class members in padding.