LLVM Optimizations strange behavior/bug

Hi,

during a reverse engineering challenge I used clang/llvm optimizations to minimize some code, and I found some strange behavior that I can’t reproduce with GCC or CL (the Visual Studio compiler).

The C function operates on a data array and a password supplied via argv. The compiled binary works as long as I don’t activate any optimizations. When I activate optimizations (>= -O1), the code is folded into a handful of constant stores, which sounds great at first but is not correct. I can reproduce this with clang 3.9 and 4.0. GCC 5.4 and VS CL >= 2015 do not show this behavior.

.text:0000000000400570 ; __int64 __fastcall DecryptBlock(unsigned __int8 *)
.text:0000000000400570 public DecryptBlock(unsigned char *)
.text:0000000000400570 DecryptBlock(unsigned char *) proc near ; CODE XREF: main+5p
.text:0000000000400570 mov cs:byte_60106F, 54h
.text:0000000000400577 mov cs:byte_60106E, 0CDh
.text:000000000040057E mov cs:byte_60106D, 0BFh
.text:0000000000400585 mov cs:byte_60106C, 1Bh
.text:000000000040058C mov cs:byte_60106B, 0E4h
.text:0000000000400593 mov cs:byte_60106A, 28h
.text:000000000040059A mov cs:byte_601069, 56h
.text:00000000004005A1 mov cs:byte_601068, 0ACh
.text:00000000004005A8 mov rax, 0F61EA263E1103088h
.text:00000000004005B2 mov cs:Plaintext, rax
.text:00000000004005B9 retn
.text:00000000004005B9 DecryptBlock(unsigned char *) endp

Any idea whether this is a bug, or why clang shows this behavior?

Thanks,

Peter Garba

I’ve attached the sample code to the mail. Please ignore the comments and the style of the code ;)

main.cpp (4.58 KB)

The code is pretty rife with undefined behaviour. Casting a "char *"
pointer to an "unsigned *" and dereferencing it violates strict
aliasing (actually just doing the cast is dodgy, but usually not a
problem in practice).

When I change those lines to use memcpy instead and compile with
-fsanitize=undefined, apparently 4 of the shift operations are
shifting by a negative amount (also undefined behaviour). I expect
Clang is marking those as undef and simplifying everything down to a
constant based on that.
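As a rough illustration of the memcpy change described above, the sketch below reads and writes a 32-bit value through a byte pointer without casting it. The attached main.cpp is not shown in this thread, so the helper names `load_u32` and `store_u32` are hypothetical and not taken from the original code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from main.cpp): read 32 bits from a byte
   buffer via memcpy instead of *(unsigned *)p, so no strict-aliasing
   or alignment rule is violated. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical counterpart for writing 32 bits back into the buffer. */
static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Round-trip check at a deliberately unaligned offset. The byte order
   inside buf is the host's, exactly as with the pointer cast. */
void load_store_demo(void)
{
    unsigned char buf[8] = {0};
    store_u32(buf + 1, 0xE1103088u);
    assert(load_u32(buf + 1) == 0xE1103088u);
}
```

Optimizing compilers typically turn a fixed-size memcpy like this into a single load or store, so the change should not cost anything at -O1 and above.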

Certainly I start getting non-constant results when I fix those. Also,
beware that left-shifting a signed int is only valid if the input is
non-negative and the result still fits, and you can only shift by 0
up to 1 less than the bit-width of the type. Generally you almost
always want to do bitwise fiddling on unsigned quantities because of
that first one.
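To make those rules concrete, here is a minimal sketch of a left shift on an unsigned 32-bit value that refuses counts outside the range C defines. The name `checked_shl32` is made up for this example; it is only one possible way to surface the out-of-range shifts that the sanitizer reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: performs x << count only when count is in the
   range C defines for a 32-bit operand (0..31). Returns false and leaves
   *out untouched for negative or too-large counts. */
static bool checked_shl32(uint32_t x, int count, uint32_t *out)
{
    if (count < 0 || count >= 32)
        return false;
    /* Unsigned operand: bits shifted out the top are simply discarded. */
    *out = (uint32_t)(x << count);
    return true;
}

void checked_shift_demo(void)
{
    uint32_t r = 0;
    assert(checked_shl32(0x80000001u, 1, &r) && r == 2u);
    assert(!checked_shl32(1u, -3, &r));   /* negative count rejected */
    assert(!checked_shl32(1u, 32, &r));   /* count == bit-width rejected */
}
```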

Cheers.

Tim.

What you see here is a 1:1 translation of RISC-V assembler code into C code.

It may contain some obfuscations, like shifting by a negative amount, but this is why I use compiler optimizations to remove such patterns.
But with the provided sample, clang/llvm seems to be too optimistic with its optimizations compared to GCC and CL, and I would really like to
get this fixed so that I can use the output LLVM IR for further optimization and get the same behavior as with the other compilers.

Thanks,
Peter

> What you see here is a 1:1 translation of RISC-V assembler code into C
> code.

The built-in operator semantics apparently differ between C and RISC-V
assembler, then.
Thus the "1:1 translation" needs to take that into account.

> It may contain some obfuscations like shifting by a negative amount, but this
> is why I use compiler optimizations to remove such patterns.

An optimizing compiler operates on a program based on the semantics of the
language (in this case, C).
Shifting by a negative amount is undefined behaviour in C.

> But with the provided sample clang/llvm seems to be too optimistic with
> optimizations compared to GCC and CL and I really would like to
> get this one fixed to use the output LLVM IR for further optimization on
> the optimized code and get the same behavior as the other compilers.

It does not sound like it is a bug. Either end (the input program or the
compiler) can change.
You may want the compiler to change (in which case, you may want to explore
implementing options/modes which "fix" the semantics of certain cases of
undefined behaviour).
Perhaps a more pragmatic solution is to add functions in the input program
which implement the RISC-V semantics in C (and call those).
You may also want to disable strict aliasing (-fno-strict-aliasing).
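One way the suggested helper functions could look is sketched below. It assumes the source is RV32, where the register shifts (SLL, SRL, SRA) use only the low five bits of the shift-amount register. The function names are invented for this example, and the translated code would have to call them in place of the bare `<<` and `>>` operators.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers modelling RV32 register shifts in well-defined C.
   Only the low 5 bits of the shift amount are used, so any count,
   including a "negative" one reinterpreted as unsigned, is valid. */

static uint32_t rv32_sll(uint32_t x, uint32_t shamt)
{
    return (uint32_t)(x << (shamt & 31u));
}

static uint32_t rv32_srl(uint32_t x, uint32_t shamt)
{
    return x >> (shamt & 31u);
}

/* Arithmetic right shift built from unsigned operations, because >> on
   a negative signed value is implementation-defined in C. */
static uint32_t rv32_sra(uint32_t x, uint32_t shamt)
{
    shamt &= 31u;
    uint32_t logical = x >> shamt;
    /* If the sign bit is set, fill the vacated high bits with ones. */
    uint32_t fill = (x & 0x80000000u) ? (uint32_t)~(0xFFFFFFFFu >> shamt) : 0u;
    return logical | fill;
}

void rv32_shift_demo(void)
{
    assert(rv32_sll(1u, 37u) == 32u);                   /* 37 & 31 == 5  */
    assert(rv32_sll(1u, (uint32_t)-1) == 0x80000000u);  /* -1 & 31 == 31 */
    assert(rv32_srl(0x80000000u, 31u) == 1u);
    assert(rv32_sra(0x80000000u, 4u) == 0xF8000000u);
}
```

Because every operation here is defined for all inputs, the optimizer has no undefined behaviour to exploit, and the `& 31` usually folds away when the count is a known constant.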

A shift by a negative amount is probably OK, so long as you understand
the consequences.
E.g. on a lot of CPUs (such as x86):
1(32bit) << 37 == 32
(What's happening is that the shifter in the CPU only uses the low five bits of the count.)
If you write this in C code, you would expect:
1(32bit) << 37 == 0

So, when representing that in C code, you really need to write it as:
1(32bit) << (37 & 0x1f) == 32 (making the C representation equivalent
to the ASM instruction).
Similarly, you would need to add the "AND" into the LLVM bitcode.

Kind Regards

James