Optimization creates infinite loop

Hello Clang-Developer,

We ran into a – for us big problem – with the Clang compiler. We had the following code:

extern double RT_Time;

void Sample1sec()

{

const double startTime = RT_Time;

while(RT_Time < (startTime + 5.0005));

}

The value of RT_Time will change after a while and the loop is supposed to wait for it. However the assembly code will lead into an endless loop if the condition is not meet the first time:

“?Sample1sec@@YAXXZ”: # @"?Sample1sec@@YAXXZ"

.Lfunc_begin0:

.cv_func_id 0

%bb.0:

.cv_file 1 “C:\Users\h16020\Desktop\Projects\Squirrel\Project\Leviathan\CM_Driver1\CM_Driver1.cpp” “4E92768969662F6E513C11595FEA9D63” 1

.cv_loc 0 1 5 0 # CM_Driver1.cpp:5:0

movsd xmm0, qword ptr [rip + “?RT_Time@@3NA”] # xmm0 = mem[0],zero

.Ltmp0:

#DEBUG_VALUE: Sample1sec:startTime ← $xmm0

movsd xmm1, qword ptr [rip + __real@40140083126e978d] # xmm1 = mem[0],zero

.cv_loc 0 1 0 0 # CM_Driver1.cpp:0:0

addsd xmm1, xmm0

ucomisd xmm1, xmm0

.cv_loc 0 1 6 0 # CM_Driver1.cpp:6:0

jbe .LBB0_2

.Ltmp1:

.p2align 4, 0x90

.LBB0_1: # =>This Inner Loop Header: Depth=1

#DEBUG_VALUE: Sample1sec:startTime ← $xmm0

jmp .LBB0_1

.Ltmp2:

.LBB0_2:

#DEBUG_VALUE: Sample1sec:startTime ← $xmm0

.cv_loc 0 1 7 0 # CM_Driver1.cpp:7:0

ret

.Ltmp3:

.Lfunc_end0:

Marking RT_Time with volatile fixes the problem. But I wonder, why is doing Clang that optimization? RT_Time was not declared ‘const’ so didn’t Clang had to know the value can change?

Is this a bug or do I understand something wrong?

Kind greetings and thank you for the help in advance,

Björn

P.S.

My compile flags (when using -###) are:

clang-cl.exe “-cc1” “-triple” “x86_64-pc-windows-msvc19.16.27034” “-S” “-disable-free” “-disable-llvm-verifier” “-discard-value-names” “-main-file-name” “CM_Driver1.cpp” “-mrelocation-model” “pic” “-pic-level” “2” “-mthread-model” “posix” “-relaxed-aliasing” “-fmath-errno” “-masm-verbose” “-mconstructor-aliases” “-munwind-tables” “-target-cpu” “x86-64” “-mllvm” “-x86-asm-syntax=intel” “-D_MT” “-flto-visibility-public-std” “–dependent-lib=libcmt” “–dependent-lib=oldnames” “-fcxx-exceptions” “-fexceptions” “-fexternc-nounwind” “-fms-volatile” “-fdefault-calling-conv=cdecl” “-fdiagnostics-format” “msvc” “-gcodeview” “-debug-info-kind=limited” “-momit-leaf-frame-pointer” “-ffunction-sections” “-coverage-notes-file” “C:\Users\h16020\Desktop\Projects\Squirrel\Project\Leviathan\CM_Driver1\CM_Driver1.gcno” “-resource-dir” “S:\STSHARE\EXTERNAL\LLVM\lib\clang\9.0.0” “-D” “WIN32” “-D” “NTSYSAD_CM” “-D” “_CONSOLE” “-D” “CODEMODULE” “-D” “__asm=RemoveAssembly” “-D” “_UNICODE” “-D” “UNICODE” “-internal-isystem” “S:\STSHARE\EXTERNAL\LLVM\lib\clang\9.0.0\include” “-internal-isystem” “S:\STSHARE\EXTERNAL\LLVM\Include” “-internal-isystem” “S:\STEXEC\ExecutionEnvironment\INCLUDE” “-internal-isystem” “C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\include” “-internal-isystem” “C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\atlmfc\include” “-internal-isystem” “C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\VS\include” “-internal-isystem” “C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt” “-internal-isystem” “C:\Program Files (x86)\Windows Kits\8.1\Include\um” “-internal-isystem” “C:\Program Files (x86)\Windows Kits\8.1\Include\shared” “-internal-isystem” “C:\Program Files (x86)\Windows Kits\8.1\Include\winrt” “-internal-isystem” “C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\Include\um” “-O2” “-WCL4” “-Wno-error” “-Wno-unused-command-line-argument” “-Wno-microsoft-cast” “-Wno-writable-strings” “-Wno-microsoft-enum-forward-reference” “-Wno-invalid-token-paste” “-fdeprecated-macro” “-fdebug-compilation-dir” “C:\Users\h16020\Desktop\Projects\Squirrel\Project\Leviathan\CM_Driver1” “-ferror-limit” “3000” “-fmessage-length” “0” “-fno-use-cxa-atexit” “-fms-extensions” “-fms-compatibility” “-fms-compatibility-version=19.16.27034” “-std=c++14” “-fno-threadsafe-statics” “-fdelayed-template-parsing” “-finline-functions” “-fobjc-runtime=gcc” “-fno-caret-diagnostics” “-fdiagnostics-show-option” “-fno-show-column” “-vectorize-loops” “-vectorize-slp” “-faddrsig” “-o” “x64\Release\CM_Driver1.asm” “-x” “c++” “CM_Driver1.cpp”

clang-cl.exe “-cc1as” “-triple” “x86_64-pc-windows-msvc19.16.27034” “-filetype” “obj” “-main-file-name” “CM_Driver1.cpp” “-target-cpu” “x86-64” “-dwarf-version=4” “-mrelocation-model” “pic” “-mincremental-linker-compatible” “-o” “x64\Release\CM_Driver1.obj” “x64\Release\CM_Driver1.asm”

Hello,

This isn't really surprising. Since RT_Time is not marked volatile,
nor is it accessed through std::atomic functions (or the equivalent
builtins), the compiler has no reason to assume that any external
actor can change the value of it. So the load is moved out of the loop
since doing it repeatedly would, from the compiler's point of view, be
wasteful.

If RT_Time can be changed by hardware or something along those lines,
then volatile is appropriate. If another thread can change it, you
should be using atomic APIs to access it.

Hey,

Thanks for the answer! As said, volatile fixed the issue for us - however you said:
"the compiler has no reason to assume that any external actor can change the value of it"
I would understand that sentence, when the variable would be declared 'const' - but it was not. Isn't the const statement supposed to declare such situations, when the value will not be changed?

That sounds like any extern variable that will be changed should be marked 'volatile'.

Hello,

The compiler is making the (correct) assumption that RT_Time cannot be
changed by anything other than the currently executing thread since
you have not told it otherwise, and since the loop body (or rather,
lack thereof) does not modify it.

What exactly *is* changing RT_Time? Since the thread that executes
Sample1sec() is obviously intended to spin until it changes, I can
only think that either some hardware is changing it or another thread
is. In which case, see my earlier reply. The fact that it's extern
isn't really relevant; I'm fairly confident Clang would generate the
same code if you removed the extern keyword.

Hey,

You are right with your assumption - RT_Time is coming from a different object file and declared global there. You will also find the second thread there, that will modify the value.

I expected to tell Clang with "extern" that stuff is happening to this value in a different source file, and telling it without having the 'const' specifier, that this value will be changed.

Hello,

So in that case you'd want to do something like:

extern double RT_Time;
void Sample1sec()
{
       const double startTime = RT_Time;
       while (__atomic_load_n (&RT_Time, __ATOMIC_RELAXED) <
(startTime + 5.0005));
}

This is assuming you're unable to change the definition side of
RT_Time. If you *are* able to change it, you can (and should) just
make it an std::atomic<double> and use .load()/.store() with
memory_order_relaxed to interact with it. volatile would be the wrong
tool here since you're dealing with threads.

Hey,

Thanks for the input! Does __atomic_load_n not generated additional overhead? In terms of instruction time and so on

Hello,

On most 64-bit targets, loading a double (a 64-bit quantity) from
memory is atomic already, and relaxed memory order means no memory
barriers, so in practice it's really just informing the compiler that
the variable will be accessed from multiple threads.

Of course, on 32-bit targets, a 64-bit atomic load is not the same
thing as a 64-bit non-atomic load. But if you don't do an atomic load
on such targets and instead use volatile, your program would be likely
to encounter word tearing when reading the variable (putting aside the
fact that the program would be illegal due to undefined behavior
resulting from the data race).

To answer the initial question:

> > > extern double RT_Time;
> > >
> > > void Sample1sec()
> > >
> > > {
> > >
> > > const double startTime = RT_Time;
> > >
> > > while(RT_Time < (startTime + 5.0005));
> > >
> > > }

The above is 1 of two things, an infinite loop (nothing else modifies the value), or a race condition (something else modifies the value in a different thread). Race conditions are Undefined Behavior, so the compiler is allowed to assume that it is not the case.

Interestingly, infinite loops are ALSO undefined behavior, so the compiler would be welcome to actually remove the entire loop if it thought it was valuable.

Thank you Alex! I understood that!

Infinite loops are undefined behaviour? I thought they are a "common" thing in hardware...

Yep they are. See the C paper here that has a great justification for it: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1528.htm

Everything but "while(1)" is UB for infinite loops.