Windows Lit problems again. What should we do?

Some clang lit tests are failing sporadically on Windows since Friday

Why? Because revision 114187 (Use a temporary file for output which
gets renamed after all the writing is finished.) is causing problem on
Windows:

The command:
FileCheck blabla <%t %s doesn't work on Windows now.

example:

1// RUN: %clang_cc1 -triple i386-apple-darwin9 -emit-llvm -o %t %s
2// RUN: FileCheck --check-prefix CHECK-FRAGILE < %t %s

3// RUN: %clang_cc1 -triple x86_64-apple-darwin10
-fobjc-nonfragile-abi -emit-llvm -o %t %s
4// RUN: FileCheck --check-prefix CHECK-NONFRAGILE < %t %s

The reason is that when "FileCheck < %t %s" is called (line 2), the
lifetime of the handle to file "%t" is controlled by Windows. We don't
know exactly when it is released.
Then clang is called again (line 3) and because of revision 114187,
the clang frontend try to move a file to "%t" but Windows refuses
because the handle from the previous < redirection (line 2) is not
released.

Hence we get error from the bot like:
"error: unable to rename temporary
'D:/public/zorg/buildbot/osuosl/slave/clang-i686-xp-msvc9/llvm/build/tools/clang/test/CodeGenObjC/Output/image-info.m.tmp-000000'
to output file 'D:\public\zorg\buildbot\osuosl\slave\clang-i686-xp-msvc9\llvm\build\tools\clang\test\CodeGenObjC\Output\image-info.m.tmp':
'Can't move 'D:/public/zorg/buildbot/osuosl/slave/clang-i686-xp-msvc9/llvm/build/tools/clang/test/CodeGenObjC/Output/image-info.m.tmp-000000'
to 'D:/public/zorg/buildbot/osuosl/slave/clang-i686-xp-msvc9/llvm/build/tools/clang/test/CodeGenObjC/Output/image-info.m.tmp':
Access is denied."

is that clear?

well I can think of 1 solutions:

1: change all instances of
// RUN: FileCheck (blabla) < %t %s
by
// RUN: FileCheck (blabla.) -input-file %t %s

is that ok or is there a better way?

This is obnoxious, but I'm ok with it.

Daniel?

-eric

Another solution would be to modify lit to use a different temp '%t'
filename for every clang invocation within the same testcase.

Wait, we can create a new %t (before r114187) but not move a file to it ?
Can you try the attached move_fix1.diff and, if it doesn't work, move_fix2.diff ?

-Argiris

move_fix1.diff (585 Bytes)

move_fix2.diff (700 Bytes)

Wait, we can create a new %t (before r114187) but not move a file to it ?

strangely yes.

none of your 2 patches make any difference. Your second patch doesn't
make sense. It delete the source file before renaming it. I suppose
you wanted to delete the destination file. I tried that, also doesn't
work.

Can you try the attached move_fix1.diff and, if it doesn't work, move_fix2.diff ?

The only way I found to make it work is to try to rename, then if it
fail, Sleep(50), then try again. Repeat 3-4 times. then it works.. But
clearly this is not a correct solution.

I just learned that lit has it own Shell parser/interpreter. The
problem might be fixed there I hope.

I am using Process Monitor
(Process Monitor - Windows Sysinternals | Microsoft Docs) to
watch all filesystem activities during a lit python run. I see strange
things.. I'll have more later.

Wait, we can create a new %t (before r114187) but not move a file to it ?

strangely yes.

none of your 2 patches make any difference. Your second patch doesn't
make sense. It delete the source file before renaming it.

Oops!

I suppose
you wanted to delete the destination file.

Yup :slight_smile:

I tried that, also doesn't
work.

Can you try the attached move_fix1.diff and, if it doesn't work, move_fix2.diff ?

The only way I found to make it work is to try to rename, then if it
fail, Sleep(50), then try again. Repeat 3-4 times. then it works.. But
clearly this is not a correct solution.

I just learned that lit has it own Shell parser/interpreter. The
problem might be fixed there I hope.

I am using Process Monitor
(Process Monitor - Windows Sysinternals | Microsoft Docs) to
watch all filesystem activities during a lit python run. I see strange
things.. I'll have more later.

Can you verify whether this can occur outside of lit ?

Implementing rename as a copy+delete fixes the problem... I think we
should go with that on windows.

1: change all instances of
// RUN: FileCheck (blabla) < %t %s
by
// RUN: FileCheck (blabla.) -input-file %t %s

This is obnoxious, but I'm ok with it.

Daniel?

No, I'm not ok with this.

I don't understand what the current problems are, but they should be treated as bugs and solved IMHO.

- Daniel

Another solution would be to modify lit to use a different temp '%t'
filename for every clang invocation within the same testcase.

This doesn't work, many test cases reuse a %t for one reason or other.

The simple solution to all of these problems is to just disable the new tempfile code on Windows.

- Daniel

Can you look at the patch I posted on llvm-commit: doing a rename as a
copy+delete fixes the problem on Windows.

IMO, it is a better solution than disabling the temp code on Windows.
Disabling the new code means clang will behave differently on Windows
vs Unix. + introduce a platform #ifdef inside the driver.

Some clang lit tests are failing sporadically on Windows since Friday

Why? Because revision 114187 (Use a temporary file for output which
gets renamed after all the writing is finished.) is causing problem on
Windows:

The command:
FileCheck blabla <%t %s doesn't work on Windows now.

example:

1// RUN: %clang_cc1 -triple i386-apple-darwin9 -emit-llvm -o %t %s
2// RUN: FileCheck --check-prefix CHECK-FRAGILE < %t %s

3// RUN: %clang_cc1 -triple x86_64-apple-darwin10
-fobjc-nonfragile-abi -emit-llvm -o %t %s
4// RUN: FileCheck --check-prefix CHECK-NONFRAGILE < %t %s

The reason is that when "FileCheck < %t %s" is called (line 2), the
lifetime of the handle to file "%t" is controlled by Windows. We don't
know exactly when it is released.

I don't completely follow the sequence of events causing a failure. I
would need to look closer, but this sounds to me like either a clang
bug or a lit bug.

Presumably one of the two processes is either keeping a handle longer
than it should, or is creating a handle before it should.

I might find some time to investigate this today. Is it deterministic?
Is any particular test most likely to trigger it?

- Daniel

Can you look at the patch I posted on llvm-commit: doing a rename as a
copy+delete fixes the problem on Windows.

IMO, it is a better solution than disabling the temp code on Windows.
Disabling the new code means clang will behave differently on Windows
vs Unix. + introduce a platform #ifdef inside the driver.

I agree in principle, it is nice to have behavior be the same across
platforms. However, this is also intended to be a user-transparent QOI
improvement for Clang and it is closely tied to how Unix handles
files. If the underlying FS in Windows can't support the same QOI
tweak then it is reasonable to not do it.

Copy + delete seems suboptimal to me. It means Clang will be hitting
the FS twice as hard as need be.

My feeling is we should first figure out exactly why this fails, and
disabling the code is a good short term way to get the buildbot to
stop complaining while we investigate.

- Daniel

I might find some time to investigate this today. Is it deterministic?
Is any particular test most likely to trigger it?

The tests that trigger the failure are:
\tools\clang\test\CodeGenObjC\bitfield-access.m
\tools\clang\test\CodeGenObjC\image-info.m
\tools\clang\test\CodeGenObjC\metadata_symbols.m
\tools\clang\test\CodeGenObjC\ns-constant-strings.m

Some observation:
- It is not deterministic. Sometimes all the tests pass, most of the
time at least 1 of them will fail.
- The tests will always pass if run alone.
- The test will always pass if -j=1
- Doing a sleep long enough before renamePathOnDisk will fix the problem.
- changing "< %t %s" with "-input-file=%t %s" will fix the problems.

My theory is that lit, python or windows is holding a handle to '%t'
during the < redirect operation for too long which cause the second
run of clang to fail.

I was thinking of it as a workaround while the problem was investigated, but failed to communicate that part :slight_smile:

Disabling the code (as you suggested later) is totally a better idea.

-eric

I committed the change (copy+delete) because Chris Lattner gave me the
ok and also to unblock the buildbot.
You are right this is suboptimal, I'll revisit the problem soon for a
clean solution.

Hi Daniel.. this lit patch fixes the problem.. I am still trying to
understand exactly why.. I moved the closing of the opened_files
before the procs[-1].communicate statement. I see a FIXME close by!!
any idea?

Would you be ok with that patch?

lit_access_denied.patch (837 Bytes)

Hi Francois,

Excellent find, thank you very very much! I owe you a beer. :slight_smile:

I applied a tweaked version of this patch in r115040 with a slightly
more verbose comment in place about why the closing is done where it
is.

Does this mean that we can revert r114320 now?

- Daniel

Yes I reverted. I did about 20 test run without a single failure.

Great, thanks Francois!

- Daniel