RFC: Improving lit's debug output

Actually we need a bit more, right now the tests marked with "REQUIRES:shell` would be disabled when the internal shell is used.

There is no mode in lit to say “use the internal shell unless the test requires the external one” I believe?

You’re right. Most test suites appear to set execute_external from that result.

I’m not aware of such a mode. Before implementing one, we should probably check how many tests that have REQUIRES: shell actually need it. My grep shows there are only 126 occurrences. [Edit: Make that 165. I forgot multiple items can appear in a single REQUIRES directive.]

1 Like

Thanks @jdenny-ornl!

With that patch I get the following output for “llvm-lit -a […]”:

-- Testing: 1 tests, 1 workers --
PASS: LLVM :: Transforms/ArgumentPromotion/opaque-ptr.ll (1 of 1)
Exit Code: 0

Command Output (stderr):
--
+ echo 'RUN: at line 2:' '/repo/uabelho/llvm-dev2/llvm/build-all/bin/opt -S -passes=argpromotion < /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll | /repo/uabelho/llvm-dev2/llvm/build-all/bin/FileCheck /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll'
RUN: at line 2: /repo/uabelho/llvm-dev2/llvm/build-all/bin/opt -S -passes=argpromotion < /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll | /repo/uabelho/llvm-dev2/llvm/build-all/bin/FileCheck /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll
+ /repo/uabelho/llvm-dev2/llvm/build-all/bin/opt -S -passes=argpromotion
+ /repo/uabelho/llvm-dev2/llvm/build-all/bin/FileCheck /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll

--

********************

Testing Time: 0.05s
  Passed: 1

so now we really do see the actual command that was run, and I can copy/paste it to reproduce.
I think this is an improvement!

An even bigger improvement would be if we did not get both

+ echo 'RUN: at line 2:' '/repo/uabelho/llvm-dev2/llvm/build-all/bin/opt -S -passes=argpromotion < /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll | /repo/uabelho/llvm-dev2/llvm/build-all/bin/FileCheck /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll'

and

RUN: at line 2: /repo/uabelho/llvm-dev2/llvm/build-all/bin/opt -S -passes=argpromotion < /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll | /repo/uabelho/llvm-dev2/llvm/build-all/bin/FileCheck /repo/uabelho/llvm-dev2/llvm/test/Transforms/ArgumentPromotion/opaque-ptr.ll

but I have no idea if that is doable?

Agreed.

With the patch I suggested applied, here is an example execution trace when the external shell is bash:

+ echo 'RUN: at line 1:' 'echo '\''hello world'\'' | FileCheck bogus.txt && echo success'
RUN: at line 1: echo 'hello world' | FileCheck bogus.txt && echo success
+ echo 'hello world'
+ FileCheck bogus.txt
Could not open check file 'bogus.txt': No such file or directory

I’ve thought of a few possible solutions:

  1. Drop set -x entirely
    • New execution trace:
      : 'RUN: at line 1'; echo 'hello world' | FileCheck bogus.txt && echo success
      Could not open check file 'bogus.txt': No such file or directory
      
    • Benefit: Much less clutter on stderr. There would be no + lines. There would be only the full RUN: at line N: command lines plus anything they print to stderr. We could even change it to : 'RUN: at line N'; command (as shown above) so the whole thing could be copied and pasted into a terminal. Then, multiple RUN lines could be copied at once, as long as the commands wrote nothing to stderr in between.
    • Problem: Execution tracing within each RUN line will be lost. That can make it hard to see which commands were executed and which failed in a series joined by &&, ||, |, etc. Given that we’re talking about external shells, there might be more complex shell constructs we’d like to trace too, like (...), {...}, $(cmd), etc.
    • Workaround: The user can add RUN: set -x to a test when they need the full execution trace. I suspect they would not do this permanently in a test as that would circumvent our goal here. Instead, I imagine they would add it temporarily when needing to debug a failed test, but that’s less helpful when debugging failed tests from unattended runs, in CI for example.
  2. Replace echo with :
    • New execution trace:

      + : 'RUN: at line 1:' 'echo '\''hello world'\'' | FileCheck bogus.txt && echo success'
      + echo 'hello world'
      + FileCheck bogus.txt
      Could not open check file 'bogus.txt': No such file or directory
      
    • Benefit: We lose only the echoed RUN: at line N: command but not its corresponding + line or any other + line.

    • Problem: Quoting must be removed from 'command' in the remaining : 'RUN: at line N:' 'command' when copying it to a terminal. That’s not so hard if command is simple, but it’s going to be tedious and error-prone if command itself contains special characters (as in the above example). For this reason, I think I dislike this solution the most.

  3. Use a trick like { set +x; } 2> /dev/null and { set -x; } 2> /dev/null
    • New execution trace:

      RUN: at line 1: echo 'hello world' | FileCheck bogus.txt && echo success
      + echo 'hello world'
      + FileCheck bogus.txt
      Could not open check file 'bogus.txt': No such file or directory
      
    • Benefit: We could use that to suppress only the + echo 'RUN: at line N:' 'command' line while keeping the echoed RUN: at line N: command and all other + lines. The {...} 2> /dev/null part avoids seeing tracing of the set command itself, which would just replace one form of clutter with another.

    • Potential problem: Is it portable to all sh shells? It seems to work for /bin/sh (dash) and /bin/bash on my system. Is there a corresponding windows cmd trick, or will we have to invent another solution there [edit: @echo is the solution there]? If you like this solution, please try it out at a few shell prompts on your system to confirm it works. Here’s a full test with expected output:

      $ /bin/sh # or whatever shell
      $ set -x
      $ echo before; { set +x; } 2> /dev/null; echo a RUN line; { set -x; } 2> /dev/null; echo after
      + echo before
      before
      a RUN line
      + echo after
      after
      

If I were deciding by myself, I’d probably try to land 1 first because it seems to strike a nice balance between ease of implementation and usability while still pushing us to migrate to lit’s internal shell for a better execution trace. But please give your feedback as I might have missed important use cases again.

I realize there’s some subtlety in the above descriptions, and I’m happy to answer questions if I’m not being clear.

2 Likes

Seems good to me:)

1 Like

I’m struggling to compare these options without seeing “BEFORE” vs “AFTER”. From a quick glimpse, I also like the the simplicity of Option1. However,

I’m quite worried about this part. Especially in the context of CI failures on hardware that one has no access to (something like that happened to me recently). There are cases in which these traces are the only avilable indicators of what went wrong.

-Andrzej

I’ve added examples to that post. Let me know if anything is still unclear.

Understood. I had a similar issue recently in CI. There’s some debug info that feels like noise most of the time, but when you need it and you have no obvious way to get it, it’s frustrating.

Would perhaps be nice to get the first step done (to echo the RUN line, even if it also echoes the echo cmd itself).

As @mikaelholmen says such a patch seem to bring back some of the output from --echo-all-commands that went missing. Filtering away the extra echo of the echo command is more like icing on the cake (and that is something that could be done later if it is complicated).

Makes sense. I just created a PR for that at https://github.com/llvm/llvm-project/pull/65267.

I was trying to see if I can switch the MLIR test suite to use the internal shell, but it fails on 10 tests, apparently because it does not like the syntax ENV_VAR=VALUE cmd:

# executed command: TENSOR0=llvm-project/mlir/test/Integration/data/wide.mtx llvm-project/build/bin/mlir-cpu-runner -e entry -entry-point-result=void -shared-libs=llvm-project/build/lib/libmlir_c_runner_utils.so,llvm-project/build/lib/libmlir_runner_utils.so
# .---command stderr------------
# | 'TENSOR0=llvm-project/mlir/test/Integration/data/wide.mtx': command not found
# `-----------------------------
# error: command failed with exit status: 127

It interpret the environment definition as the command to run. How should we rewrite this to be friendly to lit internal shell?
(here is one of the test: https://github.com/llvm/llvm-project/blob/main/mlir/test/Integration/Dialect/SparseTensor/CPU/sparse_matvec.mlir )

Try: env ENV_VAR=VALUE cmd

1 Like

So I was hoping https://github.com/llvm/llvm-project/pull/65267 would help, but it looks like it was reverted and the current situation is that the lit output is very broken. I’ve been seeing a failure locally on a change I’m working on and the commands no longer print anything even remotely actionable. In test/tools/dxil-dis/opaque-value_as_metadata.ll, we have a run line like so:

; RUN: llc --filetype=obj %s -o - 2>&1 | dxil-dis -o - | FileCheck %s

But lit now prints this without any of the redirects or pipes and also out of order:

Command Output (stderr):
--
+ : 'RUN: at line 1'
+ /Users/bogner/code/llvm.org/build/bin/llc --filetype=obj /Users/bogner/code/llvm.org/llvm/test/tools/dxil-dis/opaque-value_as_metadata.ll -o -
+ /Users/bogner/code/llvm.org/build/bin/FileCheck /Users/bogner/code/llvm.org/llvm/test/tools/dxil-dis/opaque-value_as_metadata.ll
+ /Users/bogner/code/llvm.org/build/bin/dxil-dis -o -
/Users/bogner/code/llvm.org/llvm/test/tools/dxil-dis/opaque-value_as_metadata.ll:25:15: error: CHECK-DAG: expected string not found in input

whereas if I git revert b6bd9d275f78 012d844fb856 3db5db92d746 f254bbf23374 c981c533055e I get the much more actionable:

Script:
--
: 'RUN: at line 1';   /Users/bogner/code/llvm.org/build/bin/llc --filetype=obj /Users/bogner/code/llvm.org/llvm/test/tools/dxil-dis/opaque-value_as_metadata.ll -o - 2>&1 | /Users/bogner/code/llvm.org/build/bin/dxil-dis -o - | /Users/bogner/code/llvm.org/build/bin/FileCheck /Users/bogner/code/llvm.org/llvm/test/tools/dxil-dis/opaque-value_as_metadata.ll

Should we just revert all of these changes until they can be done without significant regressions?

Sorry for the trouble. I plan to revert everything associated with this RFC today. When I get more feedback on my fix for PR #65267 and the proposal to avoid duplication in the external shell execution trace, I’ll try to re-land everything, but it might be a while.

Sounds good. Thanks for working on this!

I’ve pushed the reverts, ending at 9f111d990b781409c00700fc1b0d151572507107.

I am watching the bots. I pushed all the reverts at once, but it appears some bots are testing one revert at a time, so we’re going to see some bots fail until they catch up to the current HEAD.

1 Like

I have resurrected the patches that implement the ideas in this RFC:

I have rebased them all onto a recent version of main. I think only the last needs another review, and I think it addresses all remaining concerns I saw. Once it’s accepted, I will land all three at the same time.

2 Likes

I landed these yesterday:

  • 1495d51ee9e71fa58b523ec1703b3e258283481f
  • f223022a4c521296a6d058862d66443ab9e355bf
  • fba457d77e59bff91fe3e01dcaec6cf3d2b5b556
  • aa71680f2948ac177144e1089c58d55b9ac0cef2
  • 30d77fb80857e645b300c8f59cad9414d090e083

The bots seem to be ok. Unless there are additional usability issues, I think the work for this RFC is done.

2 Likes