As we discussed in the following topics, we found that the nsw flag which stands for “no signed wrap” helps some optimizations. On the contrary, integer overflow is prohibited by the Fortran standard so the flags should be added to arithmetic integer operations as Clang does.
My prototype implementation seems to work well but I’d like to confirm the design has no problem. Mainly, I have two concerns about the implementation.
Options to switch the behavior
There was a discussion that we should add the nsw flag under an option because we’ve not determined the impact on optimizations yet. @tblah told me that I could try -Ofast, but I’m not sure -ffast-math is suitable for this case since it’s the option for floating-point operations. Fortunately, Clang (and Gfortran) have the options for this case, -f[no-]wrapv and -f[no-]strict-overflow. I think the flags can be added to the operations when -fno-wrapv or -fstrict-overflow is specified.
FYI: Gfortran also has the option -fno-wrapv-pointer and enables -fno-wrapv and -fno-wrapv-pointer if -fstrict-overflow is specified. On the other hand, Clang doesn’t have the option -fno-wrapv-pointer and that leads both options are almost the same. I think Flang will follow Clang.
Flang doesn’t accept these options at the moment so I should introduce them into Flang at first. Note that -fno-wrapv should be enabled by default but -fwrapv will be enabled by default until we determine the impact on optimizations.
My implementation only works in LoweringBridge which lowers the parse tree to HLFIR, so the flags aren’t added to operations in transformation passes. IIUC, there isn’t a structure passing the information of options to them so the flags would be always added to operations. I’d like to make this RFC focus on introducing the options into Flang and adding the flags using the options and ignore this issue.
How to implement
There are also some tiny concerns about implementation.
I referred to the Clang implementation for my implementation and then Flang came to add the flags to operations in simple expressions. (FYI: I implemented the feature only in the gen function for binary operations in HlfirBuilder.) However, the flags aren’t added to some operations in strucured operations such as DoLoopOp and that means I have to add flags to operations everywhere they are created though I think it’s hard. So I’m going to split the patch into small pieces and post bit by bit. (Of course, your help is welcome.)
On the other hand, I found that I can apply the implementaion for the fast-math flag to that for the nsw flag and it seems to add flags to all operations easily in lowering.
IMHO, the feature should be implemented like Clang, but I’m not sure. So, I’d like your comments.
My implementation seems to work well and the flags are added to all arithmetic operations.
If there is no objection to this RFC, I’ll create PRs next week.
Though I can add the flag to all arithmetic integer operations, I’m not sure all operations should have the nsw flag when -fno-wrapv is enabled considering the meaning of the option. It doesn’t mean that some operations shouldn’t have the flag. I’d like to say that whether some operations have the flag wouldn’t depend on this option.
IIUC, The option is related to the language specification. According to 10.1.5.2.4 in the Fortran 2018 standard, the execution of any numeric intrinsic operation whose result is not defined by the arithmetic used by the processor is prohibited. However, some internal operations such as DO-variable increment aren’t referred to. Therefore, I suspect that the flag should be added only to such numeric intrinsic operations when the option is specified. In other words, I would implement as originally planned.
To add nsw flag to DO-variable increment, I would introduce another new option -flang-experimental-integer-overflow into Flang. This option doesn’t add the flag to all operations to keep consistent with -fwrapv. (In addition, it is also because it’s hard to pass the information of options to transformation passes as I said before.) -flang-experimental-integer-overflow will be enabled by default and deleted in the future. On the other hand, -fno-wrapv will also be enabled by default but -f[no-]wrapv will remain.
I would appreciate it if you could give me some comments.
I created the PR for -flang-experimental-integer-overflow (adding nsw to do-variable increment).
On the other hand, it was found that the patch for -fno-wrapv (adding nsw to numeric intrinsic operations) had a problem. The patch added nsw to operations which can overflow. I’m considering the solution, but it would take some more time to investigate.
In the last call, I told that I was thinking of using integer range analysis and adding flags to operations that are found not to overflow. However, it was found that the idea wouldn’t work well for my target loop because the loop bounds aren’t constant.
So, I would try adding the flag only to index calculations of fir.array_coor or fircg.ext_array_coor. To keep the transformation safe, I would introduce some constraints on it.
The bit sizes of the values of the array bounds are less than or equal to those of the results of index calculations
The indices of the array are linear experssions (a*i+b)
I’d like to ask whether the constraints are sufficient.
Thank you for your comments in the last call. I’d like to organize my idea and provide some examples.
The targets of this implementation are indicies as the input of ext_array_coor, not operations that ext_array_coor includes internally. In other words, the new pass for this would run before FIRToLLVMLowering pass.
The following is an example of codes and generated LLVM IR.
The addition generating %14 in LLVM IR can have nsw, and this implementation would add nsw to it.
As I said, I would introduce some constraints on the inputs of ext_array_coor. The following code is an example that shouldn’t have nsw. (It’s not a practical code, but it would be accepted.)
In the LLVM IR, the addition generating %34 can have nsw while the one generating %28 cannot because i+128 as signed integer might overflow. AFAIK, however, there is no way to confirm it at the moment. I suspect programmers don’t recognize that OUT_OF_RANGE has a code which can overflow. Therefore, I conservatively think all operations in this index calulation should have no nsw even though index overflow is non-conforming.
This implementation would accept only the following index patterns. (a is an array, m and n are constants, and i is a variable.)
a(m*i)
a(i+n)
a(m*i+n)
As you know, however, this idea has an obvious defect. It ignores many patterns which can have nsw actually. (e.g. a((m+i)*n)) I started with the above conditions to make the implementation as simple as possible.
It seems I misunderstood before and this will involve some pattern matching on arguments to array_coor. @jeanPerier had concerns about this.
Are the particular index patterns you are allowing motivated by real code that does vectorise with other fortran compilers?
From my perspective, the problem here is that I don’t feel qualified to judge whether the non-conforming behavior is relied upon by Fortran programs in the wild. Perhaps if it could be demonstrated that other widely-used compilers already make similar assumptions, this would give some confidence that these transformations will not break existing (although non-conforming) programs.
I have seen real performance improvements from using -fno-wrapv and so I would be very happy if we can figure out something that could be enabled by default. Thank you for your work so far!
Thank you @yus3710-fj for taking the time to illustrate.
My small concern with IR based pattern matching was that many thing may have happened on the IR before you try to add this flag, and I mainly had CSE in mind.
For instance, in the example below with a(i+128) = merge(i, 0, out_of_range(i,1_1)), the adds from the OUT_OF_RANGE and the one from the addressing could be fused.
In this case, the fact i+128 appeared in the addressing should imply that OUT_OF_RANGE add cannot overflow, so it is safe to add nsw anyway. But one could think of cases were CSE could merge the adds with OUT_OF_RANGE reachable and the addressing unreachable in case of overflow, in which case adding nsw on the CSEed add would be incorrect. Here is an example below (CSE currently does not happen, but nothing technically prevents it).
Are the particular index patterns you are allowing motivated by real code that does vectorise with other fortran compilers?
I found one of the loops in TSVC (s243) is the case. Flang doesn’t vectorize it while GFortran does. Flang can vectorize it if index calculation (i+1) has the flag.
From my perspective, the problem here is that I don’t feel qualified to judge whether the non-conforming behavior is relied upon by Fortran programs in the wild. Perhaps if it could be demonstrated that other widely-used compilers already make similar assumptions, this would give some confidence that these transformations will not break existing (although non-conforming) programs.
I apologize if I have misunderstood your concerns.
I previously mentioned that the flag would always be added to index calculations regardless of the -fno-wrapv option. However, I don’t have concrete evidence to support this claim. It would be difficult to confirm other widely-used compilers do the similar things. Therefore, I now believe that the -fno-wrapv option is necessary to control the behavior of adding nsw to index calculations.
Then, the constraints could be relaxed considering the meaning of -fno-wrapv. The option means nsw is added to operations written in source codes explicitly. Therefore, the constraint regarding bounds of arrays wouldn’t be necessary.
You can however rule out my CSE concern by checking for single usage of the add results.
That looks good, but I suspect that this condition might be too restrictive. An index calculation shared with some arrays wouldn’t have the flag.
I’m considering checking if the uses of the addition are only for index calculations. Additionally, I’m contemplating whether the feature should be implemented in the LoweringBridge or immediately after it to avoid transformations on the IR.
I’ve investigated and it looks like many other compilers do perform this optimization. I modified the loop a bit to rule out any kind of intelligent integer range analysis and they all still vectorized the loop, and so I believe they must be assuming that i+1 does not overflow. This sort of reasoning has been used before when discussing enabling other optimizations by default.
I’m happy so long as Jean’s concerns are addressed.
I separated do-variable increments from the targets of -fno-wrapv and introduced a new option (-flang-experimental-integer-overflow) for them. However, I found that this approach does not align with the behavior of GFortran.
I conducted an experiment with a loop that requires nsw on the do-variable increment for vectorization. Compiler Explorer
Flang vectorizes the loop when -flang-experimental-integer-overflow is specified but not otherwise. On the other hand, GFortran vectorizes the loop when -fno-wrapv is specified but not when -fwrapv is specified.
This indicates that -flang-experimental-integer-overflow should be encompassed within -fno-wrapv. In other words, -fno-wrapv can add nsw to not only “numeric operations” in source codes but also other operations in IR.
Therefore, I think the following patches are needed:
Adding nsw to index calculations.
Introducing -fno-wrapv into Flang (disabled by default).
Integrating -flang-experimental-integer-overflow into -fno-wrapv.
I’ve investigated the behavior of -fno-wrapv of GFortran.
(Please note that I didn’t check the source code of GFortran. I only observed the behavior with some Fortran codes.)
In conclusion, GFortran seems to add the flag to all operations except for ones in arguments of intrinsic functions for bitwise comparison (BGT, BLE, and so on).
Perhaps, constraints are not necessary but the flag in arguments of such intrinsic functions has to be dropped.
Given that you have investigated gfortran in depth, and I tried the previous code example on all of the compilers on Compiler Explorer and found they all vectorize the code (apparently assuming overflow behavior), I think this is worth proceeding with creating a prototype so that we can try this transformation with some existing kernels to double check nothing is broken.
It would be ideal if you could do a similar investigation as you did for gfortran with another fortran compiler that is used more frequently in HPC applications. Perhaps you could provide the same results for Fujitsu’s compiler? From my perspective this isn’t a requirement but it was brought up on the call yesterday.
The fact that gfortran appears to make this assumption in many different cases leads me to think that the prototype does not need all of the restrictions you outlined (in cases where they make the implementation much more difficult as @jeanPerier pointed out).
I still think this should be restricted to arithmetic used for array addressing so that we are not making risky changes where we have no expectation for improvements to code generation. The other restrictions you outlined are welcome too if they make the implementation simpler.
@jeanPerier do you have anything to add regarding your concerns about the pattern matching for the restrictions or shall we proceed to a prototype?
I did a similar investigation using Fujitsu’s compiler and found that our compiler also assumes that integer overflow never occurs in address calculations. However, n+k > n isn’t transformed to k > 0. Therefore, as you mentioned, nsw should be added only to operations in subscripts at this point.
As I mentioned in another thread, I found that nsw on loop bounds could help vectorization. The language standard doesn’t mention about integer overflow of DO statement parameters, but some compilers seem to assume they never overflow. I can say so because the following loop is vectorized without loop versioning.
subroutine s115 (a,n)
integer n, i, j
real a(n)
do j = 1,n
do i = j+1, n
a(i) = a(i) + a(j)
end do
end do
end subroutine
j+1 can be less than j if j+1 overflows, so the loop should be versioned but actually not.
The following is the result of my investigation. I made parameters polynomial forms and checked whether the loop was vectorized.
Compiler Explorer The loop was not vectorized by LLVM even if the operation has the nsw flag.
Considering this, nsw on calculations of initial parameters would be allowed. On the other hand, it is controversial to add nsw to terminal parameters and incrementation parameters. IMHO, flang-new could also allow nsw on them because some other compilers (not all) do so.
We agreed to add nsw only to address calculations to keep transformations safe. However, the above argument violates it. Therefore, I’d like to ask for your opinions.