Lowering llvm.memset for ARM target

As reported in an earlier thread
(http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html),
we noticed in some cases that the llvm.memset intrinsic, if lowered to
stores, could help with performance.

Here's a test case: If LIMIT is > 8, I see that a call to memset is
emitted for arm & aarch64, but not for x86 target.

typedef struct {
    int v0[100];
} test;
#define LIMIT 9
void init(test *t)
{
    int i;
    for (i = 0; i < LIMIT ; i++)
      t->v0[i] = 0;
}
int main() {
test t;
init(&t);
return 0;
}

Looking at the llvm sources, I see that there are two key target
specific variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize,
that determine if the intrinsic llvm.memset can be lowered into store
operations. For ARM, these variables are set to 8 and 4 respectively.

I do not know as to how the default values for these two variables are
arrived at, but doubling these values (similar to that for the x86
target) seems to help our case and we observe a 7% increase in
performance of our networking application. We use -O3 and -flto and
32-bit arm.

I can prepare a patch and post for review if such a change, say under
CodeGenOpt::Aggressive would be acceptable.

Thanks,
Bharathi

Hi Bharathi,

MaxStoresPerMemset was changed from 16 to 8 in r 169791. The commit comment:

"Some enhancements for memcpy / memset inline expansion.
1. Teach it to use overlapping unaligned load / store to copy / set the trailing
   bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte copies.
2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is. e.g.
   x86 and ARM.
3. When memcpy from a constant string, do *not* replace the load with a constant
   if it's not possible to materialize an integer immediate with a single
   instruction (required a new target hook: TLI.isIntImmLegal()).
4. Use unaligned load / stores more aggressively if target hooks indicates they
   are "fast".
5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
   Also increase the threshold to something reasonable (8 for memset, 4 pairs
   for memcpy).

This significantly improves Dhrystone, up to 50% on ARM iOS devices.

rdar://12760078"

It's strange. According to the comment the threshold was increased but it is decreased. I think the code needs to be revisited and benchmarked.
I'll do some benchmarking.

Thanks,
Evgeny Astigeevich | Arm Compiler Optimization Team Lead

Hi Bharathi,

From the discussion you provided I found that the issue happens for a big-endian ARM target.
For the little-endian target the intrinsic in your test case is lowered to store instructions.
Some debugging is needed to figure out why it's not happening for big-endian.

-Evgeny

Hi Bharathi,

'-mfpu=vfp ' is the root cause of the problem. It means: VFPv2, disabled Advanced SIMD extension.
The intrinsic is lowered into stores only when Advanced SIMD extension is enabled. So, if your target supports the Advanced SIMD extension, the workaround is '-mfpu=neon'.
I'll check what is happening when the Advanced SIMD extension is disabled.

Thanks,
Evgeny

Hi Bharathi,

I did some debugging, the current problem is that the same threshold values are used for SIMD and non-SIMD memory instructions.
For the test you provided, when the SIMD extension is disabled the current implementation of llvm.memset lowering finds out 9 store instructions will be required. As 9 > 8 llvm.memset is lowered to a call.
When the SIMD is enabled, it finds out 3 stores (two vst1.32 + one str) will be enough. As 3 < 8 llvm.memset is lowered to a sequence of stores.

So before changing the threshold values we need to figure out:

1. Do we need separate thresholds for SIMD and non-SIMD memory instructions? For example, 8 for SIMD and 10-16 for non-SIMD. Some benchmarking is needed to find proper value.
2. If we keep the single value, how much it should be increased? This might affect performance of SIMD using applications. So benchmarking again.
3. If STRD are used then only 5 instructions are needed and llvm.memset is lowered as expected. I don’t know why the variant with STRD is not considered, maybe to avoid register pressure.

Hope this helps.

Thanks,
Evgeny Astigeevich

We probably do want to lower memset to strd/stm when we don't have NEON. Patch welcome. :slight_smile: (Note that in Thumb2, we often form strd anyway in ARMLoadStoreOptimizer, but we don't do it reliably, and we probably want to do something similar in ARM, and maybe Thumb1.)

-Eli

Hi Eli,

Thank you for information. Now it’s clear what to do.
BTW I tried to apply Thumb2 to the test and it didn’t work. It means we have more motivation to fix this (.

Thanks,
Evgeny