Bad codegen for handrolled unaligned write

I’ve been looking into how to do an unaligned write without memcpy, and this is what I came up with:
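Roughly along these lines (a simplified sketch; little-endian byte order assumed, and arg32/cons here just mirror the two cases described below):

```cpp
#include <cstdint>

// Handrolled byte-wise store of a 32-bit value to a possibly unaligned address.
void store_u32(void *p, uint32_t v) {
    auto *b = static_cast<unsigned char *>(p);
    b[0] = static_cast<unsigned char>(v);
    b[1] = static_cast<unsigned char>(v >> 8);
    b[2] = static_cast<unsigned char>(v >> 16);
    b[3] = static_cast<unsigned char>(v >> 24);
}

// Two cases: a runtime argument and a compile-time constant (value is a stand-in).
void arg32(void *p, uint32_t v) { store_u32(p, v); }
void cons(void *p)              { store_u32(p, 0x12345678u); }
```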

It looks like the optimizations are not as good as I thought: arg32 does shifts and multiple memory accesses instead of a single mov, and cons does 3 movs instead of one.

GCC also does something weird by putting the constant in .rdata, but it’s still better overall. MSVC is doing its own thing :man_shrugging:.

Can this be done better?

Don’t manually unroll loops: Compiler Explorer. Unfortunately, gcc and clang both go back to doing stupid things with -fno-builtin, but that’s a separate issue.
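That is, give the compiler the plain byte-copy loop and let it collapse it. A sketch of that shape (the function name is just an example):

```cpp
#include <cstdint>
#include <cstddef>

// Copy the bytes of v one at a time; at -O2, gcc and clang typically turn
// this into a single unaligned store (or a recognized memcpy) on targets
// that allow unaligned access.
void store_u32(void *p, uint32_t v) {
    auto *dst = static_cast<unsigned char *>(p);
    auto *src = reinterpret_cast<const unsigned char *>(&v);
    for (std::size_t i = 0; i < sizeof v; ++i)
        dst[i] = src[i];
}
```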

The least worst way to do unaligned load/store AFAIK is using __attribute__((aligned(1))): Compiler Explorer. That works on both gcc and clang, and MSVC if you add __declspec(aligned(1)) or somesuch, but for some reason clang does bytewise copies with the attribute despite producing perfectly normal 4-byte-aligned code with the cast. It’s still better than the horror produced with memcpy though. My guess is that nobody on either gcc or LLVM cares about armv6 codegen because it’s totally obsolete.
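A sketch of the typedef form on gcc/clang (the names are just examples):

```cpp
#include <cstdint>

// A uint32_t whose required alignment is lowered to 1, so loads and stores
// through it are allowed to be unaligned. Adding may_alias as well can help
// if the buffer doesn't actually contain uint32_t objects (strict aliasing).
typedef uint32_t unaligned_u32 __attribute__((aligned(1)));

void store_u32(void *p, uint32_t v) {
    *static_cast<unaligned_u32 *>(p) = v;
}

uint32_t load_u32(const void *p) {
    return *static_cast<const unaligned_u32 *>(p);
}
```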

On an unrelated note, static_cast is sufficient to convert from void *.
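For instance:

```cpp
unsigned char *as_bytes(void *p) {
    return static_cast<unsigned char *>(p);  // void* -> T* needs no reinterpret_cast
}
```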

Should have mentioned: my whole problem is that I need to support code compiled with -fno-builtin. Right now I’m just using memcpy everywhere and that works, but under that flag the compiler emits an actual call even for small sizes, which is really bad.

The attribute idea looks good at first glance, I’ll try that. Thanks!

There’s __builtin_memcpy, and __builtin_memcpy_inline if you always want it inlined.
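For example (a sketch; the __builtin_ spelling isn’t affected by -fno-builtin, since that flag only stops plain memcpy calls from being treated as the builtin, and __builtin_memcpy_inline is a clang extension as far as I know):

```cpp
#include <cstdint>

// For a small constant size this compiles down to a single (unaligned) store
// even under -fno-builtin, because the __builtin_ spelling bypasses the flag.
void store_u32(void *p, uint32_t v) {
    __builtin_memcpy(p, &v, sizeof v);
}
```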
