Liveness of AL, AH and AX in x86 backend

I'm trying to see how the x86 backend deals with the relationship between AL, AH and AX, but I can't get it to generate any code that would expose an interesting scenario.

For example, I wrote this piece:

typedef struct {
   char x, y;
} struct_t;

struct_t z;

struct_t foo(char *p) {
   struct_t s;
   s.x = *p++;
   s.y = *p;
   z = s;
   s.x++;
   return s;
}

But the output at -O2 is

foo: # @foo
         .cfi_startproc
# BB#0: # %entry
         movb (%rdi), %al
         movzbl 1(%rdi), %ecx
         movb %al, z(%rip)
         movb %cl, z+1(%rip)
         incb %al
         shll $8, %ecx
         movzbl %al, %eax
         orl %ecx, %eax
         retq
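Incidentally, the shll/movzbl/orl tail above is just materializing AH:AL into AX by hand. As a sketch of what that sequence computes (plain C for illustration only, not LLVM code):

```c
#include <stdint.h>

/* Illustrative only: the compiler's shll $8 / movzbl / orl tail
   combines the two byte values into the 16-bit return register,
   i.e. AX = AH:AL. */
static uint16_t merge_ax(uint8_t al, uint8_t ah) {
    return (uint16_t)(((uint16_t)ah << 8) | al);
}
```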

I was hoping it would do something along the lines of

   movb (%rdi), %al
   movb 1(%rdi), %ah
   movw %ax, z(%rip)
   incb %al
   retq

Why is the x86 backend not getting this code? Does it know that AH:AL = AX?

-Krzysztof

Try using x86 mode rather than Intel64 mode. I have definitely gotten it to use both ah and al in 32-bit x86 code generation.
In particular, I have seen that in loops in both the spec2000 and spec2006 versions of bzip. It can happen, but only rarely.

Kevin Smith

I changed the triple to i386 and the CPU to pentium, but I still didn't get the good code.

foo: # @foo
         .cfi_startproc
# BB#0: # %entry
         movl 4(%esp), %eax
         movb (%eax), %dl
         movzbl 1(%eax), %ecx
         movb %dl, z
         movb %cl, z+1
         incb %dl
         shll $8, %ecx
         movzbl %dl, %eax
         orl %ecx, %eax
         retl

-Krzysztof

On several variants of x86 processors, mixing ah, al and ax as source/destination in the same dependency chain carries penalties, so for THOSE processors there is a benefit to NOT using al and ah to reflect parts of ax. I believe this is because the processor doesn’t ACTUALLY track these as parts of a bigger register internally, and will execute two independent dependency chains, UNTIL you start using ax as one register. At that point, the processor has to make sure that both of the dependency chains, for al and for ah, have completed, and that the merged value is available in ax. If the code uses cl and al instead, this sort of problem is avoided.
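To sketch the difference (hand-written assembly for illustration only, not compiler output):

```
        movb (%rdi), %al      # def of al -- one dependency chain
        movb 1(%rdi), %ah     # def of ah -- an independent chain
        movw %ax, z(%rip)     # read of ax: both chains must be merged first

        # versus the merge-free pattern:
        movb (%rdi), %al
        movb 1(%rdi), %cl     # different registers, no partial-register merge
        movb %al, z(%rip)
        movb %cl, z+1(%rip)
```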

<<Quote from Intel Optimisation guide, page 3-44
http://www.intel.co.uk/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions. For example, partial register stalls occur with a read to AX while previous instructions stored AL and AH, or a read to EAX while a previous instruction modified AX.

The delay of a partial register stall is small in processors based on Intel Core and NetBurst microarchitectures, and in Pentium M processors (with CPUID signature family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M processors (CPUID signature with family 6, model 9) and the P6 family incur a large penalty.

>>

So for compact code, yes, it’s probably an advantage. For SOME processors in the x86 range, not so good for performance.

Whether LLVM has the information as to WHICH processor models have such penalties (or better yet, can determine the amount of time lost for this sort of operation), I’m not sure. It’s obviously something that CAN be programmed into a compiler, it’s just a matter of understanding the effort vs. reward factor for this particular type of optimisation, compared to other things that could be done to improve the quality of the code generated.

Then let me shift focus from performance to size. With either optsize or minsize, the output is still the same.

As per the subject, I'm not really interested in the quality of the final code, but in the way that the x86 target deals with the structural relationship between these registers. Specifically, I'd like to see if it would generate implicit defs/uses for AX on defs/uses of AH/AL. I looked in the X86 sources and I didn't find code that would make me certain, but I'm not too familiar with that backend. Having a testcase to work with would make it a lot easier for me.

-Krzysztof

What I was trying to say is that in the past, this sort of construct could be quite significantly worse than the “ideal” code (avoiding such partial register things), and as such I expect some effort has gone into AVOIDING using the al and ah in ways that lead to ax being used as a combination thereof. Whilst it may be detrimental for code-size, it’s possible that this is so ingrained in the code that it’s hard to avoid it producing code like what you posted.

Of course, I’m only speculating, and there may be other answers. I’m NOT an “x86-backend guy” (nor any other kind of backend guy, for that matter - the little I know about compilers is from working on the frontend). I do, however, have a fair understanding both of processor architectures and of the optimisation of x86 code in general, from having worked within AMD for about 10 years and at other places where code generation and optimisation are important. I’m not so hot on the latest x86 processors, as I’ve been doing most of my recent work with mobile phone systems (not the x86-based variety). But the “partial register stall” is quite old, so I do remember that.

Hi Krzysztof,

I'm trying to see how the x86 backend deals with the relationship between AL, AH and AX, but I can't get it to generate any code that would expose an interesting scenario.

For example, I wrote this piece:

typedef struct {
char x, y;
} struct_t;

struct_t z;

struct_t foo(char *p) {
struct_t s;
s.x = *p++;
s.y = *p;
z = s;
s.x++;
return s;
}

But the output at -O2 is

foo: # @foo
       .cfi_startproc
# BB#0: # %entry
       movb (%rdi), %al
       movzbl 1(%rdi), %ecx
       movb %al, z(%rip)
       movb %cl, z+1(%rip)
       incb %al
       shll $8, %ecx
       movzbl %al, %eax
       orl %ecx, %eax
       retq

I was hoping it would do something along the lines of

movb (%rdi), %al
movb 1(%rdi), %ah
movw %ax, z(%rip)
incb %al
retq

Why is the x86 backend not getting this code?

Try enabling the sub-register liveness feature. I am guessing we think we cannot use the same register for the low and high part.
Though, I would need to see the machine instrs to be sure.
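If you want to dump them yourself, something along these lines should work (flag names from memory, so double-check against your llc):

```
llc -mtriple=i386-unknown-linux -mcpu=pentium \
    -enable-subreg-liveness -print-after-all foo.ll
```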

Does it know that AH:AL = AX?

Yes it does.

Cheers,
-Quentin

Enabling subreg liveness tracking didn't do anything. By altering the allocation order I managed to get the backend to use CL/CH for the struct, but the stores were still separate (even though storing CX would be correct)...

Here's another question that falls into the same category:

The function X86InstrInfo::loadRegFromStackSlot does not append any implicit uses/defs. How does it know that it won't need them? If AX was spilled in the middle of a live range of EAX, wouldn't restoring of AX need to implicitly define EAX?
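Concretely, the scenario I have in mind is something like this (pseudocode, not actual emitted spill code):

```
EAX<def> = ...        ; EAX live across the spill
spill AX              ; only the 16-bit part is spilled
...
AX = reload           ; without an implicit def of EAX, doesn't the
                      ; reload appear to leave the high half undefined?
```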

We deal with such cases a lot in the Hexagon backend and it continues to be a major pain. I'm trying to understand if there are better options for us.

-Krzysztof

I'll try to find the example code from bzip. I ran across this exact situation when doing work on X86FixupBWInsts.cpp. It can happen, but the register allocator
doesn't seem to want to do it until it has run out of the low-order byte registers. I suspect that this is simply due to the register allocation order preferences, and
yes, the upper byte registers definitely were to be avoided in past x86 architectures, and still are, although to a lesser extent in current-generation x86 architectures.

Kevin Smith

Hi,

Could you use "MIR" to forge the example you're looking for?

Here's some of the generated code from the current community head for bzip2.c from spec 256.bzip2, with these options:

clang -m32 -S -O2 bzip2.c

.LBB14_4: # %bsW.exit24
        subl %eax, %ebx
        addl $8, %eax
        movl %ebx, %ecx
        movl %eax, bsLive
        shll %cl, %edi
        movl %ebp, %ecx
        orl %esi, %edi
        movzbl %ch, %esi
        cmpl $8, %eax
        movl %edi, bsBuff
        jl .LBB14_6

As you can see, it is using both cl and ch for different values in this basic block. This occurs in the generated code for the routine bsPutUInt32.

Kevin Smith

I have thought of that. At the moment I'm experimenting with tracking subregister liveness on Hexagon. It looks like it may be what we need, although it causes some testcase failures.

If the subregister liveness is not the way to go, I'll have to try MIR.

-Krzysztof

Thanks Kevin. This isn't exactly what I'm looking for, though. ECX is explicitly defined here and CL/CH are only used. I was interested in the opposite situation, where the sub-registers are defined separately and then the super-register is used as a whole.

Hopefully the sub-register liveness tracking is what I need, so the questions about x86 may become moot.

-Krzysztof

Sorry, I misunderstood what you were looking for then. I have never seen the x86 backend create separate defs of
ah, al, and then use ax as the combined value. I tried to come up with such cases and never found any. I suspect
that current CG can never generate such code.

Kevin

Enabling subreg liveness tracking didn’t do anything. By altering the allocation order I managed to get the backend to use CL/CH for the struct, but the stores were still separate (even though storing CX would be correct)…

Here’s another question that falls into the same category:

The function X86InstrInfo::loadRegFromStackSlot does not append any implicit uses/defs. How does it know that it won’t need them? If AX was spilled in the middle of a live range of EAX, wouldn’t restoring of AX need to implicitly define EAX?

Doing that would say that we override the other lanes of EAX, which is not what we want. In what cases do we need to add those implicit operands?

Also, IIRC, right now, even with sub register liveness enabled, we would spill the whole register. The subreg liveness only gives us more precise interference, but does not affect splitting or spilling.

If you had
   AL<def> = ...
   AH<def> = ...
   ... = AX

you'd need implicit uses/defs to define AX. This sort of thing happens on Hexagon very often: general purpose registers can be paired into 64-bit registers (and used as a whole in 64-bit instructions) and it is not uncommon that the elements of the pair will be defined individually.

In the above case you'd need something like
   AL<def> = ..., AX<imp-def>
   AH<def> = ..., AX<imp-def>, AX<imp-use>
   ... = AX

I was trying to replicate a similar situation in the X86 backend to see what it would do. However, this is not needed anymore, because subregister liveness tracking looks very promising in eliminating this problem altogether. Now, if only the anti-dep problem was fixed, things would look peachy... :)

-Krzysztof

Doing that would say that we override the other lanes of EAX, which is
not what we want. In what cases do we need to add those implicit operands?

If you had
AL<def> = ...
AH<def> = ...
... = AX

you'd need implicit uses/defs to define AX.

I see. We would need that support if we would spill the value only partially, which we don’t :).

This sort of thing happens on Hexagon very often: general purpose registers can be paired into 64-bit registers (and used as a whole in 64-bit instructions) and it is not uncommon that the elements of the pair will be defined individually.

In the above case you'd need something like
AL<def> = ..., AX<imp-def>
AH<def> = ..., AX<imp-def>, AX<imp-use>
... = AX

I was trying to replicate a similar situation in the X86 backend to see what it would do.

Sounds hard to reproduce.

However, this is not needed anymore, because subregister liveness tracking looks very promising in eliminating this problem altogether. Now, if only the anti-dep problem was fixed, things would look peachy... :)

Heh.