Optimization issue - how do I get Clang to use general-purpose registers with loop unrolling?

I am trying to get the compiler to unroll the loop in a lookahead guard mechanism for a recursive descent parser.

Here is the code :-

And here's what Godbolt is giving me for Clang. It's using xmm registers, when I really want this to boil down to a 64-bit AND and a JNZ :-

GCC is giving me alright results for this :-

Could someone who knows about Clang optimization have a look at this, please?

Many thanks in advance,


Okay I think I have rationalized things a touch :-

Both come down to a TEST instruction, near enough :-

MS Visual Studio C++ is still a nightmare, though :-

Solved for MS VS C++ on the command line: I am getting similar results, although not as good as Clang or GCC.

I am now looking at a proper implementation in C++, and also in C.
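
For anyone landing here, this is a minimal sketch of the kind of guard I mean, not my actual code. It assumes there are at most 64 token kinds, so a set of acceptable lookahead tokens fits in one 64-bit mask and the guard is a single AND plus a compare against zero. The names `Token`, `bit`, `make_set`, `guard` and `kExprFirst` are hypothetical and only for illustration. It needs C++14 or later because of the loop inside the `constexpr` function.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical token kinds; every value must be below 64 so it fits in the mask.
enum Token : unsigned {
    TOK_IDENT  = 0,
    TOK_NUMBER = 1,
    TOK_LPAREN = 2,
    TOK_RPAREN = 3,
    TOK_PLUS   = 4,
    TOK_EOF    = 5
};

// One bit per token kind.
constexpr std::uint64_t bit(Token t) {
    return std::uint64_t{1} << t;
}

// Build the mask at compile time, so no loop is left to unroll at run time.
constexpr std::uint64_t make_set(std::initializer_list<Token> toks) {
    std::uint64_t mask = 0;
    for (Token t : toks) {
        mask |= bit(t);
    }
    return mask;
}

// Assumed example: tokens that may start an expression.
constexpr std::uint64_t kExprFirst = make_set({TOK_IDENT, TOK_NUMBER, TOK_LPAREN});

// The guard itself: true if the lookahead token is a member of the set.
inline bool guard(Token lookahead, std::uint64_t set) {
    return (bit(lookahead) & set) != 0;
}
```

A parser function would then start with something like `if (guard(lookahead, kExprFirst)) { ... }`. The intent is that this reduces to a shift, an AND/TEST and a conditional jump, but the exact instructions still depend on the compiler and the optimization flags, so it is worth checking on Godbolt.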