broken C code only when optimized "-O2"

Hello,

I need some help understanding what might be wrong with a piece of code from the openvswitch project. By ${subject} I'm not suggesting there's a problem in clang, gcc also shows the same behavior so it's likely our code is broken. I am kindly asking for help to understand/troubleshoot the problem.

Summary: It seems that certain interaction between two main openvswitch data structures, when optimized ("-O2 -flto=auto") is broken.
The two data structures are:

hmap: ovs/hmap.h at master · openvswitch/ovs · GitHub
list: ovs/list.h at master · openvswitch/ovs · GitHub

I've reproduced the problem outside of openvswitch daemon using a short C program (attached)

Code snippet:

struct bond {
     struct hmap members;
};

struct member {
     struct hmap_node hmap_node;
     int order;
     struct ovs_list elem;
};

int main() {
     int ret = 0;
     struct member *member, *member1, *member2;
     struct bond *bond;
     struct ovs_list start = {0};

     bond = malloc(sizeof *bond);
     memset(bond, 0, sizeof (struct bond));
     hmap_init(&bond->members);

     member1 = malloc(sizeof *member1);
     member2 = malloc(sizeof *member2);
     memset(member1, 0, sizeof (struct member));
     memset(member2, 0, sizeof (struct member));

     member1->order = 3;
     member2->order = 2;

     hmap_insert(&bond->members, &member1->hmap_node, (uint32_t)(uintptr_t)member1);
     hmap_insert(&bond->members, &member2->hmap_node, (uint32_t)(uintptr_t)member2);

     ovs_list_init(&start);
     HMAP_FOR_EACH (member, hmap_node, &bond->members) {
        /*
         * Insert member in start (sorted)
         * */
        struct member *pos;
        LIST_FOR_EACH (pos, elem, &start) {
            if (member->order > pos->order) {
                break;
            }
        }
        // TESTED: If I add this printf, the problem disappears
        //printf("Inserting member: %p\n", member);
        ovs_list_insert(&pos->elem, &member->elem);
     }

     /* I've inserted two members into the 'start' list.
      * first and last have to be either member1 or member2
      * */
     if ((first != member1 && first != member2) || (last != member1 && last != member2)) {
         printf("list is broken!\n");
     }

}

What I know for now:
* -fno-strict-aliasing does not fix it
* Only happens with "-O2 -flto=auto"
* If I define 'ovs_list *start' and change the code to use the pointer directly and not '&start' the problem disappears. It seems that the LIST_FOR_EACH macros prefer an lvalue rather than "&" but I don't get why.
* I'm not able to reproduce without using hmap _and_ ovs_list.
* If I add a compiler barrier (or a call to an external function) after the loop, the problem disappears (e.g printf), the problem disappears.
* If I add -fsanitize=undefined the problem disappears!

I'd really appreciate any hint or idea to try to understand this problem.

Thanks in advanced.

example.c (3.79 KB)

The thing to do with problems like this is to run with clang’s sanitizers, which diagnose undefined behavior, memory errors, and other such issues that often show up only under optimization. A cursory look at all the casting in this example makes me think there is undefined behavior, but there very well could be another type of failure.

https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

https://clang.llvm.org/docs/AddressSanitizer.html

https://clang.llvm.org/docs/MemorySanitizer.html

Thanks Sterling

The thing to do with problems like this is to run with clang's sanitizers, which diagnose undefined behavior, memory errors, and other such issues that often show up only under optimization. A cursory look at all the casting in this example makes me think there is undefined behavior, but there very well could be another type of failure.

Interestingly, when I enable UndefinedBehaviorSanitizer, my problem disappears (and sanitizer does not report anything). In fact, the issue is quite elisuve, adding a function call, a compiler barrier or not inlining a single function changes the behavior of the program.

MemorySanitizer does print an error that I'm investigating:

==762381==WARNING: MemorySanitizer: use-of-uninitialized-value
[Detaching after fork from child process 764910]
     #0 0x4b1871 in hmap_next__ /home/amorenoz/code/ovs/out/include/openvswitch/hmap.h:379:13
     #1 0x4af81f in hmap_first /home/amorenoz/code/ovs/out/include/openvswitch/hmap.h:391:12
     #2 0x4adf3a in main /home/amorenoz/dev/bugs/2014942/example.c:82:5
     #3 0x7ffff77c055f in __libc_start_call_main (/lib64/libc.so.6+0x2d55f)
     #4 0x7ffff77c060b in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2d60b)
     #5 0x42c244 in _start (/home/amorenoz/devel/bugs/2014942/example_+0x42c244)

   Uninitialized value was stored to memory at
     #0 0x4b17d8 in hmap_next__ /home/amorenoz/code/ovs/out/include/openvswitch/hmap.h:378:27
     #1 0x4af81f in hmap_first /home/amorenoz/code/ovs/out/include/openvswitch/hmap.h:391:12
     #2 0x4adf3a in main /home/amorenoz/dev/bugs/2014942/example.c:82:5
     #3 0x7ffff77c055f in __libc_start_call_main (/lib64/libc.so.6+0x2d55f)

   Uninitialized value was created by a heap allocation
     #0 0x45c942 in malloc (/home/amorenoz/devel/bugs/2014942/example_+0x45c942)
     #1 0x4b4c9d in xmalloc__ /home/amorenoz/code/ovs/lib/util.c:137:15
     #2 0x4b4c9d in xmalloc /home/amorenoz/code/ovs/lib/util.c:172:12

Not for me:

% clang -g -fsanitize=undefined -I/Users/dim/tmp/vswitch/foo/include example.c -o example -L/Users/dim/tmp/vswitch/foo/lib -lopenvswitch

% ./example
start: 0x16ee6f618
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
first: 0x600003200270
last: 0x6000032002a0

It looks like the HMAP_FOR_EACH() macro uses null pointer arithmetic. The problem appears to be in https://github.com/openvswitch/ovs/blob/master/include/openvswitch/util.h#L94 :

/* Given OBJECT of type pointer-to-structure, expands to the offset of MEMBER
* within an instance of the structure.

...

It is quite obvious that the code is wrong.
Can't you spot the problem with this?:
     member1 = malloc(sizeof *member1);
     member2 = malloc(sizeof *member2);
     memset(member1, 0, sizeof (struct member));
     memset(member2, 0, sizeof (struct member));

Sorry, I pasted that code snippet from a semi-uncooked version. Removing the pointer dereference in the sizeof (as in the attached version) has no effect on the problem.

It is quite obvious that the code is wrong.
Can't you spot the problem with this?:

Please keep your comments positive and helpful, in accordance with
the standards of the LLVM community.

     member1 = malloc(sizeof *member1);
     member2 = malloc(sizeof *member2);
     memset(member1, 0, sizeof (struct member));
     memset(member2, 0, sizeof (struct member));

If you're referring to the dereference in (sizeof *member1),
the operand of sizeof is unevaluated, which AFAICT means it
won't trigger undefined behavior for being a null dereference.
--paulr

I need some help understanding what might be wrong with a piece of code from the openvswitch project. By ${subject} I'm not suggesting there's a problem in clang, gcc also shows the same behavior so it's likely our code is broken. I am kindly asking for help to understand/troubleshoot the problem.

Summary: It seems that certain interaction between two main openvswitch data structures, when optimized ("-O2 -flto=auto") is broken.
The two data structures are:

hmap: https://github.com/openvswitch/ovs/blob/master/include/openvswitch/hmap.h
list: https://github.com/openvswitch/ovs/blob/master/include/openvswitch/list.h

I've reproduced the problem outside of openvswitch daemon using a short C program (attached)

Code snippet:

struct bond {
    struct hmap members;
};

struct member {
    struct hmap_node hmap_node;
    int order;
    struct ovs_list elem;
};

int main() {
    int ret = 0;
    struct member *member, *member1, *member2;
    struct bond *bond;
    struct ovs_list start = {0};

    bond = malloc(sizeof *bond);
    memset(bond, 0, sizeof (struct bond));
    hmap_init(&bond->members);

    member1 = malloc(sizeof *member1);
    member2 = malloc(sizeof *member2);
    memset(member1, 0, sizeof (struct member));
    memset(member2, 0, sizeof (struct member));

    member1->order = 3;
    member2->order = 2;

    hmap_insert(&bond->members, &member1->hmap_node, (uint32_t)(uintptr_t)member1);
    hmap_insert(&bond->members, &member2->hmap_node, (uint32_t)(uintptr_t)member2);

    ovs_list_init(&start);
    HMAP_FOR_EACH (member, hmap_node, &bond->members) {
       /*
        * Insert member in start (sorted)
        * */
       struct member *pos;
       LIST_FOR_EACH (pos, elem, &start) {
           if (member->order > pos->order) {
               break;
           }
       }
       // TESTED: If I add this printf, the problem disappears
       //printf("Inserting member: %p\n", member);
       ovs_list_insert(&pos->elem, &member->elem);
    }

    /* I've inserted two members into the 'start' list.
     * first and last have to be either member1 or member2
     * */
    if ((first != member1 && first != member2) || (last != member1 && last != member2)) {
        printf("list is broken!\n");
    }

}

What I know for now:
* -fno-strict-aliasing does not fix it
* Only happens with "-O2 -flto=auto"
* If I define 'ovs_list *start' and change the code to use the pointer directly and not '&start' the problem disappears. It seems that the LIST_FOR_EACH macros prefer an lvalue rather than "&" but I don't get why.
* I'm not able to reproduce without using hmap _and_ ovs_list.
* If I add a compiler barrier (or a call to an external function) after the loop, the problem disappears (e.g printf), the problem disappears.
* If I add -fsanitize=undefined the problem disappears!

Not for me:

% clang -g -fsanitize=undefined -I/Users/dim/tmp/vswitch/foo/include example.c -o example -L/Users/dim/tmp/vswitch/foo/lib -lopenvswitch

Thanks for running it!

% ./example
start: 0x16ee6f618
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
first: 0x600003200270
last: 0x6000032002a0

It looks like the HMAP_FOR_EACH() macro uses null pointer arithmetic. The problem appears to be in https://github.com/openvswitch/ovs/blob/master/include/openvswitch/util.h#L94 :

/* Given OBJECT of type pointer-to-structure, expands to the offset of MEMBER
  * within an instance of the structure.
  *
  * The GCC-specific version avoids the technicality of undefined behavior if
  * OBJECT is null, invalid, or not yet initialized. This makes some static
  * checkers (like Coverity) happier. But the non-GCC version does not actually
  * dereference any pointer, so it would be surprising for it to cause any
  * problems in practice.
  */
#ifdef __GNUC__
#define OBJECT_OFFSETOF(OBJECT, MEMBER) offsetof(typeof(*(OBJECT)), MEMBER)
#else
#define OBJECT_OFFSETOF(OBJECT, MEMBER) \
     ((char *) &(OBJECT)->MEMBER - (char *) (OBJECT))
#endif

The comment is incorrect here, because dereferencing a null pointer, as done in *(OBJECT), is definitely undefined behavior.

Thanks Dimitry for spotting it.

So you were running in a non GNUC system? I was testing on Fedora so we were using different implementations of this macro. I will raise it with the openvswitch community that the non GNUC is being flagged by UBSan as an error.

For the _GNUC_ version, is using *(OBJECT) inside the "typeof" operand also undefined behavior?

Hi,

I have a suspicion of what could be causing the problem, I'd like to run through you to get your feedback from the compiler pov.

One of the things that originally felt smelly was that the fact that the macros that iterate the list elements assume the "struct ovs_list" element is embedded into another "struct member":

struct member *pos = 0;

for ((((pos) = ((struct member *) (void *) ((uintptr_t)(void *)((&start)->next) - __builtin_offsetof (struct member, elem)))));
      &(pos)->elem != (&start);
      ((pos) = ((struct member *) (void *) ((uintptr_t)(void *)((pos)->elem.next) - __builtin_offsetof (struct member , elem)
)))) {
    [... use pos ]
      }

however, the first node in the list ("start") is a "struct ovs_list" defined in the stack and _not_ embedded into a "struct member".

Therefore, "(&start)->next - __builtin_offsetof (struct member, elem)" actually points to somewhere in the stack that contains who knows what. (Note: initially start->next = start)

In fact, if I make the struct offsets zero:

struct member {
     struct ovs_list elem;
     [ ... ]
     int order;
};

... the code works. However if I use:

struct member {
     int padding[10];
     struct ovs_list elem;
     [ ... ]
     int order;
};

... the code fails.

Does anything of what I'm saying make sense so far?

If the code inside the loop just made use of "pos" through "(&pos->elem)" then the compiler could(?) be ok with it but the loop actually contains:
            if (member->order > pos->order) {
                break;
            }

So here I do not know what the compiler would think about "pos" if it happens to point to some invalid stack address.

Thanks for the help.

I need some help understanding what might be wrong with a piece of code from the openvswitch project. By ${subject} I'm not suggesting there's a problem in clang, gcc also shows the same behavior so it's likely our code is broken. I am kindly asking for help to understand/troubleshoot the problem.

...

% ./example
start: 0x16ee6f618
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
first: 0x600003200270
last: 0x6000032002a0
It looks like the HMAP_FOR_EACH() macro uses null pointer arithmetic. The problem appears to be in ovs/util.h at master · openvswitch/ovs · GitHub :
/* Given OBJECT of type pointer-to-structure, expands to the offset of MEMBER
* within an instance of the structure.
*
* The GCC-specific version avoids the technicality of undefined behavior if
* OBJECT is null, invalid, or not yet initialized. This makes some static
* checkers (like Coverity) happier. But the non-GCC version does not actually
* dereference any pointer, so it would be surprising for it to cause any
* problems in practice.
*/
#ifdef __GNUC__
#define OBJECT_OFFSETOF(OBJECT, MEMBER) offsetof(typeof(*(OBJECT)), MEMBER)
#else
#define OBJECT_OFFSETOF(OBJECT, MEMBER) \
    ((char *) &(OBJECT)->MEMBER - (char *) (OBJECT))
#endif
The comment is incorrect here, because dereferencing a null pointer, as done in *(OBJECT), is definitely undefined behavior.

Thanks Dimitry for spotting it.

So you were running in a non GNUC system? I was testing on Fedora so we were using different implementations of this macro. I will raise it with the openvswitch community that the non GNUC is being flagged by UBSan as an error.

Yes, I have run this on macOS 12.1 with Apple clang-1300.0.29.30, and on FreeBSD 14-CURRENT, with clang 13.0.0. In both cases, turning optimization on or off does not make a difference for the UBSan reports. I only added -g to be able to debug.

Note also that neither on macOS nor on FreeBSD do I ever get the program to print "list is broken!". It does not matter which optimization level I choose. So here you see that the system's memory layout will likely have an influence on the result.

Since you told us that adding LTO flags makes the problem appear, maybe this is related? I know that on FreeBSD, none of the libraries you link in are stored in LLVM bitcode (yet, at least:), and I suspect that this is also the case on macOS.

I attached another version of the example, with the external functions from openvswitch inlined, so it should only depend on system headers, and won't need any external libraries. Maybe this makes it easier to reason about the problem. Again, on macOS and FreeBSD this shows UBSan messages, but never "list is broken!".

For the _GNUC_ version, is using *(OBJECT) inside the "typeof" operand also undefined behavior?

(Note that clang also defines __GNUC__ so this block will be used for clang too.)

If OBJECT ever becomes a null pointer, as in the example, then it is undefined.

And I don't think an argument "I'm only doing a typeof(*nullptr), not really accessing it" works... :slight_smile:

-Dimitry

example-inlined.c (10.6 KB)

typeof(*NULL) should be fine, just like sizeof(*NULL), it’s not a “real” dereference.

At least, -Wnull-dereference isn’t emitted for either.

Jess

I need some help understanding what might be wrong with a piece of code from the openvswitch project. By ${subject} I'm not suggesting there's a problem in clang, gcc also shows the same behavior so it's likely our code is broken. I am kindly asking for help to understand/troubleshoot the problem.

...

% ./example
start: 0x16ee6f618
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
example.c:78:5: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior example.c:78:5 in
first: 0x600003200270
last: 0x6000032002a0
It looks like the HMAP_FOR_EACH() macro uses null pointer arithmetic. The problem appears to be in https://github.com/openvswitch/ovs/blob/master/include/openvswitch/util.h#L94 :
/* Given OBJECT of type pointer-to-structure, expands to the offset of MEMBER
* within an instance of the structure.
*
* The GCC-specific version avoids the technicality of undefined behavior if
* OBJECT is null, invalid, or not yet initialized. This makes some static
* checkers (like Coverity) happier. But the non-GCC version does not actually
* dereference any pointer, so it would be surprising for it to cause any
* problems in practice.
*/
#ifdef __GNUC__
#define OBJECT_OFFSETOF(OBJECT, MEMBER) offsetof(typeof(*(OBJECT)), MEMBER)
#else
#define OBJECT_OFFSETOF(OBJECT, MEMBER) \
    ((char *) &(OBJECT)->MEMBER - (char *) (OBJECT))
#endif
The comment is incorrect here, because dereferencing a null pointer, as done in *(OBJECT), is definitely undefined behavior.

Thanks Dimitry for spotting it.

So you were running in a non GNUC system? I was testing on Fedora so we were using different implementations of this macro. I will raise it with the openvswitch community that the non GNUC is being flagged by UBSan as an error.

Yes, I have run this on macOS 12.1 with Apple clang-1300.0.29.30, and on FreeBSD 14-CURRENT, with clang 13.0.0. In both cases, turning optimization on or off does not make a difference for the UBSan reports. I only added -g to be able to debug.

Note also that neither on macOS nor on FreeBSD do I ever get the program to print "list is broken!". It does not matter which optimization level I choose. So here you see that the system's memory layout will likely have an influence on the result.

Since you told us that adding LTO flags makes the problem appear, maybe this is related? I know that on FreeBSD, none of the libraries you link in are stored in LLVM bitcode (yet, at least:), and I suspect that this is also the case on macOS.

I attached another version of the example, with the external functions from openvswitch inlined, so it should only depend on system headers, and won't need any external libraries. Maybe this makes it easier to reason about the problem. Again, on macOS and FreeBSD this shows UBSan messages, but never "list is broken!".

Thanks very much Dimitry. With your inlined example clang does work but gcc still fails. So clearly the linking does influence.

For the _GNUC_ version, is using *(OBJECT) inside the "typeof" operand also undefined behavior?

(Note that clang also defines __GNUC__ so this block will be used for clang too.)

If OBJECT ever becomes a null pointer, as in the example, then it is undefined.

And I don't think an argument "I'm only doing a typeof(*nullptr), not really accessing it" works... :slight_smile:

typeof(*NULL) should be fine, just like sizeof(*NULL), it’s not a “real” dereference.

At least, -Wnull-dereference isn’t emitted for either.

Jess

After manually expanding the macros and replacing all the typeof(*nullptr) with the actual type, the behavior of both inlined and statically linked versions remains the same.

What UBSan is flagging is the use of:
((void*)0)) - __builtin_offsetof(struct member, hmap_node))))

So we clearly need to change that to avoid the potential undefined behavior

Also tried a patch based on the theory I've posted in the other subthread (that one possible issue is that I'm using a "struct ovs_list" allocated in the stack and treating them as if it was embedded inside a "struct member", which generates potential accesses to arbitrary stack addresses). The patch below fixes both clang and gcc in all optimization levels.

But I wouldn't like to stop the investigation until I understand all the potential problems (these macros and data structures are wildly used in openvswitch).

--- example-inlined.c 2021-12-22 16:05:12.584459052 +0100
+++ example-inlined-patched.c 2021-12-22 15:55:21.168212637 +0100
@@ -318,7 +320,9 @@
      //struct ovs_list sstart;
      //struct ovs_list start = &start;

- struct ovs_list start = {0};
+ // Using a member to hold the ovs_list start
+ struct member start_member = {0};
+ struct ovs_list *start = &start_member.elem;

      bond = malloc(sizeof (struct bond));
      memset(bond, 0, sizeof (struct bond));
@@ -334,14 +338,14 @@
      hmap_insert(&bond->members, &member1->hmap_node, (uint32_t)(uintptr_t)member1);
      hmap_insert(&bond->members, &member2->hmap_node, (uint32_t)(uintptr_t)member2);

- printf("start: %p\n", &start);
- ovs_list_init(&start);
+ printf("start: %p\n", start);
+ ovs_list_init(start);
      HMAP_FOR_EACH (member, hmap_node, &bond->members) {
         /*
          * Insert member in start (sorted)
          * */
         struct member *pos;
- LIST_FOR_EACH (pos, elem, &start) {
+ LIST_FOR_EACH (pos, elem, start) {
             if (member->order > pos->order) {
                 break;
             }
@@ -351,13 +355,14 @@
         ovs_list_insert(&pos->elem, &member->elem);
      }