'invalid subroutine type ref' when linking custom metadata

With 3.9, llvm-link tells me 'invalid subroutine type ref' when
linking the two code pieces below, and I don't quite understand why.
It looks like it merges the debug information with the custom
metadata. I've filed a ticket already [1] but as I'm not sure if this
is indeed a bug or if I'm misunderstanding something, I thought I'd
ask here.

Any ideas? Thanks,

Robin

    > cat a.ll
    define void @a() {
      ret void
    }
    
    !bar = !{!3}
    
    !3 = !{void ()* @a}
    
    > cat b.ll
    define void @foo() !dbg !20 {
      ret void
    }
    
    !llvm.module.flags = !{!17, !18}
    
    !0 = distinct !DICompileUnit(language: DW_LANG_C89, file: !1)
    !1 = !DIFile(filename: "x.c", directory: "/tmp")
    !17 = !{i32 2, !"Dwarf Version", i32 4}
    !18 = !{i32 2, !"Debug Info Version", i32 3}
    !20 = distinct !DISubprogram(name: "foo", scope: !1, file: !1, line: 92, type: !21, unit: !0)
    !21 = !DISubroutineType(types: !22)
    !22 = !{null}
    !23 = !{}
    
    > llvm-link -o out a.ll b.ll
    invalid subroutine type ref
    !5 = !DISubroutineType(types: !0)
    !0 = !{void ()* @a}
    void ()* @a
    All DICompileUnits must be listed in llvm.dbg.cu
    llvm-link: error: linked module is broken!

(The *.ll files are stripped down from much larger files to find a
minimal example. I suppose the message about llvm.dbg.cu is just
because I removed a bit too much).

Robin

[1] https://llvm.org/bugs/show_bug.cgi?id=28697

+CC Duncan```

With 3.9, llvm-link tells me 'invalid subroutine type ref' when
linking the two code pieces below, and I don't quite understand why.
It looks like it merges the debug information with the custom
metadata. I've filed a ticket already [1] but as I'm not sure if this
is indeed a bug or if I'm misunderstanding something, I thought I'd
ask here.

Any ideas?

This is a bug, and I believe it is caused by r266579.

It seems that:

1) The module for a.ll is created.
2) "!3 = !{void ()* @a}" is a MDTuple, and MDTuple nodes are leaking to the context and survive module deletion.
3) "void ()* @a” which is the only element in this MDTuple is a “ConstantAsMetadata”, this does not leak and will be deleted with the module.
4) a.ll is linked into an empty module, using an IRMover. This process will involved populating a map of Metadata -> mapped Metadata. One entry will be created with this MDTuple !3.
5) a.ll is destroyed, including the “ConstantAsMetadata”.
6) b.ll is parsed, the “null” constant will get the same pointer as the previously deleted constant from a.ll.
7) "!22 = !{null}” is parsed. Since this is a “uniqued” node, we are looking in the context for a MDTuple with the null constant as a single operand. We find the previously "!3 = !{void ()* @a}”, and this is fine since it nows point to the new “null” constant”.
8) We reuse the map of metadata to link b.ll, however this map is no longer valid since the MDTuple entry for !3 is outdated and will map to the moved constant "void ()* @a”.

Duncan: can you confirm that this analysis makes sense to you?
Is it an oversight of your commit in r266579?

Thanks,

Thanks, Medi, I can confirm that if I disable the cache with the hack
below, it works fine.

Robin

diff --git a/lib/Linker/IRMover.cpp b/lib/Linker/IRMover.cpp
index 4935868..7d3d0b1 100644
--- a/lib/Linker/IRMover.cpp
+++ b/lib/Linker/IRMover.cpp
@@ -490,11 +490,11 @@ public:
                &GValMaterializer),
         AliasMCID(Mapper.registerAlternateMappingContext(AliasValueMap,
                                                          &LValMaterializer)) {
- ValueMap.getMDMap() = std::move(SharedMDs);
+ /* ValueMap.getMDMap() = std::move(SharedMDs); */
     for (GlobalValue *GV : ValuesToLink)
       maybeAdd(GV);
   }
- ~IRLinker() { SharedMDs = std::move(*ValueMap.getMDMap()); }
+ ~IRLinker() { /* SharedMDs = std::move(*ValueMap.getMDMap()); */ }

   Error run();
   Value *materialize(Value *V, bool ForAlias);

+CC Duncan```

With 3.9, llvm-link tells me 'invalid subroutine type ref' when
linking the two code pieces below, and I don't quite understand why.
It looks like it merges the debug information with the custom
metadata. I've filed a ticket already [1] but as I'm not sure if this
is indeed a bug or if I'm misunderstanding something, I thought I'd
ask here.

Any ideas?

This is a bug, and I believe it is caused by r266579.

It seems that:

1) The module for a.ll is created.
2) "!3 = !{void ()* @a}" is a MDTuple, and MDTuple nodes are leaking to the context and survive module deletion.
3) "void ()* @a” which is the only element in this MDTuple is a “ConstantAsMetadata”, this does not leak and will be deleted with the module.
4) a.ll is linked into an empty module, using an IRMover. This process will involved populating a map of Metadata -> mapped Metadata. One entry will be created with this MDTuple !3.
5) a.ll is destroyed, including the “ConstantAsMetadata”.
6) b.ll is parsed, the “null” constant will get the same pointer as the previously deleted constant from a.ll.

Ugh. This behaviour is ugly.

7) "!22 = !{null}” is parsed. Since this is a “uniqued” node, we are looking in the context for a MDTuple with the null constant as a single operand. We find the previously "!3 = !{void ()* @a}”, and this is fine since it nows point to the new “null” constant”.
8) We reuse the map of metadata to link b.ll, however this map is no longer valid since the MDTuple entry for !3 is outdated and will map to the moved constant "void ()* @a”.

Duncan: can you confirm that this analysis makes sense to you?
Is it an oversight of your commit in r266579?

This sounds correct to me :(. Unfortunately the shared map is quite important for performance.

Once possible fix is to make the mapped-value a TrackingMDNodeRef, but that seems like a hack. Is there a way to invalidate just the right parts of the cache?

Another possible solution, that starts with a history lesson: before I split Metadata from the Value hierarchy, when an MDNode operand changed to null, the MDNode would drop uniquing. I didn't understand why that would be useful outside of speeding up teardown. Since it was actively harmful in some cases, I removed the behaviour, and tried to optimize teardown in other ways.

But I can see how it would make sense here. Maybe, when an MDNode operand that is a ConstantAsMetadata gets RAUW'ed to "null" and/or deleted, we should drop uniquing from the MDNode. This is a more targeted version of the historical behaviour. It would prevent the SharedMDs entry from being reused in this case.

The language reference says: "identified [structure] types are never
uniqued". But llvm-link seems to do just that at link-time:

    > cat x.ll

    %A = type {}
    %B = type {}
    %C = type { %A*, %B* }

    define void @foo(%C* %c) {
      ret void
    }

    > llvm-link x.ll | llvm-dis
    ; ModuleID = '<stdin>'
    source_filename = "llvm-link"

    %C = type { %A*, %A* }
    %A = type {}

    define void @foo(%C* %c) {
        ret void
    }

Does the statement in the language reference not apply to linking?

(This is with a recent 3.9).

Robin

Thanks for your thoughts, Duncan, do you think this is something that
could still be addressed for 3.9?

Also, I just filed another ticket for 2nd problem I'm seeing:
28781 – "Add DWARF path discriminators" is corrupting the IR Probably unrelated, but at
least it seems to be another issue involving llvm-link and debug
information.

Thanks,

Robin

Any suggestions? It's not a problem for me, I'm mostly curious as it
makes it harder to read the IR. Feel free to point me to other
documentation if I'm missing something. Thanks,

Robin

Rafael knows the most about linking types.

IIUC, it's intentional that types such as these get merged at link time. LangRef must be referring to pass pipelines and (de-)serialization. Maybe you could put together a patch to make it more clear?

I imagine, once type-less pointers are finished, we might start uniquing identified types in all cases.