Inconsistencies or intended behaviour of LLVM IR?

Hello everyone!

I've recently had a chance to familiarize myself with the nitty-gritty details of LLVM IR. It has been a great learning experience, sometimes frustrating or confusing but mostly rewarding.

There are a few cases I've come across which seem odd to me. I've tried to cross-reference them with the language specification and the source code to the best of my abilities, but would like to reach out to an experienced crowd with a few questions.

Could you help me out by taking a look at these examples? To my novice eyes they seem to highlight inconsistencies in LLVM IR (or the reference implementation), but it is quite likely that I've overlooked something. Please help me out.

Note: the example source files have been attached and a copy is made available at https://github.com/mewplay/ll

* Item 1 - named pointer types

It is possible to create a named array pointer type (and many others), but not a named structure pointer type. E.g.

%x = type [1 x i32]* ; valid.
%x = type {i32}* ; invalid.

Is this the intended behaviour? Attaching a.ll, b.ll, c.ll and d.ll for reference. All files except d.ll compile without error using clang version 3.5.1 (tags/RELEASE_351/final).

> $ clang d.ll
> d.ll:3:16: error: expected top-level entity
> %x = type {i32}*
> ^
> 1 error generated.

Does it have anything to do with type equality? (just a hunch)

* Item 2 - equality of named types

A named integer type is equivalent to its literal type counterpart, but the same is not true for named and literal structures. I am certain that I've read about this before, but can't seem to locate the right section of the language specification; could anyone point me in the right direction? Also, what is the motivation behind this decision? I've skimmed over the code which handles named structure types (in lib/IR/core.cpp), but would love to hear the high level idea.

Attaching e.ll, f.ll, g.ll and h.ll for reference. All compile just fine except h.ll, which produces the following error message (using the same version of clang as above):

> $ clang h.ll
> h.ll:10:23: error: argument is not of expected type '%x = type { i32 }'
> call void (%x)* @foo({i32} {i32 0})
> ^
> 1 error generated.

* Item 3 - zero initialized common linkage variables

According to the language specification common linkage variables are required to have a zero initializer [1]. If so, why are they also required to provide an initial value?

Attaching i.ll and j.ll for reference. Both compile just fine, and once executed i.ll returns 37 and j.ll returns 0. If the common linkage variable @x was not initialized to 0, j.ll would have returned 42.

* Item 4 - constant common linkage variables

The language specification states that common linkage variables may not be marked as constant [1]. The parser doesn't seem to enforce this restriction. Would doing so cause any problems?

Attaching k.ll and l.ll for reference. Both compile just fine, but once executed k.ll returns 37 (i.e. the constant variable was overwritten) while l.ll segfaults as expected when it tries to overwrite a read-only memory location.

* Item 5 - appending linkage restrictions

An extract from the language specification [1]:

> "appending" linkage may only be applied to global variables of pointer to array type.

Similarly to item 4, this restriction isn't enforced by the parser. Would it make sense to do so, or is there any problem with such an approach?

* Item 6 - hash token

The hash token (#) is defined in lib/AsmParser/LLToken.h (release version 3.5.0 of the LLVM source code) but doesn't seem to be used anywhere else in the source tree. Is this token a historical artefact or does it serve a purpose?

* Item 7 - backslash token

Similarly to item 6, the backslash token doesn't seem to serve a purpose (with regard to release version 3.5.0 of the LLVM source code). Is it used somewhere?

* Item 8 - quoted labels

A comment in lib/AsmParser/LLLexer.cpp (once again, release version 3.5.0 of the LLVM source code) describes quoted labels using the following regexp (i.e. at least one character between the double quotes):

> /// QuoteLabel "[^"]+":

In contrast the reference implementation accepts quoted labels with zero or more characters between the double quotes. Which is to be trusted? The comment makes more sense as the variable name would effectively be blank otherwise.
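The gap between the comment and the implementation can be sketched with two regular expressions. This is a minimal Python illustration, not the lexer's actual code: `COMMENT_RE` transcribes the documented pattern, while `ACCEPTED_RE` is an assumption that models the observed behaviour (zero or more characters between the quotes). The helper name `classify` is hypothetical.

```python
import re

# Pattern as documented in the LLLexer.cpp comment: at least one character.
COMMENT_RE = re.compile(r'"[^"]+":')
# Assumed model of the observed behaviour: zero or more characters.
ACCEPTED_RE = re.compile(r'"[^"]*":')


def classify(label):
    """Report which of the two patterns matches the whole label token."""
    return {
        "comment": COMMENT_RE.fullmatch(label) is not None,
        "accepted": ACCEPTED_RE.fullmatch(label) is not None,
    }


if __name__ == "__main__":
    for token in ['"foo":', '"":']:
        print(token, classify(token))
```

Only the empty label `"":` separates the two patterns, which is exactly the case in question.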

* Item 9 - undocumented calling conventions

The following calling conventions are valid tokens but not described in the language reference as of revision 223189:

intel_ocl_bicc, x86_stdcallcc, x86_fastcallcc, x86_thiscallcc, x86_vectorcallcc, arm_apcscc, arm_aapcscc, arm_aapcs_vfpcc, msp430_intrcc, ptx_kernel, ptx_device, spir_kernel, spir_func, x86_64_sysvcc, x86_64_win64cc, ghccc

Lastly I'd just like to thank the LLVM developers for all the time and hard work they've put into this project. I'd especially like to thank you for providing a language specification along side of the reference implementation! Keeping it up to date is a huge task, but also hugely important. Thank you!

Kind regards
/Robin Eklind

[1]: http://llvm.org/docs/LangRef.html#linkage-types

Attachments: a.ll (169 bytes), b.ll (167 bytes), c.ll (157 bytes), d.ll (163 bytes), e.ll (157 bytes), f.ll (158 bytes), g.ll (165 bytes), h.ll (168 bytes), i.ll (152 bytes), j.ll (129 bytes), k.ll (154 bytes), l.ll (147 bytes), m.ll (131 bytes)

A couple quick comments inline (didn’t touch on all points):

Hello Sean,

Thank you for your reply. I'll give your suggestions for items 6 and 7 a try tonight. I'll start a compilation and let it run throughout the night. My laptop (x61s) is 8 years old by now, so compiling LLVM takes a little time :)

Regarding item 8. I don't know if anyone is using "": in the wild so fixing the implementation might make sense. If not the documentation (e.g. the QuoteLabel comment) should be updated to be in line with the implementation.

I only included item 9 since I stumbled upon it once cross-referencing the source code with the language specification. Bitrot for a project of this size is to be expected.

I'm still very interested to hear about the items related to types, e.g. item 1 and 2. Is there a good reference which describes how type equality works in LLVM IR? If the source code is the reference, could someone with the high level knowledge get me up to speed?

Item 1 still confuses me, so I'd be very happy if someone with more insight could clarify if this is the intended behaviour and if so the motivation behind it.

As it so happens, I forgot to include item 10 :)

* Item 10 - lli vs. clang output

Using the same source files as before, it seems like lli and clang treat common linkage and constant variables differently. The following execution demonstrates the return value after executing i.ll, j.ll, k.ll and l.ll with lli and clang respectively:

> $ clang i.ll && ./a.out ; echo $?
> 37
>
> $ lli i.ll ; echo $?
> 37
>
> $ clang j.ll && ./a.out ; echo $?
> 0
>
> $ lli j.ll ; echo $?
> 42
>
> $ clang k.ll && ./a.out ; echo $?
> 37
>
> $ lli k.ll ; echo $?
> 37
>
> $ clang l.ll && ./a.out ; echo $?
> Segmentation fault
> 139
>
> $ lli l.ll ; echo $?
> 37

Looking forward to hearing more about type equality, or getting a pointer to where I can read up on it.

Cheers /Robin Eklind

Hello Sean,

Thank you for your reply. I'll give your suggestions for items 6 and 7 a
try tonight. I'll start a compilation and let it run throughout the night.
My laptop (x61s) is 8 years old by now, so compiling LLVM takes a little
time :)

This is why I did so much documentation work when in college. The docs
build much faster.

Regarding item 8. I don't know if anyone is using "": in the wild so
fixing the implementation might make sense. If not the documentation (e.g.
the QuoteLabel comment) should be updated to be in line with the
implementation.

FYI the textual IR doesn't have a compatibility guarantee (we try not to
egregiously change it, but users don't expect .ll to work across versions).

I only included item 9 since I stumbled upon it once cross-referencing the
source code with the language specification. Bitrot for a project of this
size is to be expected.

I'm still very interested to hear about the items related to types, e.g.
item 1 and 2. Is there a good reference which describes how type equality
works in LLVM IR? If the source code is the reference, could someone with
the high level knowledge get me up to speed?

Off the top of my head maybe
http://blog.llvm.org/2011/11/llvm-30-type-system-rewrite.html

Item 1 still confuses me, so I'd be very happy if someone with more
insight could clarify if this is the intended behaviour and if so the
motivation behind it.

As it so happens, I forgot to include item 10 :)

* Item 10 - lli vs. clang output

Using the same source files as before, it seems like lli and clang treat
common linkage and constant variables differently. The following execution
demonstrates the return value after executing i.ll, j.ll, k.ll and l.ll
with lli and clang respectively:

> $ clang i.ll && ./a.out ; echo $?
> 37
>
> $ lli i.ll ; echo $?
> 37
>
>
> $ clang j.ll && ./a.out ; echo $?
> 0
>
> $ lli j.ll ; echo $?
> 42
>
>
> $ clang k.ll && ./a.out ; echo $?
> 37
>
> $ lli k.ll ; echo $?
> 37
>
>
> $ clang l.ll && ./a.out ; echo $?
> Segmentation fault
> 139
>
> $ lli l.ll ; echo $?
> 37

Some of these linkage combinations and operations have dubious semantics.
Talking briefly with Rafael Espindola over a build, sounds like we should
mostly tighten up the verifier to remove some of these weird cases. For
example, storing to a constant is sort of .... I'm sort of surprised it
works at all.

-- Sean Silva

(forgot to cc the list)

Thank you for reviewing and committing the patch, Sean :) It was the first one I've ever submitted to LLVM and the whole process was really smooth! Using Phabricator with GitHub OAuth login was brilliant as it removed one more step for new contributors. I also feel very happy that the first patch ended up removing more code than it introduced :) Not likely to speed up the compilation process by a lot, but one can hope to keep the trend!

I read the blog post about the type system rewrite. Thank you for the link. It did clear up a lot of my uncertainties, but introduced a new one. Could you help me make sense of this part, which was presented under the "Identified structs have a 1-1 mapping with a name" section.

> "... and the only types that can be named are identified structs"

Does this mean that other types cannot be named? What about the type "%x" in b.ll? It seems like I'm interpreting this in the wrong way. Could you help me make this clear? Is there a difference between a named type and an identified type (or are those two ways of saying the same thing)? If types other than structures can be given names, does this name impact type equality somehow?

To keep up with the spirit of the original topic here are a few more items :)

* Item 11 - hexadecimal integer constants

The lexer handles hexadecimal integer constants, e.g. from lib/AsmParser/LLLexer.cpp

> /// HexIntConstant [us]0x[0-9A-Fa-f]+

This representation of integer constants is not mentioned in the language specification as far as I can tell.

* Item 12 - constant expressions

The documentation of sext states that the bit size of the constant must be smaller than the target type, but the implementation also accepts constants which have the same size as the target type. I.e. the documentation should be updated or the implementation made stricter.

> sext (CST to TYPE)
> Sign extend a constant to another type. The bit size of CST must be smaller than the bit size of TYPE. Both types must be integers.

The same goes for the trunc, zext, fptrunc and fpext operations. Some refer to larger instead of smaller, but none states that types of equal size are allowed.

* Item 13 - LocalVar and LocalID for named types

This is more of a question. Why are types referred to using local names "%x" instead of global names "@x"? It seems inconsistent as local names are scoped to the function; a local variable name in one function refers to a different value from a local variable name in another. Since types are scoped to the module wouldn't a global name make more sense?

As always, I'm eager to hear more about the type system in particular. The compilation timed in at 120m36.240s while the test cases took 32m10.111s. It will be interesting to see if this goes up or down as time passes :)

Cheers /Robin Eklind

(forgot to cc the list)

Answers, questions and assumptions are inlined in the response.

If someone with knowledge of the LLVM IR type system could take a look at my assumptions below I'd be very happy.

Thank you for reviewing and committing the patch, Sean :) It was the first
one I've ever submitted to LLVM and the whole process was really smooth!
Using Phabricator with GitHub OAuth login was brilliant as it removed one
more step for new contributors. I also feel very happy that the first patch
ended up removing more code than it introduced :) Not likely to speed up
the compilation process by a lot, but one can hope to keep the trend!

Great!

I read the blog post about the type system rewrite. Thank you for the
link. It did clear up a lot of my uncertainties, but introduced a new one.
Could you help me make sense of this part, which was presented under the
"Identified structs have a 1-1 mapping with a name" section.

"... and the only types that can be named are identified structs"

Does this mean that other types cannot be named? What about the type "%x"
in b.ll? It seems like I'm interpreting this in the wrong way. Could you
help me make this clear? Is there a difference between a named type and an
identified type (or are those two ways of saying the same thing)? If types
other than structures can be given names, does this name impact type
equality somehow?

I'll need to punt to someone else for these questions. I haven't dealt with
this part of the IR in a while.

Anyone else knowledgeable in this area? I would like to list a set of assumptions that I've made after reading the blog post and experimenting with the reference implementation. If anyone could verify these assumptions, and of course point out which are incorrect, I'd be very grateful.

* Assumption 1 - all types can be given a name, not only structures.
* Assumption 2 - the type name works as an alias for all types except structures, and it is ignored when calculating type equality.
* Assumption 3 - for structures the type name works as an identity, and type equality depends on it.
* Assumption 4 - type equality is calculated by comparing the base type (e.g. the underlying type of a type name identifier) of one type against another (recursively and for each element in the case of vectors, arrays and other derived types). In the case of identified structures the comparison is made strictly based on the structure's name, and in the case of structure literals the comparison is made in the same way as for other derived types.

To keep up with the spirit of the original topic here are a few more items
:)

* Item 11 - hexadecimal integer constants

The lexer handles hexadecimal integer constants, e.g. from
lib/AsmParser/LLLexer.cpp

/// HexIntConstant [us]0x[0-9A-Fa-f]+

This representation of integer constants is not mentioned in the language
specification as far as I can tell.

I assume you are talking about the 'u' and 's' prefix? That seems like a
historical artifact. The type system doesn't have signedness so there is no
sense in which a constant can be "signed" or "unsigned". In fact, most
places that even look at the signedness of the lexer's APSIntVal it's just
to issue an error. A patch removing this old cruft would be great.

I'd be happy to remove this old cruft :) Just want to make sure I understood correctly. Are you referring to the prefix or the whole HexIntConstant representation? Because if we simply remove the prefix it would collide with the hexadecimal representation of floating point constants.
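The collision can be illustrated with a small Python sketch. `HEX_INT_RE` transcribes the HexIntConstant comment quoted above; `HEX_FP_RE` is an assumption modelling the plain `0x...` form used for floating point constants. The function `token_kind` is a hypothetical helper, not part of the lexer.

```python
import re

# HexIntConstant, as quoted from the LLLexer.cpp comment.
HEX_INT_RE = re.compile(r"[us]0x[0-9A-Fa-f]+")
# Assumed pattern for hexadecimal floating point constants.
HEX_FP_RE = re.compile(r"0x[0-9A-Fa-f]+")


def token_kind(text):
    """Classify a token the way the two patterns above would."""
    if HEX_INT_RE.fullmatch(text):
        return "hex-int"
    if HEX_FP_RE.fullmatch(text):
        return "hex-fp"
    return "other"


if __name__ == "__main__":
    for token in ["u0xdeadbeef", "s0xFF", "0xdeadbeef", "42"]:
        print(token, "->", token_kind(token))
```

Dropping the `[us]` prefix from the integer pattern would make it identical to the floating point one, so the prefix is the only thing that distinguishes the two token kinds in this model.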

It seems like clang has been using HexIntConstants in the past (and maybe still?), based on the following comment from lib/AsmParser/LLLexer.cpp:

> // Check for [us]0x[0-9A-Fa-f]+ which are Hexadecimal constant generated by
> // the CFE to avoid forcing it to deal with 64-bit numbers.

Is clang still using this representation? If not, I'll start preparing a patch to get rid of the HexIntConstant parsing :)

* Item 12 - constant expressions

The documentation of sext states that the bit size of the constant must be
smaller than the target type, but the implementation also accepts constants
which have the same size as the target type. I.e. the documentation should
be updated or the implementation made stricter.

sext (CST to TYPE)
    Sign extend a constant to another type. The bit size of CST must be
    smaller than the bit size of TYPE. Both types must be integers.

The same goes for the trunc, zext, fptrunc and fpext operations.
Some refer to larger instead of smaller, but none states that types of equal
size are allowed.

Probably worth updating the documentation to what is actually allowed by
the code. Could you please send a patch to LangRef? (and for convenience,
can you point to the relevant source code for citation?).

I'll try to look into it. So far I've not found this in the source code, but rather by examining the behaviour of compiling .ll files with clang.

* Item 13 - LocalVar and LocalID for named types

This is more of a question. Why are types referred to using local names
"%x" instead of global names "@x"? It seems inconsistent as local names are
scoped to the function; a local variable name in one function refers to a
different value from a local variable name in another. Since types are
scoped to the module wouldn't a global name make more sense?

I doubt there's a particular rationale. I wouldn't pay too much attention
to the sigils. They are pretty much arbitrary and just to make the lexer
simpler, similar to using introducer keywords makes the parser simpler.

A more concerning inconsistency regarding sigils (if choice of sigils were
to be concerning) is the use of the same sigils for types and values. Types
are a purely compile-time thing while locals and globals actually
correspond to materializable run-time values (slightly muddled by things
like dbg.declare and llvm.assume).

Would it make sense to start a discussion about this inconsistency where the same sigil is used for types and values? If the compatibility between releases is ensured using the Bitcode format, it may be possible to introduce a patch to the assembly representation of LLVM IR. To port old files to the new representation one could convert .ll files to .bc using the current version of llvm-as, and then convert back using a newer version of llvm-dis. I can understand if this is a low priority issue, but discussing and fixing any inconsistency in the language makes sense and pays off in the long run.

(forgot to cc the list)

Answers, questions and assumptions are inlined in the response.

If someone with knowledge of the LLVM IR type system could take a look at
my assumptions below I'd be very happy.

Thank you for reviewing and committing the patch, Sean :) It was the first
one I've ever submitted to LLVM and the whole process was really smooth!
Using Phabricator with GitHub OAuth login was brilliant as it removed one
more step for new contributors. I also feel very happy that the first patch
ended up removing more code than it introduced :) Not likely to speed up
the compilation process by a lot, but one can hope to keep the trend!

Great!

I read the blog post about the type system rewrite. Thank you for the
link. It did clear up a lot of my uncertainties, but introduced a new one.
Could you help me make sense of this part, which was presented under the
"Identified structs have a 1-1 mapping with a name" section.

"... and the only types that can be named are identified structs"

Does this mean that other types cannot be named? What about the type "%x"
in b.ll? It seems like I'm interpreting this in the wrong way. Could you
help me make this clear? Is there a difference between a named type and an
identified type (or are those two ways of saying the same thing)? If types
other than structures can be given names, does this name impact type
equality somehow?

I'll need to punt to someone else for these questions. I haven't dealt with
this part of the IR in a while.

Anyone else knowledgeable in this area? I would like to list a set of
assumptions that I've made after reading the blog post and experimenting
with the reference implementation. If anyone could verify these
assumptions, and of course point out which are incorrect, I'd be very
grateful.

* Assumption 1 - all types can be given a name, not only structures.
* Assumption 2 - the type name works as an alias for all types except
structures, and it is ignored when calculating type equality.
* Assumption 3 - for structures the type name works as an identity, and
type equality depends on it.
* Assumption 4 - type equality is calculated by comparing the base type
(e.g. the underlying type of a type name identifier) of one type against
another (recursively and for each element in the case of vectors, arrays
and other derived types). In the case of identified structures the
comparison is made strictly based on the structure's name, and in the case
of structure literals the comparison is made in the same way as for other
derived types.

There are quite a few people on the list that can answer this. Just a
matter of waiting for one of them to pipe up.

To keep up with the spirit of the original topic here are a few more items
:)

* Item 11 - hexadecimal integer constants

The lexer handles hexadecimal integer constants, e.g. from
lib/AsmParser/LLLexer.cpp

/// HexIntConstant [us]0x[0-9A-Fa-f]+

This representation of integer constants is not mentioned in the language
specification as far as I can tell.

I assume you are talking about the 'u' and 's' prefix? That seems like a
historical artifact. The type system doesn't have signedness so there is no
sense in which a constant can be "signed" or "unsigned". In fact, most
places that even look at the signedness of the lexer's APSIntVal it's just
to issue an error. A patch removing this old cruft would be great.

I'd be happy to remove this old cruft :) Just want to make sure I
understood correctly. Are you referring to the prefix or the whole
HexIntConstant representation? Because if we simply remove the prefix it
would collide with the hexadecimal representation of floating point
constants.

If we don't currently accept 0xDEADBEEF as an integer constant, then it's
probably safe to remove HexIntConstant altogether. That u and s prefixed
stuff is clearly out of date by several years, so clearly nobody is relying
on this if that is the only way to get a hex integer constant.

It seems like clang has been using HexIntConstants in the past (and maybe
still?), based on the following comment from lib/AsmParser/LLLexer.cpp:

> // Check for [us]0x[0-9A-Fa-f]+ which are Hexadecimal constant generated by
> // the CFE to avoid forcing it to deal with 64-bit numbers.

Is clang still using this representation? If not, I'll start preparing a
patch to get rid of the HexIntConstant parsing :)

I don't think any code inside of clang ever directly writes .ll files; it
all happens via the llvm libraries. So all you need to make sure is that
nowhere inside the llvm libraries will write out .ll which has this
construct.

* Item 12 - constant expressions

The documentation of sext states that the bit size of the constant must be
smaller than the target type, but the implementation also accepts constants
which have the same size as the target type. I.e. the documentation should
be updated or the implementation made stricter.

sext (CST to TYPE)
    Sign extend a constant to another type. The bit size of CST must be
    smaller than the bit size of TYPE. Both types must be integers.

The same goes for the trunc, zext, fptrunc and fpext operations.
Some refer to larger instead of smaller, but none states that types of equal
size are allowed.

Probably worth updating the documentation to what is actually allowed by
the code. Could you please send a patch to LangRef? (and for convenience,
can you point to the relevant source code for citation?).

I'll try to look into it. So far I've not found this in the source code,
but rather by examining the behaviour of compiling .ll files with clang.

Surely there is somewhere in the llvm libraries where we either reject or
accept (through inaction) extension/truncation to types of the same size.
Maybe the verifier?

* Item 13 - LocalVar and LocalID for named types

This is more of a question. Why are types referred to using local names
"%x" instead of global names "@x"? It seems inconsistent as local names are
scoped to the function; a local variable name in one function refers to a
different value from a local variable name in another. Since types are
scoped to the module wouldn't a global name make more sense?

I doubt there's a particular rationale. I wouldn't pay too much attention
to the sigils. They are pretty much arbitrary and just to make the lexer
simpler, similar to using introducer keywords makes the parser simpler.

A more concerning inconsistency regarding sigils (if choice of sigils were
to be concerning) is the use of the same sigils for types and values. Types
are a purely compile-time thing while locals and globals actually
correspond to materializable run-time values (slightly muddled by things
like dbg.declare and llvm.assume).

Would it make sense to start a discussion about this inconsistency where
the same sigil is used for types and values? If the compatibility between
releases is ensured using the Bitcode format, it may be possible to
introduce a patch to the assembly representation of LLVM IR. To port old
files to the new representation one could convert .ll files to .bc using
the current version of llvm-as, and then convert back using a newer version
of llvm-dis. I can understand if this is a low priority issue, but
discussing and fixing any inconsistency in the language makes sense and
pays off in the long run.

I don't think anybody really cares about the sigils. They are just there to
simplify the lexer/parser code. In this case, the complexity of
reconstructing the .ll files *including the FileCheck comments* is probably
not worth it (especially since any mistakes effectively end up silently
reducing our test coverage).

-- Sean Silva

I would prefer it if we kept hex integer literals. The .ll syntax mostly exists to support compiler developers reasoning about the code. If you hand write a .ll file, the hex syntax can be very handy. Besides, we need to parse floating point hex constants anyway.

I would prefer it if we kept hex integer literals. The .ll syntax mostly
exists to support compiler developers reasoning about the code. If you hand
write a .ll file, the hex syntax can be very handy. Besides, we need to
parse floating point hex constants anyway.

What I was saying is that I don't think we have a way to make a hex integer
literal without the 'u' or 's' prefix. I assume nobody uses the 'u' or 's'
prefix, so there's no harm in removing the parsing of such prefixed
constants (i.e. all hex integer constants AFAICT). AFAICT, 0xdeadbeef is
considered to have floating point type.

E.g.:

Sean:~/pg/llvm/test/Integer % cat ~/tmp/testhexconstant.ll
define i32 @foo() {
  ret i32 0xdeadbeef
}
Sean:~/pg/llvm/test/Integer % ~/pg/release/bin/llvm-as <~/tmp/testhexconstant.ll
/Users/Sean/pg/release/bin/llvm-as: <stdin>:2:11: error: floating point constant invalid for type
  ret i32 0xdeadbeef
          ^
zsh: exit 1 ~/pg/release/bin/llvm-as < ~/tmp/testhexconstant.ll

Grepping the repository finds some interesting related stuff:

Sean:~/pg/llvm/test % git grep 'i32.*0x'

We seem to have cases where we add explicit comments regarding the hex form
of an integer literal, presumably due to lack of an ability to just write
one, e.g.:

CodeGen/AArch64/bitfield-insert.ll: %oldval_keep = and i32 %oldval, 2214592511 ; =0x83ffffff

We also seem to have a test verifying that we accept a hacky workaround:

Feature/fold-fpcast.ll: ret i32 bitcast(float 0x400D9999A0000000 to i32)
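For anyone hand-writing such constants in the meantime, the arithmetic behind both examples can be checked with a short Python sketch. It assumes that a plain `0x...` floating point constant holds the 64-bit IEEE double bit pattern even when the type is float; both function names are hypothetical helpers, not LLVM tools.

```python
import struct


def hex_to_i32_decimal(hex_text):
    """Unsigned decimal spelling of a 32-bit pattern given in hex."""
    value = int(hex_text, 16)
    if not 0 <= value < 2**32:
        raise ValueError("pattern does not fit in 32 bits")
    return value


def float_bits_from_double_bits(double_bits):
    """Bits of the 32-bit float whose value equals the given double pattern."""
    (as_double,) = struct.unpack(">d", struct.pack(">Q", double_bits))
    (bits,) = struct.unpack(">I", struct.pack(">f", as_double))
    return bits


if __name__ == "__main__":
    # Decimal form used in the bitfield-insert.ll comment.
    print(hex_to_i32_decimal("0x83ffffff"))
    # i32 value produced by the bitcast workaround in fold-fpcast.ll.
    print(hex(float_bits_from_double_bits(0x400D9999A0000000)))
```

The first helper reproduces the decimal literal that the test has to spell out by hand; the second shows which i32 bit pattern the bitcast workaround yields.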

-- Sean Silva

Hello everyone!

I've recently had a chance to familiarize myself with the nitty-gritty
details of LLVM IR. It has been a great learning experience, sometimes
frustrating or confusing but mostly rewarding.

There are a few cases I've come across which seem odd to me. I've tried
to cross-reference them with the language specification and the source
code to the best of my abilities, but would like to reach out to an
experienced crowd with a few questions.

Could you help me out by taking a look at these examples? To my novice
eyes they seem to highlight inconsistencies in LLVM IR (or the reference
implementation), but it is quite likely that I've overlooked something.
Please help me out.

Note: the example source files have been attached and a copy is made
available at https://github.com/mewplay/ll

* Item 1 - named pointer types

It is possible to create a named array pointer type (and many others), but
not a named structure pointer type. E.g.

%x = type [1 x i32]* ; valid.
%x = type {i32}* ; invalid.

Is this the intended behaviour? Attaching a.ll, b.ll, c.ll and d.ll for
reference. All files except d.ll compile without error using clang version
3.5.1 (tags/RELEASE_351/final).

Only struct types may be named. What you're seeing is an artifact of the
.ll parser's backward-compatibility support for the old (LLVM 2.x) syntax.
In the array case, the resulting llvm::Module does not have any type named
%x. In the struct case, it's a hard error, as you noticed. LLVM 2.x used to
permit all types to have names.

> $ clang d.ll
> d.ll:3:16: error: expected top-level entity
> %x = type {i32}*
> ^
> 1 error generated.

Does it have anything to do with type equality? (just a hunch)

* Item 2 - equality of named types

A named integer type is equivalent to its literal type counterpart, but
the same is not true for named and literal structures.

Right. Since named non-struct types don't exist, what's really going on is
that the .ll parser remembers the %name to Type* mapping and uses that all
over. Hence they're pointer equivalent. For structs, this is not so:
structs with identical contents but different names are different.

I am certain that I've read about this before, but can't seem to locate the
right section of the language specification; could anyone point me in the
right direction? Also, what is the motivation behind this decision? I've
skimmed over the code which handles named structure types (in
lib/IR/core.cpp), but would love to hear the high level idea.

Attaching e.ll, f.ll, g.ll and h.ll for reference. All compile just fine
except h.ll, which produces the following error message (using the same
version of clang as above):

> $ clang h.ll
> h.ll:10:23: error: argument is not of expected type '%x = type { i32 }'
> call void (%x)* @foo({i32} {i32 0})
> ^
> 1 error generated.

* Item 3 - zero initialized common linkage variables

According to the language specification common linkage variables are
required to have a zero initializer [1]. If so, why are they also required
to provide an initial value?

I don't know but I can guess. We want code that checks for an initial value
(via the C++ API) to only look in one place, GV->getInitializer(), instead
of adding a check for isCommon() at each call site.

Of course we could make the .ll text for this whatever we want, but having
a zero initializer requirement more closely matches what's going on with
the objects in memory.

Attaching i.ll and j.ll for reference. Both compile just fine and, once
executed, i.ll returns 37 and j.ll returns 0. If the common linkage variable
@x was not initialized to 0, j.ll would have returned 42.

* Item 4 - constant common linkage variables

The language specification states that common linkage variables may not be
marked as constant [1]. The parser doesn't seem to enforce this
restriction. Would doing so cause any problems?

In general, restrictions are enforced by the verifier, not the .ll parser.
The verifier operates on the in-memory model and is the source of truth for
validity of IR.

$ cat a.ll
@x = common global i32 1
$ llvm-as a.ll
llvm-as: assembly parsed, but does not verify as correct!
'common' global must have a zero initializer!
i32* @x

All passes are expected to assume that their inputs pass the verifier, and
are permitted to execute undefined behaviour if they do not. All passes
are expected to leave the IR in a state where the verifier passes (on the
assumption that the input did). Same with bitcode reader and writer. There
are some utility functions that are used during the execution of a pass
which cannot make this assumption since the IR may be invalid during a
larger transformation.

Attaching k.ll and l.ll for reference. Both compile just fine, but once
executed k.ll returns 37 (i.e. the constant variable was overwritten) while
l.ll segfaults as expected when it tries to overwrite a read-only memory
location.

* Item 5 - appending linkage restrictions

An extract from the language specification [1]:

> "appending" linkage may only be applied to global variables of pointer
> to array type.

Similarly to item 4 this restriction isn't enforced by the parser. Would
it make sense to do so, or is there any problem with such an approach?

Same as above, it's in the verifier.
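
As a rough illustration of the restriction (the names @table and @scalar are
made up, and I have not checked the exact diagnostic a given release prints):
a global's own type is a pointer to its value type, so "pointer to array
type" in the quoted text means the value type written in the .ll file must
be an array.

```llvm
; Accepted: the value type is an array, so @table is a pointer to an array.
@table = appending global [2 x i32] [i32 1, i32 2]

; Expected to parse but then fail verification: the value type is not an
; array.
;   @scalar = appending global i32 0
```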

* Item 6 - hash token

The hash token (#) is defined in lib/AsmParser/LLToken.h (release version
3.5.0 of the LLVM source code) but doesn't seem to be used anywhere else in
the source tree. Is this token a historical artefact or does it serve a
purpose?

It's gone! This was removed in r227442.

* Item 7 - backslash token

Similarly to item 6 the backslash token doesn't seem to serve a purpose
(with regards to release version 3.5.0 of the LLVM source code). Is it used
somewhere?

Yep, again.

* Item 8 - quoted labels

A comment in lib/AsmParser/LLLexer.cpp (once again, release version 3.5.0
of the LLVM source code) describes quoted labels using the following regexp
(i.e. at least one character between the double quotes):

> /// QuoteLabel "[^"]+":

In contrast the reference implementation accepts quoted labels with zero
or more characters between the double quotes. Which is to be trusted? The
comment makes more sense as the variable name would effectively be blank
otherwise.

I think this is a bug. Well, two bugs:

$ cat a.ll
@"" = internal constant i32 0
@0 = internal constant i32 0
$ llvm-as a.ll
llvm-as: a.ll:2:1: error: variable expected to be numbered '%1'
@0 = internal constant i32 0
^

Anonymous values are numbered, one set of numberings for local variables
(including arguments) and one for globals. I think that @"" should not be
anonymous, but llvm-as clearly thinks it is. If you check llvm::Value's
getValueName() method, we clearly support a distinction between an empty
string and no string.

The other bug is in the error message. The variable should be numbered '@1'
not '%1'.
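
For reference, here is a small sketch of how the two numbering sets behave
when nothing is named (the function @f is made up for illustration). Note
that within a function the unnamed entry block also consumes a number.

```llvm
; Unnamed globals are numbered @0, @1, ... in order of appearance.
@0 = internal constant i32 7
@1 = internal constant i32 9

; Locals have their own counter per function. The unnamed argument is %0,
; the unnamed entry block takes %1, so the first unnamed result is %2.
define i32 @f(i32) {
  %2 = add i32 %0, 1
  ret i32 %2
}
```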

* Item 9 - undocumented calling conventions

The following calling conventions are valid tokens but not described in
the language reference as of revision 223189:

intel_ocl_bicc, x86_stdcallcc, x86_fastcallcc, x86_thiscallcc,
x86_vectorcallcc, arm_apcscc, arm_aapcscc, arm_aapcs_vfpcc,
msp430_intrcc, ptx_kernel, ptx_device, spir_kernel, spir_func,
x86_64_sysvcc, x86_64_win64cc, ghccc

Ooh. Yes, these should be documented!

Nick

Lastly I'd just like to thank the LLVM developers for all the time and hard