Integer questions

First off, most of my information about the integer representation in
LLVM is from the LLVM Language Reference Manual, and I could use some
things cleared up.

First, I guess that smaller integer sizes, say, i1 (boolean), are
stuffed into a full word size for the CPU it is compiled on (so 8 bits,
or 32 bits, or whatever).
What if someone declares an i4 and compiles it on 32/64-bit
windows/nix/bsd on a standard x86 or x64 system, and sets the value to
15 (the maximum value of an unsigned i4)? If the i4 is still rounded up
to the next available size when compiled (i8, i32, or whatever), then
adding 1 to it would leave 16 in memory, or zero if you ignore all but
the low 4 bits. Is there any enforcement of the range of a given
integer? In other words, regardless of architecture, would an i4 be
constrained to 0 to 15, or is that for the language to enforce? I would
think the latter, since enforcing it at the LLVM level would add
noticeable overhead for non-machine-size integers. And if it is up to
the language, how can the language know which sizes are directly
appropriate for the architecture it is being compiled on?
As a quick guess, I would say that an i4 would be compiled as if it
were an i8 and treated identically to an i8 in all circumstances. Is
this correct?

Second, what if the specified integer size is rather large, say an
i512? Would this cause a compile error (something along the lines of
the integer size being too large to fit the machine architecture), or
would LLVM compile in the necessary code to do bignum math on it? Or is
that also for the language designer to handle? Having it at the LLVM
level would make sense; after all, what knows better how to compile
something for speed on the target system than the compiler itself?
As a quick guess, I would say that specifying an integer bit size too
large for the machine would cause a compile error, but the docs do not
hint at that (especially given the example of i1942652, "a really big
integer of over 1 million bits"). Is this correct?

Third, assuming either or both of the above have to be enforced or
implemented by the language designer, what is the best way for the
language to ask LLVM what the appropriate machine integer sizes are?
For example, if an i64 is specified, the language would do bignum math
on a 32-bit compile but use a native int on a 64-bit compile. I ask
this instead of just testing the CPU's bit width (32-bit, 64-bit,
whatever) because some processors support double-sized integers, so a
64-bit integer on some 32-bit CPUs is just fine, as is a 128-bit int on
a 64-bit CPU. How can I query which integer sizes are the most
appropriate?

Some background on the questions: I am making a JIT'd, speed-critical
'scripting language' for a certain app of mine. Integer types have a
bit-size part, like in LLVM: i4/s4 is a signed integer of 4 bits, u4 is
an unsigned integer of 4 bits, etc.

First off, most of my information about the integer representation in
LLVM is from the LLVM Language Reference Manual, and I could use some
things cleared up.

Okay... that's a good start :)

First, I guess that smaller integer sizes, say, i1 (boolean) are
stuffed into a full word size for the cpu it is compiled on (so 8bits,
or 32 bits or whatever).

The code is compiled so that it works. :) At least hopefully; I think
there are still some bugs lurking with unusual types. CodeGen will
use a native register to do arithmetic.

What if someone made an i4 and compiled it on 32/64 bit
windows/nix/bsd on a standard x86 or x64 system, and they set the
value to 15 (the max size of an unsigned i4), if it is still rounded
up to the next nearest size when compiled (i8 or i32 or what-not),
what if when that value has 15, but a 1 was added to it, it will be
represented in memory at 16, or if you ignore all but the first 4 bits
it would be zero. Is there any enforcement in the range of a given
integer (in other words, regardless of architecture, would an i4 be
constrained to only be 0 to 15, or is this the realm of the language
to enforce, I would think it would be as having it at LLVM level would
add noticeable overhead on non-machine size integers, and given that
it would be in the realm of the language to deal with, how can the
language be certain what values are directly appropriate for the
architecture it is being compiled on)?
In just a quick guess, I would say that something like an i4 would be
compiled as if it was an i8, treated identically to an i8 in all
circumstances, is this correct?

An i4 is a four-bit integer; it is guaranteed to act like a true i4
for all arithmetic operations. CodeGen will mask the integers
appropriately to achieve the desired behavior.
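
A minimal sketch of the masking idea described above, written as plain
C++ rather than LLVM IR. It only models the observable behaviour of an
i4 kept in 8-bit storage; `add_i4` and `kI4Mask` are made-up names for
illustration, and real CodeGen may place the masks differently.

```cpp
#include <cstdint>

// Mask selecting the low 4 bits, i.e. the bits that belong to the i4.
constexpr std::uint8_t kI4Mask = 0x0F;

// Add two i4 values held in 8-bit storage. The sum is computed at the
// wider width and then masked, so the result wraps modulo 16.
std::uint8_t add_i4(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a + b) & kI4Mask);
}
```

With this model, `add_i4(15, 1)` yields 0, which is the wrap-around the
question asks about.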

Second, what if the specified integer size is rather large, say that
an i512 was specified, would this cause a compile error (something
along the lines of the specified integer size being too large to fit
in the machine architecture), or would it rather compile in the
necessary code to do bignum math on it (or would something like that
be in the realm of the language designer, although having it at LLVM
level would also make sense, after all, what best knows how to compile
something for speed on the target system other than the compiler
itself)?
In just a quick guess, I would say that specifying an integer bit size
too large for the machine would cause a compile error, but the docs do
not hint at that (especially with the given example of: i1942652 a
really big integer of over 1 million bits), is this correct?

The language and the optimizers don't have any issues with such types,
at least in theory.

Ignoring bugs, CodeGen can currently handle anything up to i128 for
all operations on most architectures; if the operations aren't
natively supported, it falls back to calling the implementation in
libgcc.

-Eli

Hi,

First, I guess that smaller integer sizes, say, i1 (boolean) are
stuffed into a full word size for the cpu it is compiled on (so 8bits,
or 32 bits or whatever).

On x86-32, an i1 gets placed in an 8-bit register.

What if someone made an i4 and compiled it on 32/64 bit
windows/nix/bsd on a standard x86 or x64 system, and they set the
value to 15 (the max size of an unsigned i4), if it is still rounded
up to the next nearest size when compiled (i8 or i32 or what-not),

The extra bits typically contain rubbish, but you can't tell.
For example, suppose in the bitcode you decide to print out
the value of the i4 by calling printf. So in the bitcode
you first (say) zero-extend the i4 to an i32 which you pass
to printf. Well, the code-generators will generate the
following (or equivalent) for the zero-extension: mask off
the rubbish bits in the i8 register (i.e. set them to zero)
then zero-extend the result to a full 32 bit register. This
all happens transparently.
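
The zero-extension step just described can be modelled in plain C++ as
follows. This is only a sketch under the assumption that the i4 lives in
the low bits of an 8-bit container; `zext_i4_to_i32` is a hypothetical
name, not an LLVM function.

```cpp
#include <cstdint>

// Zero-extend an i4 stored in an 8-bit container to 32 bits.
// The upper four bits of `storage` may hold anything ("rubbish"),
// so they are cleared before the value is widened.
std::uint32_t zext_i4_to_i32(std::uint8_t storage) {
    return static_cast<std::uint32_t>(storage & 0x0F);
}
```

Whatever the upper bits of the container hold, the caller only ever sees
a value between 0 and 15.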

what if when that value has 15, but a 1 was added to it, it will be
represented in memory at 16, or if you ignore all but the first 4 bits
it would be zero.

It acts like an i4: the bits corresponding to the i4 will have the
right value (0) while the rest will have some rubbish.

Is there any enforcement in the range of a given
integer (in other words, regardless of architecture, would an i4 be
constrained to only be 0 to 15, or is this the realm of the language
to enforce, I would think it would be as having it at LLVM level would
add noticeable overhead on non-machine size integers, and given that
it would be in the realm of the language to deal with, how can the
language be certain what values are directly appropriate for the
architecture it is being compiled on)?
In just a quick guess, I would say that something like an i4 would be
compiled as if it was an i8, treated identically to an i8 in all
circumstances, is this correct?

No it is not. It acts exactly like an i4, even though on x86-32
the code-generators implement this in an i8. There is a whole
pile of logic in lib/CodeGen/SelectionDAG/Legalize*Types.cpp in
order to get this effect (currently you have to pass -enable-legalize-types
to llc to turn on codegen support for funky integer sizes).

Second, what if the specified integer size is rather large, say that
an i512 was specified, would this cause a compile error (something
along the lines of the specified integer size being too large to fit
in the machine architecture), or would it rather compile in the
necessary code to do bignum math on it (or would something like that
be in the realm of the language designer, although having it at LLVM
level would also make sense, after all, what best knows how to compile
something for speed on the target system other than the compiler
itself)?

The current maximum the code generators support is i256. If you try to
use bigger integers it will work fine in the bitcode, but if you try
to do code generation the compiler will crash.

In just a quick guess, I would say that specifying an integer bit size
too large for the machine would cause a compile error, but the docs do
not hint at that (especially with the given example of: i1942652 a
really big integer of over 1 million bits), is this correct?

No, you can use i256 on a 32 bit machine for example.
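
To give an idea of why a wide integer can work on a narrow machine, here
is a sketch of 256-bit addition built from 32-bit pieces with carry
propagation. It is an illustration in plain C++, not the code LLVM
generates; `U256` and `add_u256` are names invented for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A 256-bit unsigned integer as eight 32-bit limbs, least significant first.
using U256 = std::array<std::uint32_t, 8>;

// Add two 256-bit values limb by limb, propagating the carry.
// A carry out of the top limb is discarded, so the result wraps
// modulo 2^256.
U256 add_u256(const U256 &a, const U256 &b) {
    U256 result{};
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < result.size(); ++i) {
        std::uint64_t sum = static_cast<std::uint64_t>(a[i]) + b[i] + carry;
        result[i] = static_cast<std::uint32_t>(sum);  // keep the low 32 bits
        carry = sum >> 32;                            // 0 or 1
    }
    return result;
}
```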

Third, assuming either or both of the above things had to be
enforced/implemented by the language designer, what would be the best
way for the language to ask LLVM what the appropriate machine integer
sizes are, so that if an i64 is specified, then bignum math would be
done by the language on a 32-bit compile, but would just be a native
int on a 64-bit compile. The reason this is asked instead of just
directly testing the cpu bit (32-bit, 64-bit, whatever) is that some
processors allow double sized integers to be specified, so a 64-bit
integer on some 32-bit cpu's is just fine, as is a 128-bit int on a
64-bit cpu, thus how can I query what are the best appropriate integer
sizes?

I don't know, sorry.

Ciao,

Duncan.

FYI, there is one other issue here, PR2660. While codegen in
general can handle types like i256, individual targets don't always
have calling convention rules to cover them. For example, returning
an i128 on x86-32 or an i256 on x86-64 doesn't fit in the
registers designated for returning values on those targets.

Dan

I am mostly just interested in x86 (32-bit and 64-bit) based systems,
so it would choke? And how would it choke (exception thrown, some
getlasterror-style message passing, or what)? To ask bluntly, would I
need to do bignum arithmetic on my side, rather than letting LLVM do
it (since the backend would most likely not do it)?

Has anyone thought about putting bignums inside LLVM itself? LLVM
would be able to generate the best code possible for a given system.
I do not mean a bignum like the arbitrary-sized numbers in
Python/Erlang/etc.; a statically sized integer, i2048 for example,
would be best for my use, although if there were an arbitrary-length
version I would put that in the language as well.

Which I guess I should also ask about, how does LLVM do error handling
for when something cannot be compiled for whatever reason?

FYI, there is one other issue here, PR2660. While codegen in
general can handle types like i256, individual targets don't always
have calling convention rules to cover them. For example, returning
an i128 on x86-32 or an i256 on x86-64 doesn't fit in the
registers designated for returning values on those targets.

I am mostly just interested in x86 (32-bit and 64-bit) based systems,
so it would choke? And how would it choke (exception thrown, some
getlasterror style message passing, or what?). To ask bluntly, I
would need to do bignum arithmetic on my side, rather than letting
LLVM do it (since the backend would most likely not do it)?

If assertions are enabled, it will trigger an assertion failure.
This particular issue is only relevant for function return values.

Has anyone thought about putting bignum's inside LLVM itself, LLVM
would be able to generate the best things possible for a given system,
and I do not mean bignum like some arbitrary sized number ala
Python/Erlang/etc. number, some static sized integer would be best for
my use, i2048 for example, although if there were an arbitrary length
version I would put that in the language as well.

Integers like i2048 that are well beyond the reach of the register
set on x86 pose additional challenges if you want efficient generated
code.

Which I guess I should also ask about, how does LLVM do error handling
for when something cannot be compiled for whatever reason?

The Verifier pass is recommended; it catches a lot of
invalid stuff and can be configured to abort, print an error to
stderr, or return an error status and message string.

It doesn't catch everything though; codegen's error
handling today is usually an assertion failure, assuming
assertions are enabled.

Dan

The Verifier pass is recommended; it catches a lot of
invalid stuff and can be configured to abort, print an error to
stderr, or return an error status and message string.

It doesn't catch everything though; codegen's error
handling today is usually an assertion failure, assuming
assertions are enabled.

While looking through some other code in LLVM, I noticed an abort().
How often do those occur? They would be *really* bad in my
circumstances (the couple I saw were in the JIT class), as catching the
abort signal may not always be possible depending on the host's code.

Will the verifier catch things like integers that are too big for the
platform that I am currently on, or will that be done by something
like the JIT when it compiles it?

And yeah, assertions are a very large "no" in this library, especially
since bad code will probably happen very often. It is effectively a
scripting language that will load, compile, and run code at run time.
I do expect to write things out to a cache to help future loading
time, though. Time is not really an issue during loading, but there
may be a potentially massive amount of code later on, so better safe
than sorry. Code speed during execution needs to be as fast as
possible; the usual scripting languages out there are nowhere near
fast enough according to tests, at least the ones that are capable of
continuations.

Which brings me to something else, I will create a new thread due to
the massive topic change...

The Verifier pass is recommended; it catches a lot of
invalid stuff and can be configured to abort, print an error to
stderr, or return an error status and message string.

It doesn't catch everything though; codegen's error
handling today is usually an assertion failure, assuming
assertions are enabled.

Was looking through some other code in LLVM, I noticed an abort(), how
often do those occur as they would be *really* bad to occur in my
circumstances (the couple I saw were in the JIT class) as catching the
abort signal may not always be possible depending on the host's code.

Will the verifier catch things like integers that are too big for the
platform that I am currently on, or will that be done by something
like the JIT when it compiles it?

No, the verifier currently does not know about target-specific
codegen limitations.

And yea, assertions are a very large "no" in this library, especially
since bad code will probably happen very often

Patches to do the kind of checking you're asking about would be
welcome :-). I don't think it makes sense to extend the
Verifier itself here; it's supposed to accept any valid LLVM IR.
I think a separate CodeGenVerifier might be a good approach
though, or possibly extending codegen itself with the ability to
interrupt itself and yield some kind of error status.

Dan

Patches to do the kind of checking you're asking about would be
welcome :-). I don't think it makes sense to extend the
Verifier itself here; it's supposed to accept any valid LLVM IR.
I think a separate CodeGenVerifier might be a good approach
though, or possibly extending codegen itself with the ability to
interrupt itself and yield some kind of error status.

And I presume you all are allergic to exceptions, since I have seen
none so far (probably to help the C bindings and such), so it is
return error codes all over the place then? If I do any extension on
this (short on time, so good chance I cannot, but if I do), the usual
non-exception style I use is this: the return value is a status code,
the last argument passed in is an optional std::string* (the return
code gives the basic error reason, the string gives details), and
anything that actually needs to be returned goes in the first
arguments as references/pointers. Would that work?
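
A small sketch of the calling convention proposed above: outputs first,
status code as the return value, optional std::string* last. Everything
here is hypothetical (the `Status` enum, `pick_storage_bits`, and the
round-up-to-a-power-of-two rule are stand-ins for illustration and do
not correspond to any existing LLVM interface).

```cpp
#include <string>

// Hypothetical status codes, for illustration only.
enum class Status { Ok, UnsupportedIntegerWidth };

// Output parameter first, optional detail string last.
// `max_bits` stands in for whatever limit a backend might have.
Status pick_storage_bits(unsigned &storage_bits, unsigned requested_bits,
                         unsigned max_bits, std::string *error = nullptr) {
    if (requested_bits == 0 || requested_bits > max_bits) {
        if (error)
            *error = "integer width i" + std::to_string(requested_bits) +
                     " is not supported (limit is i" +
                     std::to_string(max_bits) + ")";
        return Status::UnsupportedIntegerWidth;
    }
    // Assumed rule: round up to a power of two, with a minimum of 8 bits.
    unsigned bits = 8;
    while (bits < requested_bits)
        bits *= 2;
    storage_bits = bits;
    return Status::Ok;
}
```

A caller that does not care about details simply omits the last
argument; on failure the output parameter is left untouched.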

Do note, though, that this would cause an interface-breaking change
(just in the function signatures, but still). An alternative would be
to pass in a struct that the user can subclass to perform their own
error reporting; codegen would call a certain function in it before it
finally bails out. That adds a level of indirection, so anyone who
wants to recover would have to keep their own state somewhere. It is
much uglier and would not work well with C-style bindings, whereas the
interface-breaking change would. Or are there any other styles that
are preferred?

Do note though, that would cause an interface breaking change (just in
the function signatures, but still). An alternate would be to just
pass in a struct that the user can subclass to perform their own error
reporting, call a certain function in it or so before it finally bails
out, it puts a level of indirection so if someone wants to recover
they will have to use their own variables stored somewhere, so it is
much more ugly and would not work well with C style bindings, whereas
the interface breaking change would. Or are there any other styles
that are preferred?

I agree; having an abstract class interface for users to
subclass sounds like over-engineering here.

Patches to do the kind of checking you're asking about would be
welcome :-). I don't think it makes sense to extend the
Verifier itself here; it's supposed to accept any valid LLVM IR.
I think a separate CodeGenVerifier might be a good approach
though, or possibly extending codegen itself with the ability to
interrupt itself and yield some kind of error status.

And I presume you all are allergic to exceptions, since I have seen
none so far (probably to help the C bindings and such), return
error codes all over the place then? If I do any extension on this
(short on time, so good chance I cannot, but if I do) the usual
non-exception style I use is the return value is a status code, the
last argument passed in is an optional std::string* (the return code
gives the basic error reason, the string gives details), anything that
actually needs to be returned are the first arguments as
references/pointers, would that work?

I don't know what the best approach is. It would need to be something
that doesn't interfere with users who don't need it, and presumably it
would need to be good enough to meet your needs.

Dan

I agree; having an abstract class interface for users to
subclass sounds like over-engineering here.

The advantage of having another class handle such a thing is that
errors that can be recovered from can continue, whereas pretty much
any other form of error handling handles it after it already bails.

I don't know what the best approach is. It would need to be something
that doesn't interfere with users who don't need it, and presumably it
would need to be good enough to meet your needs.

I just consider some form of error reporting that reports an error by
killing the program in some form, "very bad". I would accept anything
that just says "Unknown Error", as long as it does not kill the
program (why does it do that anyway?), but would prefer some form of
detailed reporting so I could at least tell the user what was wrong so
they can fix it while the program is running and have it be reloaded
in real-time (yes, the format of my little language does allow new
code to be added during run-time, as well as updating old code, the
style the language is used in allows that to be done with little
issue).

I do not suppose I could just go the cheap route and replace all the
assertions and aborts with a macro that does an
assert/abort/exception depending on whether something is #define'd or
not? It would not break anything: it would act identically to how it
does now if no one adds that define to their build, and I could get
'good-enough' information from the exception, although with no chance
of recovery; I would have to restart the operation. This is a bit of a
monstrous hack, but considering the current form of error reporting is
even worse, and this would probably take well less than an hour of
work, I would settle for it until something better is made...
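
Roughly what I have in mind, as a sketch. The macro names and the
`lower_integer` function are invented for this example and are not
existing LLVM macros; the i256 limit is just the number mentioned
earlier in the thread.

```cpp
#include <cstdlib>
#include <stdexcept>

// Hypothetical build switch. Normally it would be passed on the command
// line; it is defined here only so that the example takes the throwing
// path. Without it, the macro falls back to abort().
#define SKETCH_FATAL_THROWS

#ifdef SKETCH_FATAL_THROWS
#define SKETCH_FATAL(msg) throw std::runtime_error(msg)
#else
#define SKETCH_FATAL(msg) std::abort()
#endif

// Stand-in for a codegen step that cannot handle very wide integers.
unsigned lower_integer(unsigned bits) {
    if (bits > 256)
        SKETCH_FATAL("integer type too wide for this backend");
    return bits;
}
```

With the switch defined, a host can wrap the call in try/catch and report
the message; the operation still has to be restarted from scratch.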

I do have to ask though, who thought "abort"ing the program was good
error reporting?