Integer handling

I have ended up making a rather complicated type system to handle
quite a few things, all stemming from the fact that the integer type
has no sign information (even if LLVM itself did not use it, it would
be nice to look at it in my code generator so I can emit one operation
or another when something like udiv/sdiv is called). I essentially
have a variant that holds either an llvm::Type* (which at this point
pretty much just holds the primitives, like the floating point types,
labels, etc.), or a structure I made that just holds an llvm::Type
integer but represents that integer as unsigned (whereas if it is in
the llvm::Type part of the variant it is treated as signed), and a few
other things. The other things include basically any type that can
hold any other type, from arrays to structs and so forth. It was easy
enough; I have made on the order of a crap-ton of static_visitors to
handle the variant in all its very fast, but still overwhelming,
ugliness. I have now got to the point where I need to represent the
function type, and this has suddenly introduced a whole new class of
ugliness to handle, as the types can propagate into the blocks
and so forth.
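
For readers following along, here is a minimal sketch of the variant idea described above. It is not the poster's actual code: it uses C++17 std::variant and std::visit in place of Boost's variant and static_visitor, the struct names and the makeInt/divOpcodeFor helpers are made up for illustration, and only a bit width is stored where a real front end would also keep the llvm::Type*.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical front-end types. A real front end would also store the
// llvm::Type* each one lowers to; only the bit width is kept here.
struct SignedInt   { unsigned bits; };
struct UnsignedInt { unsigned bits; };
struct Float       { unsigned bits; };

using FrontType = std::variant<SignedInt, UnsignedInt, Float>;

// Visitor that picks the LLVM division instruction for an operand type.
struct DivOpcode {
    std::string operator()(const SignedInt&) const   { return "sdiv"; }
    std::string operator()(const UnsignedInt&) const { return "udiv"; }
    std::string operator()(const Float&) const       { return "fdiv"; }
};

// Builds an integer front-end type; the sign exists only on this side.
inline FrontType makeInt(unsigned bits, bool isSigned) {
    assert(bits >= 1);  // an integer type needs at least one bit
    return isSigned ? FrontType{SignedInt{bits}} : FrontType{UnsignedInt{bits}};
}

inline std::string divOpcodeFor(const FrontType& type) {
    return std::visit(DivOpcode{}, type);
}
```

With this, divOpcodeFor(makeInt(32, true)) selects the signed division and divOpcodeFor(makeInt(32, false)) the unsigned one, even though both would lower to the same i32.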

I am just wondering if I am missing some very simple way to handle
this. For example, is there any way in LLVM to 'typedef' (not alias)
one of the integer types (say, an i32) to be another 'type' in such a
way that my codegen can read it and emit different instructions? I
have even been thinking of doing something as simple as treating
integers with an even bit width as signed and those with an odd bit
width as unsigned (then rounding the odd ones to an even width on a
word boundary, so a signed and an unsigned i32 would generate the same
code once compiled, but during creation the signed one would be
represented as i32 and the unsigned one as i33, or vice versa,
whichever).

Does anyone have any ideas about handling this in a much cleaner way
that does not involve creating my own entire type system? How do
others handle this? Do others even expose multiple integer types, or
do they just handle them all as signed or something?

My mind is starting to go numb from all the type system creation over
the past three days (which, although it is very powerful, added quite a
few features over llvm that I do not really need, but still...), and I
cannot even get to the point of starting to split up functions into
parts until I either get my type system sorted out (which will
probably take another week... at least), or dump it all and use some
more 'llvm'ish way.

Hi, I had the same problem, but I don't have a great solution I'm afraid.

If you search the archives there was a suggestion to perhaps look at
the annotation intrinsics, but those weren't helpful for me. You'll
also find some rationale as to why the sign information is removed:
basically because it makes the backend cleaner/simpler/smaller.

It's a bit unfortunate, but as far as I understand, you have to just
deal with having two type systems in the front end.

scott

Yes, the answer is to have your own type system in your front-end. We designed the LLVM IR to be optimizable, not to let you avoid defining a type system :)

-Chris

I know why it was removed, and it does make sense; it just would be
nice if there was an option to get two pointers to a specific
llvm::IntegerType. Functionally they would be identical, but for user
code (hence, my code) it would be useful, as I could match on the
different ones and generate different code for each.

With some discussion with others I think we came up with an acceptable
method (which allows me to completely kill my type duplication system,
thank god). I am going to go down the route of
Java/Python/Lua/whatever_else and just have normal operators (/,
shifts, etc.) act as if the integers are signed. But there will
also be named ops (udiv/sdiv) which can be used in place of a symbol
op (such as /) to be explicit for those who really need unsigned
usage. I plan for the operators to actually all be named (so instead
of + someone could write add, for example), and am just allowing
the symbol ones to make it easier to pick up for
non-such-low-level-programmers. I think it is a decent enough
compromise (and it was not entirely my idea, I like the back-end ugly
coding, not front-end pretty syntax :) ).
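
A rough sketch of how that operator scheme could be lowered, assuming a simple lookup from the source spelling to an LLVM instruction name. The function name lowerOperator and the exact set of spellings are made up for illustration; only the instruction names (add, sdiv, udiv, ashr, lshr, shl) are LLVM's.

```cpp
#include <cassert>
#include <map>
#include <string>

// Maps a source-level operator spelling to an LLVM instruction name.
// Symbol operators default to the signed form; named operators say
// exactly which instruction they want.
inline const std::string& lowerOperator(const std::string& spelling) {
    static const std::map<std::string, std::string> table = {
        // symbol forms: treated as signed
        {"+", "add"}, {"/", "sdiv"}, {">>", "ashr"}, {"<<", "shl"},
        // named forms: explicit
        {"add", "add"}, {"sdiv", "sdiv"}, {"udiv", "udiv"},
        {"ashr", "ashr"}, {"lshr", "lshr"}, {"shl", "shl"},
    };
    auto it = table.find(spelling);
    // Assumption: the parser has already rejected unknown spellings.
    assert(it != table.end() && "unknown operator spelling");
    return it->second;
}
```

So a / b would come out as sdiv, while a user who really wants unsigned behaviour writes udiv (or lshr for a logical shift) by name.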

The main issue I was having with my type system was not that it was
hard (I think it is actually well designed and powerful); it just
caused me to have to write 'wrappers' around nearly everything in llvm,
from functions and blocks to expressions and all. It literally just
started snowballing, and since I am programming by myself, I need
something a little more efficient in this case...

Either way, what do you think of the above style for handling integers
now, think it will work? See any major issues with it? You think
people would have trouble using that style?

OvermindDL1 wrote:

I know why it was removed, and it does make sense; it just would be
nice if there was an option to get two pointers to a specific
llvm::IntegerType. Functionally they would be identical, but for user
code (hence, my code) it would be useful, as I could match on the
different ones and generate different code for each.

Well, a language doesn't need to know whether an int is signed or unsigned if
all the operators are aware of the sign of their operands. Such is the case
with most assembly languages, and now LLVM. It makes the assembly cleaner
(because types are used only for checking, not for overloading operations).
And it seems you're going this way with your language (based on what you
said in the most recent post).

The thing is that high level languages really should encode signed/unsigned
into the type system (or simply deal only with signed integers, as many
languages do). You don't want your human programmers having to worry about
the signedness of an int each time they do anything with it.

OvermindDL1 wrote:

With some discussion with others I think we came up with an acceptable
method (which allows me to completely kill my type duplication system,
thank god).
*snip*
Either way, what do you think of the above style for handling integers
now, think it will work? See any major issues with it? You think
people would have trouble using that style?

Well it will certainly work, but it's a pretty low-level / unsafe feature to
have in any language higher than assembly language. I wouldn't like it if I
was a user. But at least if I want to I can pretend there aren't any
unsigned ints and go about my merry way (so that's OK, the only problem is
when I actually want to use an unsigned int).

The thing is, practically any language higher than assembly is going to
require a type system, and that type system is almost certainly going to be
different to LLVM's. (eg. How do you distinguish dynamic arrays from
pointers? How do you give names to structure types? How do you distinguish
characters from i8s? Strings from arrays of i8s?) LLVM can't possibly
provide a rich enough type system for all the front-end languages; its type
system isn't designed to be used that way. The cleanest compiler design (in
my opinion) will have a completely type-checked program before it touches
the LLVM API at all. The compiler should then always generate type-correct
LLVM code (so your users should never see LLVM type errors, if your compiler
is behaving correctly).
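
To make the point about separate type systems concrete, here is a small sketch, assuming a made-up Kind enum for the front end. Several front-end kinds collapse onto one LLVM type when lowered, so every check that tells them apart has to happen before lowering. The names Kind, llvmTypeOf and assignable are hypothetical, and the "no implicit conversions" rule is just one possible front-end policy.

```cpp
#include <cassert>
#include <string>

// Hypothetical front-end kinds; several share one LLVM representation.
enum class Kind { Int8, Char, Int32, UInt32 };

// Many-to-one lowering: the distinctions vanish once we reach LLVM.
inline std::string llvmTypeOf(Kind kind) {
    switch (kind) {
        case Kind::Int8:
        case Kind::Char:   return "i8";
        case Kind::Int32:
        case Kind::UInt32: return "i32";
    }
    assert(false && "unhandled Kind");
    return "";
}

// Front-end rule for this sketch: no implicit conversions between kinds.
// This is checked before any LLVM code is generated.
inline bool assignable(Kind from, Kind to) { return from == to; }
```

Here a Char and an Int8 both lower to i8, yet assignable(Kind::Char, Kind::Int8) is false, which is exactly the kind of error the user should see from the front end rather than from LLVM.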

The language I am making is not a traditional scripting language, it
is designed for heavy math work. It has no classes, the basic data
structure is a struct, yet even those are only used to pass messages.
It is using the Actor-Oriented model, not Object-Oriented. I have
been creating it to deal with taking the heavy computations and
'offloading' them from the main program in such a way that it can take
advantage of multiple cpu cores, or even multiple computers, in a more
transparent way, so I was intending for it to be pretty low-level
regardless (it is not designed to be a standard scripting language, if
I want a 'pretty' easy-to-use scripting language then I always bind in
Python).

You still gave me thoughts to munch on though. It could be more
widely useful for others if I made the most complex data structures
able to be referenced by names instead of types, but that introduces
message complexities like requiring the end-points in the system to
have the same code loaded (which the Actor model does not require) to
be able to link the names with a relevant data structure (might look
at Erlang for an interpreted implementation of the Actor model).

Looking at it, the Actor model (since it is so ancient) really has no
documentation on any higher level types (even strings do not exist,
just arrays to store 'string-like' things). I should see about coming
up with some 'obvious' higher-level constructs to see if they would
fit in well with the system. I do not want anything complex as that
just makes message passing more difficult.

But for note, in the Actor model there are no global variables (there
is no global state at all actually, unlike near every other language
in existence), there are no pointers (as you might accidentally try to
send a pointer in a message, which 'might' work if the end-point actor
was on the same system, but that is not definite). The only
'pointer-like' constructs are in function arguments, and only to local
variables on the actor's stack, since those are very controlled.

By keeping the type system based on the actual types it allows
arbitrary message passing to any other actor without needing to load
any code relating to the actors, you can just send a structure with
the appropriate ID and format and it will 'just work' as the pattern
matching will ensure the other actor handles it correctly, or it gets
dumped with a message stating no match. I guess I could have 'message
header' files to define message types and allow those to be transferred
across systems. A 'name' for a structure would still generally be
larger than the type description of that actor when serialized though,
which kind of defeats the purpose unless some method was put in to
match message struct names with some sort of global identifier
database across the whole system that is synced on all machines.
The problem with that is that if two systems set up a connection
between them, that could be a lot of data to have to sync up, which
could cause mismatches in the meantime, which again makes type
matching sound safer. If I did have the message name itself
transferred and did not mind it taking up extra bandwidth, there could
still be issues if another user created a message of the same name but
a different type: how would it be matched? The Erlang style of IDs
and pattern matching on the types is still safer...

So many things to consider...

OvermindDL1 wrote:

It is using the Actor-Oriented model, not Object-Oriented.
/* snip */
By keeping the type system based on the actual types it allows
arbitrary message passing to any other actor without needing to load
any code relating to the actors, you can just send a structure with
the appropriate ID and format and it will 'just work' as the pattern
matching will ensure the other actor handles it correctly, or it gets
dumped with a message stating no match.

Hm. I don't know much about the Actor model (but I do deal with declarative
languages which similarly have no globals or pointers, so I get that
concept).

It seems like your language is very high level indeed (it almost sounds
dynamic) if you can pass arbitrary messages without needing to statically
know the type of anything, and pass data transparently from one machine to
another.

If it's indeed dynamic, then you probably want to implement a minimal
dynamic runtime system in LLVM - have each value wrapped up in a struct
which also contains some small piece of runtime information which holds its
type. For example, along with each int (in an i32), store an i1 (bool) which
specifies whether it is signed or unsigned, and have all relevant operations
(such as division) depend upon that bool. Obviously this makes the language
a lot higher level.
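
As a plain C++ illustration of that suggestion (not LLVM IR), a minimal sketch might look like the following. The TaggedInt struct and the divide function are made up for the example, and refusing mixed signedness is an assumption of this sketch, not something stated above.

```cpp
#include <cassert>
#include <cstdint>

// Runtime-tagged integer: 32 payload bits plus a one-bit signedness tag,
// mirroring the "i32 plus an i1" struct suggested above.
struct TaggedInt {
    std::uint32_t bits;
    bool isSigned;
};

// Division consults the tag at run time to decide how to read the bits.
inline TaggedInt divide(TaggedInt a, TaggedInt b) {
    assert(a.isSigned == b.isSigned);  // mixed signedness is rejected here
    assert(b.bits != 0);               // division by zero is not handled
    if (a.isSigned) {
        // Reinterpret the bits as two's complement (guaranteed from C++20,
        // and the behaviour of mainstream compilers before that).
        auto lhs = static_cast<std::int32_t>(a.bits);
        auto rhs = static_cast<std::int32_t>(b.bits);
        assert(!(lhs == INT32_MIN && rhs == -1));  // would overflow
        return {static_cast<std::uint32_t>(lhs / rhs), true};
    }
    return {a.bits / b.bits, false};
}
```

The same bit pattern 0xFFFFFFF6 divided by 2 gives 0xFFFFFFFB (that is, -5) when tagged signed and 0x7FFFFFFB when tagged unsigned, at the cost of a branch on every operation.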

It is in no way dynamic, everything is compiled down. All Actors are
fully compiled code, each could be akin to its own Process. Each
Actor has its own event loop, parses its own messages, etc...
When a Message is sent you just build up a structure and send it off.
At compile time the structure is stuffed into a bitstream with some
metadata at the front describing the structure (it uses 4 or 5 bits
per structure 'element', so there is a little overhead, but it is very
little). The Message is then sent off and the receiving Actor will
receive it in some amount of time (the Actor model uses unbounded
nondeterminism, yea, try imagining that), when it does receive it then
in the event loop it parses it like this (this is more erlangish
syntax since I have not finalized my syntax for the receiving loop in
my language):
receive ->
    {"Calc", MyDataType newData, PID sender} ->
        // calculate something from newData and send the 'Resp'onse
        // back to the sender
        sender.sendMessage({"Resp", SomeFunc(newData)});
    {"Get", PID sender} ->
        // they just want the current data status sent back to them
        // as a 'Resp'onse
        sender.sendMessage({"Resp", currentData});

Where the {"Calc", MyDataType newData, PID sender} structure will
match any structure that has three elements, where the first is an i8
array of 4 elements with specific values (specifically matching
"Calc"; you could just as well use an integer or anything...), the
second element is of type MyDataType (maybe an i32 for example?), and
the third element is of a PID type (a link to another actor). You can
guess how the second pattern is matched. At compile time this whole
structure is compiled down to its base type and a comparison tree of
matching elements is built (in this case the first element is the
only 'compared' one and the others are just matched based on type).
If a match is successful then the variables named in the pattern
(if no variable is specified then the Type is still matched at that
position, but nothing is set) are set to the values at those positions
in the message, so if the sender originally sent a message like:
    anActor.send({"Calc", i32(18), self});
This builds a struct of { [i8 x 4], i32,
{someOtherStructThatRepresentsAPid} }, which is compiled down to a
bitstream and sent out. If the receive loop above matched
something like:
    {i8[4] theName, MyDataType, PID}
Then you could get what the array is as theName, and ignore the other
two elements. If a message is not matched with anything right now (an
Actor can have many such receive switches for receiving different
messages at different times) then it is stored in a queue for the
Actor (not really a Queue, in the Actor world a new Actor would then
be created that does nothing but hold the message, resending it out
every little bit, and Actors can be 'link'ed in such a way so that if
one dies all 'link'ed Actors die too with a death message, which can
be handled to do some cleanup, but cannot be ignored).
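
To illustrate just the matching rule described above, here is a small sketch in C++. It is a runtime model only: the poster's design resolves all of this at compile time into a comparison tree, which this does not attempt. The Pid, Field, Slot and Pattern names are made up, and an empty literal is used (as an assumption of this sketch) to mean "match on type only".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

struct Pid { std::uint32_t id; };  // hypothetical actor handle

// A message is a sequence of fields; the variant index stands in for the type.
using Field   = std::variant<std::string, std::int32_t, Pid>;
using Message = std::vector<Field>;

// One pattern slot: always checks the type, and additionally checks the
// value when a non-empty literal is given (string fields only).
struct Slot {
    std::size_t typeIndex;  // 0 = string, 1 = i32, 2 = Pid
    std::string literal;    // empty means "match on type only"
};
using Pattern = std::vector<Slot>;

inline bool matches(const Pattern& pattern, const Message& message) {
    if (pattern.size() != message.size()) return false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        assert(pattern[i].typeIndex < std::variant_size_v<Field>);
        if (message[i].index() != pattern[i].typeIndex) return false;
        if (const auto* text = std::get_if<std::string>(&message[i])) {
            if (!pattern[i].literal.empty() && *text != pattern[i].literal)
                return false;
        }
    }
    return true;
}
```

A pattern of {{0, "Calc"}, {1, ""}, {2, ""}} then accepts the {"Calc", i32(18), self} message from the example, and rejects a two-element "Get" message or a three-element message whose first field is not "Calc".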

In reality, Messages are not actually compressed down to a bitstream
unless they need to be. If the Actor is on the local machine and is
not in a running state, the Message is handed over directly to it; if
it is in a running state, a new Actor is created to keep trying to
deliver the Message, and that Actor is handed to the scheduler, which
prompts it periodically to keep trying. Either way, all of this is
basically just a copy of a pointer to the message struct. If instead
the PID refers to something that cannot be passed to directly, like
another cpu on a non-shared memory system, another process, or a
machine across the Internet, then the Message is bitstreamed up and
sent to the handler for the destination (another Actor that just
takes a bitstream in a message, figures out the remote PID for it, and
sends it off to a corresponding handler on the other machine, maybe
even hopping across multiple machines like a router), and it is
eventually de-bitstreamed and passed to the receiving Actor.

So there is matching involved (like many dynamic functional languages
use), but it is all resolved at compile time. All comparison and
matching functions become base instructions that test for each type,
usually with a single test being enough to skip any specific one. The
first element will probably just become a switch test, since the
majority of the time, in a well made Actor system, that is the only
part that has to be compared to determine what a message is, if they
are smart and use integers, or an array of chars that will fit into a
machine word so an integer compare can be used (like "Calc", which is
the size of an i32, a nice and fast compare).
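
The "four chars as one i32" trick can be sketched like this in C++17. The packTag and dispatch names, the Action enum, the byte order (first character in the high byte) and the padding of "Get" to four characters are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Packs a four-character tag into one 32-bit word (first character in
// the high byte), so comparing two tags is a single integer compare.
constexpr std::uint32_t packTag(std::string_view tag) {
    assert(tag.size() == 4);
    std::uint32_t word = 0;
    for (char c : tag) word = (word << 8) | static_cast<unsigned char>(c);
    return word;
}

enum class Action { Calculate, Report, NoMatch };

// The first message element becomes a plain switch over integers.
inline Action dispatch(std::uint32_t tag) {
    switch (tag) {
        case packTag("Calc"): return Action::Calculate;
        case packTag("Get "): return Action::Report;  // padded to four chars
        default:              return Action::NoMatch;
    }
}
```

Because packTag is constexpr, the case labels are folded to integer constants at compile time and the switch never touches the characters again.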

The Actor model has been pretty well researched and really only sees
heavy use in the Telecom industry, which relies on 100% uptime, quite
literally. The flagship Erlang system is the system that runs
Ericsson; it has had 4 minutes of downtime in over ten years thanks to
it being well designed. If code needs to be updated then you can just
start a new node, link it to a global PID to handle all messages of
whatever Actor, and send a kill message to the old Actor, which will
die once it finishes handling the messages it is currently dealing
with. When an Actor dies it can send a kill message to other Actors
that registered themselves as listeners with it. By default, in
Erlang, when an Actor creates another Actor they are auto linked; you
have to explicitly unlink them if you do not want one dying Actor to
bring down the whole set. By explicitly unlinking them you know where
you are setting safety bounds in the system, so you can register an
Actor to, for example, do nothing but wait for a death message from
such a 'system' of linked Actors and, if they die, record the death
reason and re-construct the system as necessary. These are usually
called Monitor or Guard Actors, depending on their exact purpose.

It is actually a very interesting way to program in. I learned Erlang
about a year back, and loved the style (although I hate the
'functional' programming), and I could find no other language that
used the Actor model (the Object-Oriented model took over because of
C++ and languages like that, which kind of killed off the Actor model
except for Erlang). Erlang is also an utter horror to integrate with C++
applications as the C++ application has to 'pretend' to be an endpoint
node, handling all of the special Erlang types, doing the pattern
matching explicitly, etc... I already have a setup API for
integrating my language into a C++ app directly so functions can be
registered and so forth (my preferred way). So now my Actor language
can call exposed C/C++ functions directly, so, if it was a game for
example, there could be an Actor that just takes messages from all the
'object/Actors' of the game world about their position and other
rendering updates, and passes that to the Renderer to tell it to
update. There can be Actors that handle specific Zones, a whole
hierarchy of message passing, where things only handle things that
need to be handled. The nice thing about this model, for example if
you implemented an MMO server in it (*hack*cough*), it could be kept
up persistently, no downtime needed for patches, the single system can
handle the entire game world, regardless of the amount of players. If
you need more server power you can literally just toss on another few
computers, link them to the system, and let some Monitor Actors
suddenly notice that there is a lot of unused CPU power on such
systems so to start sending code over to be compiled and some Actors
started. If the new servers have access to the outside then the
Monitors can start up some more Actors on those machines to handle
more players and have them register with a single login system so they
can handle some more player flow. Have a good Fiber network back-end
and you have a robust, scalable, fast system.

Even just on a single computer, a program made in the Actor model can
scale to any number of CPUs, so it is 'future-proof' with how CPUs
are advancing. Even if CPUs have different capabilities (and LLVM
supports compiling to those other instruction sets) then you can have
specialized Actors running on them (like the Cell processor with its
little secondary SPEs with non-shared memory, but using a Message
Passing central bus; the Actor model represents this style of CPU
perfectly, unlike C++).

Er, I think I made this too long. Either way, nothing is really
dynamic in this setup, everything is pretty well set-in-stone at
compile time. I am mostly making the language for myself, but a
couple of people have shown interest in it, so I am trying to make
it easier to use for others overall, rather than just me. I did not describe
a great amount of details about how Messages are handled, but I would
guess you could glean most of that from what was already stated.

As you can see though, it is best to have only built-in system types.
Letting the user create custom types means that the Bitstream encoding
becomes more complex, hence slower, and it means more code has to be
linked in, meaning that if a new node is set up then even more code
has to be sent over, rather than just a file containing how an Actor
works. Currently I do support an "alias" keyword that can map a
complex type to a simple name for ease of typing, but something made
with that name and something made with its base type are identical in
all operations and are not treated any differently. My alias keyword
works like typedef in C/C++ or the alias keyword in D (unlike the
typedef keyword in D, which actually does make a completely new
non-compatible type).
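
The difference between an alias and a genuinely new type can be shown in C++ itself. This is only an analogy for the "alias" keyword described above; the names Count, StrongCount, half and makeCount are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

using Count = std::int32_t;                  // alias: same type, new name
struct StrongCount { std::int32_t value; };  // distinct type, same payload

// The alias is indistinguishable from its base type; the wrapper is not.
static_assert(std::is_same_v<Count, std::int32_t>);
static_assert(!std::is_same_v<StrongCount, std::int32_t>);

// Accepts the base type, and therefore the alias too.
inline std::int32_t half(std::int32_t x) { return x / 2; }

// The distinct type has to be constructed and unwrapped explicitly.
inline StrongCount makeCount(std::int32_t v) {
    assert(v >= 0);  // assumption of this sketch: counts are non-negative
    return StrongCount{v};
}
```

half(Count{42}) compiles as-is, while half(makeCount(42)) would not; it needs half(makeCount(42).value). That second behaviour is the 'typedef (not alias)' asked about at the start of the thread.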

Anyone feel free to pick this all apart though. Any problems, bugs,
bad designs, etc... all need to be figured out before I get too far
into this.

WRT message-based concurrency: you might want to check out the papers here:

  Sriram Srinivasan

I have not run across that site before. Thank you. Any additional
ideas and implementations are always welcome.