How to deal with potentially unlimited count/length symbol names?

In my language I have anonymous types (essentially tuples), and I have
generated functions (like constructors) which are unique for these
types. If the same type occurs in multiple modules however it should end
up with only one definition being linked. Thus I need a way to give them
the same name.

The problem is that if I derive the name from what the type contains the
length of that name is essential unbound. So how does one generate
names? I'm thinking of just using a long hash and hoping I don't get
accidental collisions. Surely there must be a better way?

Currently, since I'm only dealing with one module, it is very easy to
just assign unique numbers. But obviously this doesn't work with
multiple independent modules since they'd all need the same name. It
will ultimately have to work across libraries as well, so I can't just
create a registry of the type->id.

I think you've covered all the possible implementations.

In terms of just generating long names, LLVM and common platforms can
handle long names reasonably well because C++ often uses such names. Also,
the Itanium C++ ABI has a scheme to compress repeated uses of the same type
which might be of interest; see
http://mentorembedded.github.io/cxx-abi/abi.html#mangling-compression .

In terms of a registry, you might want to consider whether these helpers
actually need to be exposed across libraries.

-Eli

Annoyingly, the larger the type the more important it is to share -- for
small types everything will just be inlined so it doesn't matter.

Any idea on what the limit of a name can be? I'll try a compression like
system as well, but I will likely have to truncate at some point (where
I can add a hash).

IIRC, there isn't any absolute limit. You'll probably run into performance
issues if your symbols get beyond a few kilobytes, though.

-Eli

Just a cryptographic hash (e.g. SHA1) to avoid the need to "hope" that
there are no collisions.

-- Sean Silva

From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu] On Behalf Of Sean Silva
Sent: Wednesday, June 19, 2013 11:45 AM
To: edA-qa mort-ora-y
Cc: <llvmdev@cs.uiuc.edu>
Subject: Re: [LLVMdev] How to deal with potentially unlimited count/length symbol names?

> The problem is that if I derive the name from what the type contains the
> length of that name is essential unbound. So how does one generate
> names? I'm thinking of just using a long hash and hoping I don't get
> accidental collisions. Surely there must be a better way?

Just a cryptographic hash (e.g. SHA1) to avoid the need to "hope" that there are no collisions.

-- Sean Silva

Cryptographic hashes don't guarantee you get no accidental collisions;
their goal is to make it super hard to produce a collision _on purpose_.
What you need is an algorithm designed for string inputs, with good
uniformity, and an adequate output size; there are many.

Accidental collisions are essentially the Birthday Problem:
http://en.wikipedia.org/wiki/Birthday_problem
See particularly the end of the "Square approximation" section, which
specifically discusses the application to hashes, relating probability
of collision to bit-width and number of inputs.
For example, I worked out that with a 128-bit hash you need around
2^50 inputs before you get a collision probability of 1-in-a-billion.
(For comparison, a 64-bit hash gets you about 185,000 inputs for the
same collision probability.)

If you want a "name-brand" algorithm, I'd suggest MD5 over SHA-1.
It produces 128-bit output (versus 160-bit) and the fact that it is
"cryptographically broken" is irrelevant to your use-case.

--paulr

It's obvious by the pigeonhole principle that there must be collisions. The
point of my statement that OP doesn't have to "hope" that it avoids
collisions: crypto hashes are designed (and depended on across the globe)
to make collisions vanishingly unlikely; it's orders of magnitude more
likely that a cosmic ray will silently corrupt the hash computation than
for two strings to collide (except possibly for maliciously-chosen strings
for weaker hashes).

-- Sean Silva

if youd don’t care the readabilit, you can compress the function name…

在 2013-6-20 上午7:22,“Sean Silva” <silvas@purdue.edu>写道:

Please do not conflate the uniformity property (vanishingly unlikely to
trip over collisions by accident) that applies to all good hashes,
with the collision-resistance property (computationally difficult to
intentionally produce an input B with the same hash as some input A)
which is the key _additional_ property of secure hashes that are depended
on across the globe.

OP requires good uniformity and an adequate bit-width to get vanishingly
unlikely accidental collisions.
OP does not require collision resistance.

Sorry to be picky but I spent a dozen or so years building secure systems
and this kind of misunderstanding is a bit of a sore point with me.
We can continue this off-list or at the next Bay Area social if you wish.

Thanks,
--paulr