Discussion:
A good hash algorithm
Jason Evans
2002-07-12 21:44:09 UTC
Permalink
A bit of background: I'm not actually a Lua user, but I've been reading the
list archives for about a year, since someone pointed Lua out to me as
having similar uses to Onyx, a language which I wrote
(http://www.canonware.com/).

The thread here about hashing algorithms interested me, and I ended up
trying out Bob Jenkins's hash function with Onyx. The results were
underwhelming; variations on a simple old algorithm[1] turn out to have
similar runtime performance, and significantly lower collision rates than
Jenkins's. Naturally the results depend on the inputs, but Jenkins's
algorithm was a step backward for Onyx in all of my tests, and it wouldn't
surprise me if the same is true for Lua.

Jason

[1] djb2 algorithm: http://www.cs.yorku.ca/~oz/hash.html
Luiz Henrique de Figueiredo
2002-07-14 14:20:53 UTC
Permalink
Post by Jason Evans
variations on a simple old algorithm[1] turn out to have
similar runtime performance, and significantly lower collision rates than
Jenkins's
[1] djb2 algorithm: http://www.cs.yorku.ca/~oz/hash.html
Lua's hash function for strings is similar to djb2.
The core of djb2 is
h = ((h << 5) + h) + c

The core of Lua's function is
h = h ^ ((h<<5)+(h>>2)+c)
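
For anyone who wants to compare the two directly, here is a minimal sketch of
both loops as self-contained C functions; the seeds are simplified assumptions,
since Lua's real code seeds h with the string length and bounds the number of
steps:

unsigned long hash_djb2 (const char *s) {
  unsigned long h = 5381;              /* djb2's traditional seed */
  int c;
  while ((c = (unsigned char)*s++) != 0)
    h = ((h << 5) + h) + c;            /* i.e., h = h*33 + c */
  return h;
}

unsigned long hash_lua_style (const char *s) {
  unsigned long h = 0;                 /* simplified; see note above */
  int c;
  while ((c = (unsigned char)*s++) != 0)
    h = h ^ ((h << 5) + (h >> 2) + c);
  return h;
}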

Of course, being similar does not guarantee anything about performance, but
we're very satisfied with the overall performance (code size and speed) of
Lua's hashing.
--lhf
Luiz Henrique de Figueiredo
2002-06-18 19:11:58 UTC
Permalink
let the lua authors evaluate the well presented claims on the page.
Perhaps a kind soul has the time for this. If someone comes up with a better
hash algorithm than the one used in Lua and has established that it is better
in Lua, we'd be happy to consider adding it to the code, but please make sure
that it's not slower than the current one.
--lhf
Steve Dekorte
2002-06-18 20:55:27 UTC
Permalink
On Tuesday, June 18, 2002, at 12:11 PM, Luiz Henrique de Figueiredo wrote:
Post by Luiz Henrique de Figueiredo
let the lua authors evaluate the well presented claims on the page.
Perhaps a kind soul has the time for this. If someone comes up with a better
hash algorithm than the one used in Lua and has established that it is better
in Lua, we'd be happy to consider adding it to the code, but please make sure
that it's not slower than the current one.
--lhf
I ran across a paper that found that moving the last accessed entry to
the front of the bucket significantly improved performance for typical
usage cases. Does Lua already use this trick?

Cheers,
Steve
R***@oxfam.org.uk
2002-06-18 19:51:16 UTC
Permalink
Post by Luiz Henrique de Figueiredo
let the lua authors evaluate the well presented claims on the page.
Perhaps a kind soul has the time for this. If someone comes up with a better
hash algorithm than the one used in Lua and has established that it is better
in Lua, we'd be happy to consider adding it to the code, but please make sure
that it's not slower than the current one.
I'm not volunteering to do that :)

One of the interesting things about the Lua hash algorithm (aside from the
fact that it doesn't depend on the 32-bitness of integers) is that it is
time-bounded for long strings -- there is a maximum on the number of
characters hashed. This may strike people like the author of the cited
algorithm as inferior, but one needs to balance costs. The "best" hash
algorithm is not necessarily the fastest in actual practice.
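
As a concrete sketch of that bound (the exact seed and step formula here are
approximations, not Lua's literal code):

#include <stddef.h>

unsigned long hash_bounded (const char *s, size_t l) {
  unsigned long h = l;            /* seed with the string length */
  size_t step = (l >> 5) + 1;     /* long strings: sample, don't read all */
  for (; l >= step; l -= step)    /* runs at most ~32 times, however long */
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)*(s++));
  return h;
}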

Consider the case of a hash-table with 20 keys, each of which is a string
of a million bytes. A "really good" hash algorithm would probably succeed
in creating a unique hash for each string, but in O(1,000,000) time. The
Lua hash algorithm might hash all of them onto the same value; the cost of
that would be that the table lookup would be essentially linear, or O(20).
In practice, scanning 20 table entries on each lookup is probably faster
than computing 20 hashes of a million bytes, unless there are a lot of
lookups.

Here's my $0.02 contribution to Lua hashes, though (and I haven't looked
at v5 yet, so ignore me if you've done this):

Once a string is interned into the string hash table, its address becomes a
unique identifier. So there are two alternatives: use the hash as
originally computed as the string's lookup hash, or use the address as with
any other allocated object. If the original hash is sub-ideal, the address
may be a better lookup hash.

The objection might arise that there are not many random bits in a malloc'd
address. This is true: neither the high-order nor the low-order bits are
random. However, there are probably as many random bits as the log of the
number of elements in the largest table, because objects will be randomly
scattered around allocated memory, and allocated memory is larger than the
largest table. Consequently, dropping the last four bits or so of the
address and then using the rest as a hash value is likely to be pretty
good. (If I'm not mistaken, currently three bits are dropped, but four
would be better for mallocs that like to 16-byte-align things.)
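
(A one-line sketch of that pointer hash; the 4-bit shift is the assumption
just discussed:)

unsigned long hash_address (const void *p) {
  /* drop the always-zero alignment bits of a 16-byte-aligned malloc */
  return (unsigned long)p >> 4;
}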

Rici
Steve Dekorte
2002-06-18 20:53:29 UTC
Permalink
Post by R***@oxfam.org.uk
Consider the case of a hash-table with 20 keys, each of which is a string
of a million bytes. A "really good" hash algorithm would probably succeed
in creating a unique hash for each string, but in O(1,000,000) time. The
Lua hash algorithm might hash all of them onto the same value; the cost of
that would be that the table lookup would be essentially linear, or O(20).
In practice, scanning 20 table entries on each lookup is probably faster
than computing 20 hashes of a million bytes, unless there are a lot of
lookups.
But don't you still need to do an O(1,000,000) string compare on
potentially each of the 20 strings?

Steve
Paul Hsieh
2002-07-02 19:08:35 UTC
Permalink
Sorry for being a Johnny-come-lately here, but I have just read this.

I have used the Bob Jenkins hash function (cited in the link:
http://burtleburtle.net/bob/hash/doobs.html) in production code, and read at least
one paper by the author which gives a very detailed explanation of it (I was led to it
because a coworker discovered it was being used and endorsed in another commercial
product). My assessment of this algorithm is that it is excellent (though not
perfect, as the author indirectly admits -- I don't have the time to go pursue the
mammoth effort that would be required to make it that extra 2% better.)

One must be clear about what it is though. Here's my synopsis of it:

1) What is described is a hash *MIXING* function. It does not describe what the
composition of the hashed object must be. You just have to break down your object
into some collection of byte blocks, be it an address or some length-limited string.

2) It's the fastest algorithm I have ever seen that produces this quality of output.
While it is possible to make faster hashing functions, this can only be achieved if
the size of the objects is smaller than 12 bytes and the range of the hash output is
very small (i.e., you're hashing to an 8-bit index or something) or if you are
willing to sacrifice quality (i.e., increase the odds that two inputs map to the
same index.)

Of course you don't have to use the code verbatim. In fact for small objects, I
would recommend specialized versions that would just inline the main mixing function
(shown after this list) and not bother with his 12 x 4-byte blocking mechanism.

3) Yes, it does require exact bit-length descriptions, just like other high-quality
hashing algorithms such as CRCs/LFSRs.

4) The algorithm has something the author calls an "avalanching" property, which
means that the algorithm will make the best use of the input bits. Actually, he can
only guarantee this for the bottom 31 bits of the 32-bit output; however, I have never
seen a practical application for an output of 32 bits for a hash key (i.e., you
would always mask down the output to something significantly less than 32 bits
anyway.)
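
For reference, here is the inner mixing step from Jenkins's page that point 2
refers to, reproduced from memory (verify against his site before relying on it):

#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}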

Now a comparison with existing Lua code. In lstring.c (Lua v4.x) the hashing
algorithm for strings appears to be:

for (; l>=step; l-=step)
h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++));

Hmm ... this function has a number of "two-character poles". For example, let's say that
the table size is 256, and after some part of the string, the value of h is 1. Then
inserting any even number of 'a' characters will map the hash value back to 1
(modulo 256.) There are numerous other scenarios as well: Let's say that the table
size is 1023, and after some part of the string, the value of h is 509. Then
inserting any even number of 'd' characters will map the hash value back to 509.
The Bob Jenkins algorithm has no such weakness -- you would have to search for
12-character sequences to create such repeating poles.
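
A quick throwaway program to check the 'a'/'a' claim (the starting value h = 1
is just the hypothesis above):

#include <stdio.h>

int main (void) {
  unsigned long h = 1;     /* hypothesised intermediate hash value */
  int i;
  for (i = 0; i < 2; i++)  /* feed in two 'a' characters */
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)'a');
  printf("h mod 256 = %lu\n", h % 256);  /* prints 1: back where we started */
  return 0;
}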

So the code above is fast (for small strings); however, it has questionable quality.

Using addresses with bits hacked off for hashing is interesting. However, what it
means is that if you have lots of sequentially allocated objects (which typically
will be of the same size) you'll run into a kind of "all or nothing" risk when
comparing to another collection of sequentially allocated objects. I.e., either
they totally miss each other (ideal) or they completely overlap (worst-case
scenario). So when hashes for a pair of collections of non-string, non-value
objects are compared, say, you'll typically pay either no hash penalty or the
maximum possible penalty.

My recommendation is that the Lua people at least consider using the Bob Jenkins
algorithm.
--
Paul Hsieh
***@pobox.com
R***@oxfam.org.uk
2002-06-18 21:13:54 UTC
Permalink
Post by Steve Dekorte
But don't you still need to do a O(1,000,000) string compare on
potentially each of the 20 strings?
You only need to do the string compare once, when the string is first
interned. After that, the comparison is by address. I'm assuming that a
program which used enormous strings as keys would enter new keys only
rarely compared with the number of table accesses.
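
In code, the idea is simply this (intern() is a hypothetical helper, named
here only for illustration):

#include <stddef.h>

/* Returns the canonical interned copy of s; the O(n) compare happens
   here, at interning time, and only for new strings. */
const char *intern (const char *s, size_t len);

int keys_equal (const char *k1, const char *k2) {
  return k1 == k2;   /* both already interned: one pointer comparison */
}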

Even so, it is highly likely that the difference will be encountered early
in unequal strings. Equal strings will of course require the full megabyte
comparison, but that would be true regardless of the hashing algorithm used
(unless one chose to allow very-low-probability bugs with an MD5-type
hashing algorithm; personally, though, I would find that unacceptable.)
R***@oxfam.org.uk
2002-06-18 21:33:19 UTC
Permalink
Post by Steve Dekorte
I ran across a paper that found that moving the last accessed entry to
the front of the bucket significantly improved performance for typical
usage cases. Does Lua already use this trick?
That was a good question. I checked the v5 code, and as far as I can see,
the answer is no.

However, Lua hashes are not likely to contain long chains (or buckets)
because the key-val pairs are stored inside the hash array itself; a clever
trick is used which allows hash-tables to get quite full without impeding
performance. (This is part of the secret of Lua's success.) It is quite
common for hash lookups to succeed or fail on the first probe.
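
From memory of the v5 sources, the layout is roughly the following; the names
are illustrative, not Lua's actual identifiers:

/* Colliding keys occupy other slots of the same array, linked by an
   internal "next" pointer, instead of separately allocated chain cells. */
typedef struct Value { int tag; /* ... tagged Lua value ... */ } Value;

typedef struct Node {
  Value key;
  Value val;
  struct Node *next;   /* next colliding entry, elsewhere in the array */
} Node;                /* a table's hash part is one array of Nodes */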

The only reason I know this stuff is that I was curious about how Lua
managed to do hash tables so fast. It's a really nice job. Congratulations,
guys!
Luiz Henrique de Figueiredo
2002-07-03 14:05:01 UTC
Permalink
Post by Paul Hsieh
Now a comparison with existing Lua code. In lstring.c (Lua v4.x) the hashing
algorithm for strings appears to be:
for (; l>=step; l-=step)
h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++));
Hmm ... this function has a number of "two-character poles". For example, let's say that
the table size is 256, and after some part of the string, the value of h is 1. Then
inserting any even number of 'a' characters will map the hash value back to 1
(modulo 256.)
I haven't checked it, but have you taken into account that h is initialized
with the length of the string?
Post by Paul Hsieh
So the code above is fast (for small strings)
It's fast for long strings as well, because in this case it does not look at all
the characters.
Post by Paul Hsieh
My recommendation is that the Lua people at least consider using the Bob Jenkins
algorithm.
We're very satisfied with table performance in Lua. Of course, making it faster
would not hurt, *if* it can be done with more or less the same amount of code.
I don't know how large Jenkins's code is.

I propose that you try changing Lua to use Jenkins's algorithm and report the
results here. Roberto probably has testing code for table performance.

Thanks.
--lhf
Paul Hsieh
2002-07-05 00:01:23 UTC
Permalink
Post by Luiz Henrique de Figueiredo
Post by Paul Hsieh
Now a comparison with existing Lua code. In lstring.c (Lua v4.x) the hashing
algorithm for strings appears to be:
for (; l>=step; l-=step)
h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++));
Hmm ... this function has a number of "two-character poles". For example, let's say that
the table size is 256, and after some part of the string, the value of h is 1. Then
inserting any even number of 'a' characters will map the hash value back to 1
(modulo 256.)
I haven't checked it, but have you taken into account that h is initialized
with the length of the string?
No, that would of course fix the problem. But it just means that the pattern for
where the poles are is only slightly more complicated. The larger point is that the
algorithms studied at Bob Jenkins's site have had serious study put into them. The
function you show above is unknown to me. At the very least you should look at
the "One at a Time" hash function, which looks only a little more expensive than the
one above, and is apparently competitive with Bob Jenkins's formula in terms of quality.
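
For reference, the "One at a Time" function as I remember it from Jenkins's
page (check the site for the authoritative constants):

#include <stddef.h>

unsigned long one_at_a_time (const unsigned char *key, size_t len) {
  unsigned long h = 0;
  size_t i;
  for (i = 0; i < len; i++) {  /* one byte per iteration, fully mixed in */
    h += key[i];
    h += (h << 10);
    h ^= (h >> 6);
  }
  h += (h << 3);               /* final avalanche of accumulated bits */
  h ^= (h >> 11);
  h += (h << 15);
  return h;
}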
Post by Luiz Henrique de Figueiredo
Post by Paul Hsieh
My recommendation is that the Lua people at least consider using the Bob Jenkins
algorithm.
We're very satisfied with table performance in Lua. Of course, making it faster
would not hurt, *if* it can be done with more or less the same amount of code.
I don't know how large Jenkins's code is.
Jenkins's code requires that the size of the object is large enough to amortize the
start-up cost, but as he explains on his site, it is amongst the fastest hash
algorithms that have no defects. Is being satisfied with its "performance"
sufficient? You should still be concerned with minimizing collisions too, shouldn't
you?
Post by Luiz Henrique de Figueiredo
I propose that you try changing Lua to use Jenkins's algorithm and report the
results here. Roberto probably has testing code for table performance.
If you have a benchmark available that still runs on Lua 4.x, then I'd be happy to
try it out.

--
Paul Hsieh
***@pobox.com
Luiz Henrique de Figueiredo
2002-07-05 00:10:59 UTC
Permalink
Post by Paul Hsieh
Is being satisfied with its "performance" sufficient? You should still be
concerned with minimizing collisions too, shouldn't you?
I meant that we're satisfied with its overall performance, and I did not mean
just time; I meant collisions too, but I don't have any hard data on this
right now. We did do several collision tests, including looking at how *many*
keys were spread by the hashing algorithms (we once collected *all* identifiers
in *all* C programs available here at Tecgraf, and they spread quite well).

Anyway, thanks for letting us know about Bob Jenkins's site.
--lhf
