archives

hashing facts (the secret of nnimh)

My intention here is to start a topic on "entertaining hashing facts", in case a programming language decides to build some particular hash algorithm into a module with persistent impact (in a manner obligating backwards compatibility). In the spirit of proverb measure twice and cut once, such decisions out to be informed by reliable facts instead of guesses. I wonder if others know things useful in planning PL effects.

It's possible there's not much interesting to say, in which case the discussion goes nowhere. That's okay. But here's an example of what I mean. A few months ago I was thinking about putting a checksum in a persistent file format, where it was going to be important that a PL portably knew how to do the right checksum. The 32-bit checksum used by gzip has a polynomial specified by RFC, which appears in the penultimate four bytes of a gzip file format, in little endian order (followed by uncompressed length in the final four bytes, also in little endian order). While you can just use crc32() from the zlib library for this, what if you didn't, perhaps because a 'better' version is dictated by local code rules? How can you informally check the right checksum is generated for a standard input?

I wrote a search over four and five letter lowercase sequences, followed by a newline, to find crc checksums with a memorable looking hex pattern, by printing those with patterns like anagrams in hex nibbles. I was really lucky, because I found an input whose hash has the same eight digits in a row, so big and little endian order look the same.

% echo nnimh > nnimh.txt
% hexdump nnimh.txt
0000000 6e 6e 69 6d 68 0a                              
0000006
% gzip nnimh.txt 
% hexdump nnimh.txt.gz 
0000000 1f 8b 08 08 80 e2 cb 55 00 03 6e 6e 69 6d 68 2e
0000010 74 78 74 00 cb cb cb cc cd e0 02 00 ee ee ee ee
0000020 06 00 00 00                                    
0000024

As you can see, the value of crc32(0, "nnimh\n", 6) is equal to 0xeeeeeeee, and this is easy to check on a command line when an environment has ready access to gzip and hexdump. As a secondary bonus, to me the pattern has a mnemonic, because it reminds me of a Disney movie, The Secret of NIMH.

Maybe discussing hashes and checksums is a dumb idea, in which case I apologize for bringing it up. I guess I'm hoping folks will avoid an argument about the fastest or most effective hash, which are things easy to measure empirically if you write a careful test. (For example, in a hashmap resolving collision by probing, a hash is best when unused slots are distributed uniformly, which you can measure as standard deviation of distance between non-adjacent empty buckets. I don't think intuitive opinion has a place in arguments based on empirical tests.)