hashing facts (the secret of nnimh)

My intention here is to start a topic on "entertaining hashing facts", in case a programming language decides to build some particular hash algorithm into a module with persistent impact (in a manner obligating backwards compatibility). In the spirit of the proverb "measure twice and cut once", such decisions ought to be informed by reliable facts instead of guesses. I wonder if others know things useful in planning PL effects.

It's possible there's not much interesting to say, in which case the discussion goes nowhere. That's okay. But here's an example of what I mean. A few months ago I was thinking about putting a checksum in a persistent file format, where it was going to be important that a PL portably knew how to compute the right checksum. The 32-bit CRC used by gzip has its polynomial specified in the gzip RFC (RFC 1952), and the checksum appears as the second-to-last four-byte field of a gzip file, in little-endian order (followed by the uncompressed length in the final four bytes, also little-endian). While you can just use crc32() from the zlib library for this, what if you didn't, perhaps because a 'better' version is dictated by local code rules? How can you informally check that the right checksum is generated for a standard input?
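
One informal way to do that check, sketched here in Python (the file name is hypothetical, and zlib.crc32 stands in for whatever local implementation you want to verify), is to compare your own CRC against the one gzip wrote into the trailer:

import gzip, struct, zlib

# Cross-check against the gzip trailer (RFC 1952): the last eight bytes
# hold the CRC32 and then the uncompressed length, both little-endian.
raw = open("foo.txt.gz", "rb").read()          # hypothetical file name
crc_stored, isize = struct.unpack("<II", raw[-8:])
original = gzip.decompress(raw)                # recover the uncompressed bytes
crc_mine = zlib.crc32(original) & 0xFFFFFFFF   # checksum implementation under test
print(hex(crc_stored), hex(crc_mine), len(original) == isize)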

I wrote a search over four- and five-letter lowercase sequences, each followed by a newline, to find crc checksums with a memorable-looking hex pattern, printing any whose hex nibbles formed something pattern-like, such as an anagram. I got really lucky: I found an input whose hash is the same hex digit repeated eight times, so big- and little-endian order look the same.
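
A sketch of that kind of search in Python, filtering only for the narrowest pattern (all eight hex nibbles identical) rather than general anagrams:

import itertools, string, zlib

# Try four- and five-letter lowercase strings followed by a newline,
# and report any whose CRC32 is one hex digit repeated eight times.
for n in (4, 5):
    for letters in itertools.product(string.ascii_lowercase, repeat=n):
        word = "".join(letters) + "\n"
        crc = zlib.crc32(word.encode()) & 0xFFFFFFFF
        hexed = "%08x" % crc
        if len(set(hexed)) == 1:               # all eight nibbles identical
            print(word.strip(), hexed)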

% echo nnimh > nnimh.txt
% hexdump nnimh.txt
0000000 6e 6e 69 6d 68 0a                              
0000006
% gzip nnimh.txt 
% hexdump nnimh.txt.gz 
0000000 1f 8b 08 08 80 e2 cb 55 00 03 6e 6e 69 6d 68 2e
0000010 74 78 74 00 cb cb cb cc cd e0 02 00 ee ee ee ee
0000020 06 00 00 00                                    
0000024

As you can see, the value of crc32(0, "nnimh\n", 6) is 0xeeeeeeee, and this is easy to check on a command line in any environment with ready access to gzip and hexdump. As a secondary bonus, the pattern has a mnemonic for me, because it reminds me of the animated movie The Secret of NIMH.
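
The same check is a one-liner with Python's zlib binding, and struct makes the endianness point explicit (this value packs to the same four bytes either way):

import struct, zlib

crc = zlib.crc32(b"nnimh\n") & 0xFFFFFFFF
print(hex(crc))                                # 0xeeeeeeee
# The nice accident: the value reads the same in either byte order.
print(struct.pack("<I", crc) == struct.pack(">I", crc))   # True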

Maybe discussing hashes and checksums is a dumb idea, in which case I apologize for bringing it up. I guess I'm hoping folks will avoid an argument about the fastest or most effective hash, which are things that are easy to measure empirically if you write a careful test. (For example, in a hashmap resolving collisions by probing, a hash is best when unused slots are distributed uniformly, which you can measure as the standard deviation of the distance between non-adjacent empty buckets. I don't think intuitive opinion has a place in arguments based on empirical tests.)
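
For concreteness, here is a rough Python sketch of that kind of measurement. The table size and key set are made up, crc32 is only a placeholder for the hash under test, and gaps are taken between consecutive empty buckets as a simplification of the wording above:

import statistics, zlib

def gap_stddev(keys, size):
    table = [None] * size
    for k in keys:
        i = zlib.crc32(k.encode()) % size
        while table[i] is not None:            # linear probing on collision
            i = (i + 1) % size
        table[i] = k
    empties = [i for i, slot in enumerate(table) if slot is None]
    gaps = [b - a for a, b in zip(empties, empties[1:])]
    return statistics.pstdev(gaps)             # smaller = more even spread

print(gap_stddev(["key%d" % n for n in range(700)], 1024))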


Always use a proper hashing algorithm

Edit: I guess I missed the part where you said you don't want to discuss measurements. However, I still think the tests they use are interesting :-)

For a checksum I would always use a proper hashing algorithm like SHA-256; I would only use a quick hashing algorithm for a hash table. I really hope there are no amusing letter sequences occurring in a crypto hash function's output.
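
For reference, a minimal Python sketch of that kind of checksum (the file name is hypothetical):

import hashlib

with open("foo.txt", "rb") as f:               # hypothetical file
    print(hashlib.sha256(f.read()).hexdigest())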

This looks interesting:
http://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/

Looks like the "city" hash algorithm passed all the tests and was fast.

Also interesting:
http://burtleburtle.net/bob/hash/doobs.html

proper vs amusing

I wanted to provoke comments about things new to me, rather than create an excuse to do the talking myself, so I'll try to be brief. (The more I talk, the more bored I get, because I don't find myself interesting. The only exceptions are when I don't know what I'll write, which happens in complex exchanges.)

If space to store a checksum is constrained (pick a reason), having a lot more bits is not especially helpful, particularly if generating them is slower and this happens often enough to change how CPU-bound something is. (Increasing latency for no value gained is seldom a win.) When you have space to keep a lot of result bits, a cryptographic hash is good. And sometimes, if you want several hashes, say for multiple bloom filter probes, chopping a bigger cryptographic hash into pieces can be cheaper than computing multiple fast hashes, though each piece gets less of the total entropy.
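
A sketch of that chopping in Python, with a made-up probe count and filter size: one SHA-256 digest yields several 32-bit slices to use as bloom filter positions.

import hashlib

def bloom_positions(key, k=4, m=1 << 20):      # k probes, m-slot filter (made up)
    digest = hashlib.sha256(key).digest()      # 32 bytes to slice up
    return [int.from_bytes(digest[4*i:4*i + 4], "little") % m
            for i in range(k)]                 # k disjoint 32-bit slices

print(bloom_positions(b"nnimh\n"))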

(You can write a test using bloom filters that demonstrates a crc32 retains more of the input entropy than a 32-bit slice of a bigger cryptographic hash, which sets fewer than the ideal number of one bits in the filter. In both speed and entropy-retention bakeoffs, crc32 does well. It just has crummy attack resistance, which may not matter for some uses.)
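
A rough harness for that kind of bakeoff might look like this (Python, arbitrary sizes; it just reports how many distinct bits each candidate hash sets, without deciding a winner):

import hashlib, zlib

keys = [("key%d\n" % n).encode() for n in range(100000)]
m = 1 << 20                                    # bit-array size, arbitrary

def distinct_bits_set(hash_fn):
    # More distinct positions set means fewer collisions for these keys.
    return len({hash_fn(k) % m for k in keys})

print(distinct_bits_set(lambda k: zlib.crc32(k) & 0xFFFFFFFF))
print(distinct_bits_set(lambda k: int.from_bytes(hashlib.sha256(k).digest()[:4], "little")))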

If looking for amusement, you can take a large proper cryptographic hash, then reduce it to 32 bits, either through XOR-folding or by slicing out a 32-bit subset. The pigeonhole principle says patterns must occur for some inputs, with probability determined by how frequent the things judged to be patterns are within the whole possible population. Run enough inputs through, and some will be (slightly) amusing. Hex digit anagrams are not very amusing, unless you can get ascending or descending odd or even digits. You can calculate how many string inputs must be tried (in your alphabet of choice) before the odds of getting a desired pattern become high enough to be likely. Note I don't find this interesting enough to banter about, but maybe it will make someone remember something.
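
A small Python sketch of both ideas: XOR-folding a SHA-256 digest down to 32 bits, plus a back-of-envelope count of how many trials the all-nibbles-identical pattern should take.

import hashlib
from functools import reduce

def fold32(data):
    d = hashlib.sha256(data).digest()
    words = [int.from_bytes(d[i:i + 4], "little") for i in range(0, 32, 4)]
    return reduce(lambda a, b: a ^ b, words)   # XOR-fold eight 32-bit words

print("%08x" % fold32(b"nnimh\n"))

# If a pattern class covers p of the 2**32 outputs, expect roughly 1/p
# random inputs before hitting it; all-nibbles-identical covers 16 values.
print(2**32 // 16)                             # about 268 million trials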

Most people seem to have awful intuition about probability, which keeps them from quite grasping the nuances of hash properties, leading to odd conversations that sound more like religion than science. It's not politically correct to say that cryptographic hashes are sometimes bad fits, or that design patterns in OO languages are sometimes bad fits. I like hearing about things I can't figure out from first principles; conversations about principles are boring because I can figure out their consequences.