Judy Stores

A Judy tree is generally faster than, and uses less memory than, contemporary forms of trees such as binary (AVL) trees, b-trees, and skip-lists. When used in the "Judy Scalable Hashing" configuration, Judy is generally faster than a hashing method at all populations.

Some of the reasons Judy outperforms binary trees, b-trees, and skip-lists:

  • Judy rarely compromises speed/space performance for simplicity (Judy will never be called simple except at the API).
  • Judy is designed to avoid cache-line fills wherever possible.
  • A b-tree requires a search of each node (branch), resulting in more cache-line fills.
  • A binary-tree has many more levels (about 8X), resulting in more cache-line fills.
  • A skip-list is roughly equivalent to a degree-4 (4-ary) tree, resulting in more cache-line fills.
  • An "expanse"-based digital tree (of which Judy is a variation) never needs balancing as it grows.
  • A portion (8 bits) of the key is used to subdivide an expanse into sub-trees. Only the remainder of the key need exist in the sub-trees, if at all, resulting in key compression (a toy sketch of this byte-at-a-time descent follows the list).
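
To make the last bullet concrete, here is a toy sketch in C (emphatically not Judy's actual node layout) of how a plain 256-ary digital tree consumes the key one byte per level, so that the levels below only ever deal with the remaining suffix of the key:

    /* Toy 256-ary digital tree over word-sized keys -- an illustration of
     * "use 8 bits of the key per level", not Judy's real data structure. */
    #include <stdint.h>

    #define BITS_PER_LEVEL 8
    #define FANOUT (1 << BITS_PER_LEVEL)     /* 256 children per branch            */
    #define LEVELS ((int)sizeof(uintptr_t))  /* 4 levels for 32-bit keys, 8 for 64 */

    typedef struct node {
        struct node *child[FANOUT];          /* naive: always a full 256-way array */
    } node_t;

    /* Peel off the byte of the key that this level is responsible for. */
    static unsigned digit(uintptr_t key, int level)
    {
        return (unsigned)(key >> (BITS_PER_LEVEL * (LEVELS - 1 - level))) & (FANOUT - 1);
    }

    /* Returns the value stored under key, or NULL if absent.  At the last
     * level the child slot holds the user's value pointer instead of a node,
     * so only the untouched remainder of the key matters below each branch. */
    void *lookup(node_t *root, uintptr_t key)
    {
        node_t *n = root;
        for (int level = 0; level < LEVELS - 1; level++) {
            if (n == NULL)
                return NULL;
            n = n->child[digit(key, level)];
        }
        return n ? (void *)n->child[digit(key, LEVELS - 1)] : NULL;
    }

The glaring cost of this naive version is memory: every branch carries 256 child pointers even when only one or two are in use, which is exactly the weakness described next and the reason Judy needs its zoo of compressed node formats.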

The Achilles heel of a simple digital tree is very poor memory utilization, especially when the N in N-ary (the degree or fanout of each branch) increases. The Judy tree design was able to solve this problem. In fact, a Judy tree is more memory-efficient than almost any other competitive structure (including a simple linked list). A highly populated linear array[] is the notable exception.

Worth a look, considering the open-source LGPL license and range of applications. HP released this library in 2004. From the PDF manual:

Judy arrays are remarkably fast, space-efficient, and simple to use. No initialization, configuration, or tuning is required or even possible, yet Judy works well over a wide dynamic range from zero to billions of indexes, over a wide variety of types of data sets -- sequential, clustered, periodic, random.
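
For a sense of what that looks like in practice, here is a small C sketch using the word-indexed JudyL flavour of the library (based on the JudyL man page; check Judy(3) for the exact macro semantics). The "no initialization" claim is literal: an empty Judy array is just a null pointer.

    #include <stdio.h>
    #include <Judy.h>                    /* JudyL: maps word-sized keys to words  */

    int main(void)
    {
        Pvoid_t judy = (Pvoid_t)NULL;    /* an empty Judy array is a null pointer */
        Word_t *pvalue;
        Word_t  index, bytes_freed;

        for (index = 0; index < 1000000; index++) {
            JLI(pvalue, judy, index);    /* insert: get the slot for this key     */
            *pvalue = index * 2;         /* store the associated value            */
        }

        JLG(pvalue, judy, 123456);       /* lookup; pvalue is NULL if not found   */
        if (pvalue != NULL)
            printf("123456 -> %lu\n", (unsigned long)*pvalue);

        JLFA(bytes_freed, judy);         /* free the entire array in one call     */
        return 0;
    }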

Holy engineering complexity, Batman!

Everyone contemplating using Judy should read Sean Barrett's article at

http://www.nothings.org/computer/judy/

Sean's even been known to post to LtU occasionally (Hi Sean!).

Backhanded compliment

Thanks for that one. Both articles are worth reading. I am always concerned by bit-packing tricks, in this case the low bits of pointers. Yuk.

The critique gives an impression of grudging admiration, with the main complaint being that Judy meets all its claims but fails to prove them well.

Hrm

That's not really what I take away from the critique. I think it's easy to agree that Judy is an impressive piece of engineering. But that's exactly the problem: as a developer, would you really want to include 20,000 lines of carefully performance-tuned code written by someone else into your project? When there is a simple 200 line alternative that performs better in the operating range that most people care about? The hash table solution has slightly higher memory overhead but I think the trade-off should still be pretty clear to most people.

I mean, even if Judy has been used by several projects over the years, I still have to assume that if I use it in a project, then I _might_ be faced with having to fix bugs in it at some point. Pretty scary/risky if you ask me. But maybe I'm just paranoid. :)
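
For readers who haven't seen Sean's article: the "200 line alternative" he benchmarks is an ordinary hash table. The sketch below (mine, not Sean's code) shows the general flavour of such a thing -- a word-to-word map with open addressing -- minus the growth/rehash logic a real version would need:

    /* A deliberately minimal word->word hash table with open addressing and
     * linear probing -- illustrative of the "simple 200-line alternative"
     * being discussed, not Sean Barrett's actual code.  Fixed capacity; a
     * real version would also grow and rehash, and must never be allowed to
     * fill completely or probing will loop forever. */
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t *keys;    /* 0 marks an empty slot, so key 0 is reserved   */
        uintptr_t *vals;
        size_t     mask;    /* capacity - 1; capacity is a power of two      */
    } table_t;

    table_t table_new(size_t capacity_pow2)
    {
        table_t t;
        t.keys = calloc(capacity_pow2, sizeof *t.keys);
        t.vals = calloc(capacity_pow2, sizeof *t.vals);
        t.mask = capacity_pow2 - 1;
        return t;
    }

    static size_t probe(const table_t *t, uintptr_t key)
    {
        size_t i = (size_t)(key * 2654435761u) & t->mask;  /* multiplicative hash */
        while (t->keys[i] != 0 && t->keys[i] != key)
            i = (i + 1) & t->mask;                         /* linear probing      */
        return i;
    }

    void table_put(table_t *t, uintptr_t key, uintptr_t val)
    {
        size_t i = probe(t, key);      /* finds the key's slot or a free one  */
        t->keys[i] = key;
        t->vals[i] = val;
    }

    int table_get(const table_t *t, uintptr_t key, uintptr_t *val)
    {
        size_t i = probe(t, key);
        if (t->keys[i] != key)
            return 0;                  /* key not present                     */
        *val = t->vals[i];
        return 1;
    }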

OS calls? Stdlib?

So do you refrain from ever using functions defined in the standard C library? What about OS system calls? Do you rewrite printf because someone else might have implemented it incorrectly?

If you do GUI programming, things are likely even worse. You probably call tons of library functions to manipulate data or display things on the screen. All of these functions could be wrong, and you could reimplement them yourself in a form you could get at and fix.

Using a threading library is a good example. Here you are choosing to use what is often a very complicated scheduling algorithm when you could no doubt implement a much simpler thread manager for just your program without too much performance degradation.

Ultimately we trust a lot of code we don't write, and I hardly see what difference it makes whether that code is compiled into our project (supposing it is nicely written in standard C), included in the OS or standard libraries, or even automatically generated by a fancy GUI designer. Sometimes this code is quite complicated in order to gain performance, but we don't insist on reimplementing things in a simpler way in our own programs just because they might fail. Heck, at least with Judy you *could* address a bug you find, which you really can't if a standard shared library messes up.

In short, it seems we should apply the same standards to code like this as we do to common shared libraries.

Sorry... But...

... would you please consider changing your nick?

WOT: Real identities are to be encouraged...

...as it tends to promote better discussion, as well as respect for peers.

Real names

I tend to ignore posts that come from strange or cryptic account names, for this reason.

When I tried to decide on a policy regarding real names, people explained why they are reluctant to use their real name on an archived web forum. I respect their feelings, and this is why we never formulated a policy banning this sort of account name. However, I urged people who use such account names to realize that if they want to be a part of the community, and to be taken seriously, they need to use the same user name over time, and give people time to realize they aren't simply trolls.

If so inclined, they can also post a short hello message giving their general background, so others can know who it is they are conversing with.

Real name and Comments

I really fail to see what difference using a 'real' name makes. Names serve as an identifier so you can rate the trustworthiness, reliability and other features of some individual, but consistent use of handles can do exactly the same thing. Is there really some benefit to using a name of the form first name last name?

On an internet discussion such as this there seems to be a fairly small chance that the people who post here are going to know me personally, so seeing my real name (Peter Gerdes, which is now on my userinfo page) doesn't offer people any useful information. However, there is a fairly good chance they have seen me post with my preferred username (logicnazi) at other sites. So in fact my handle is probably more informative than my real name. While one might choose a handle that one never uses elsewhere and use it to avoid taking responsibility for one's opinions, one could just as easily give a false (or, even worse, misleading) 'real' name. In fact, if you do a search for my real name on Google you get a bunch of other people who share the name and little useful information about me. On the other hand, if you search for my handle the top 5 hits are about me and include other discussions in which I have participated.

However, I think this is all beside the point. One of the great virtues of the internet is that it lets us see the ideas of others without being biased by too much information about them. It shouldn't matter whether I am a thirteen-year-old, a distinguished professor of computer science, or even a monkey. If my post makes compelling points and arguments you should listen to it no matter who it comes from. If it doesn't, you shouldn't.

So while I agree that lack of proper identification can cause problems (trolls, spammers, etc.), this is a problem with authentication, not with the type of handles chosen. If you establish a policy of filtering and only allow handles of the form first name last name (either explicitly, or implicitly by ignoring other posts), those people who want to abuse the system will simply comply with false names, meaning you will only succeed in filtering out the genuine contributors who don't want to use their real names.

Am I missing something here? Who benefits if I use my real name rather than a consistent internet handle?

Perhaps the "nazi" part disturbs people

I don't want to put words into anyone's mouth, but my impression is that the original objection was to your specific pseudonym, not to pseudonyms in general. The word "nazi" carries a lot of negative connotations, and associating yourself with the Nazis is perhaps not the best way to present yourself.

Missing the Point

Per Vognsen: But that's exactly the problem: as a developer, would you really want to include 20,000 lines of carefully performance-tuned code written by someone else into your project? When there is a simple 200 line alternative that performs better in the operating range that most people care about? The hash table solution has slightly higher memory overhead but I think the trade-off should still be pretty clear to most people.

IMHO, this misses the point of Judy Arrays:

"No initialization, configuration, or tuning is required or even possible, yet Judy works well over a wide dynamic range from zero to billions of indexes, over a wide variety of types of data sets -- sequential, clustered, periodic, random."

In other words, Judy isn't for "the operating range that most people care about." It's for extremely demanding domains with radically-dynamic conditions that can't necessarily be predicted at compile time and therefore tuned for. So I agree with everyone who says that a simple hash table should probably be used in most cases, but for those other, harder cases...

Awesome

I saw this a while back, but haven't had a use yet. It's like dynamic programming applied to associative arrays.

Simplicity is a practical concern.

The comment that saving space can also save time is very true. On real machines, the space/time tradeoff is considerably more complicated than the "space * time = constant" idealization. Any attempt to use more space in order to save time needs to be tested in reality.

On the other hand, I disagree with the statement that simplicity is only of interest to theoreticians. In the practical world it is one of several concerns, though admittedly simplicity is of greater relative importance to theoreticians.

In many (most?) cases, the efficiency of a hash table is not critical to a program, but correctness *is*. In general, less code = fewer bugs. Not to mention it's a heck of a lot easier to review and/or verify a simple hash table implementation than one that takes 20,000 lines of code. I'll bet that there are at least 10 bugs in Judy's code.

I'm also bothered that the source code is available, but the explanation of how it works is not.

In all, the claims are reasonable, and in the right context Judy trees might be a good choice.

Only Apparent simplicity matters!

Sure, simplicity is a good thing, but the entire point of modern programming is that one can manage complexity through abstraction. Thus what really matters is apparent simplicity: is the interface to Judy trees simple and easy to use?

Using Judy trees is just like using standard library functions for sin and cos rather than writing a simpler but slower version yourself. In short, I hardly see why it matters whether the code is compiled in or seamlessly integrated into your compiler distribution (see my earlier posts for more points).

So long as we have good evidence that the code is reliable and lives up to the claims, I hardly see why this matters. Just put it in its own directory and don't touch it. Simplicity can't always be better than complexity; otherwise you would insist on returning to a kernel with a really simple scheduler, or insist your program didn't make use of things like copy-on-write.

Taking Sides

I haven't RTFA (just skimmed through). It is certainly feasible to access a 20k-line library through a simple API.

My concern would be that, from what I read in the comments, it doesn't wipe the floor with the 200-line version. In that case you have to assess what the cost of using Judy will be. Is it cheaper to rely on Moore's law? How much are you going to spend on maintaining your copy of Judy, educating people, porting it to different languages, etc.? Inherent simplicity may seriously matter.

It's just not fast enough to be worth the risk

I think the parent makes a critical point.

The reason we use the STL quicksort over a nice simple bubblesort is that QS stomps it into the ground on non-trivial datasets. Because stl::sort is in a standard library, it has been tested much more extensively than Judy. We use windowing and OS libraries both because they are standardized and therefore relatively well tested, and because we don't have much of a choice in the matter. But we do have a choice with Judy, and that's an awful lot of untested code (relatively speaking) versus a small performance increase.

I'm suspicious that Judy is even as hot as the article indicates -- 20,000 lines is a lot of code. I don't know how long Judy's hot paths are, but I'd be willing to bet they're longer than the hot paths through the 200 line hash table code. That may not matter in a small benchmark program -- there's plenty of room in the various caches to keep the pipelines full. But I'm doubtful that Judy could maintain that performance when there's large amounts of other code contending with it in the caches. I could be wrong. I've seen (and written) very large codes that nonetheless were very quick indeed, because the hot paths were very very short, and all the other lines handled very rare (but complicated) special cases. Doubtless a careful examination of Judy's code would reveal its hot paths. But I'm not terribly interested in tackling that much code if the best it can offer is a 2x best-case improvement over a naive hash table.

Amount of Testing

I agree the concern over hot paths and instruction caches could be a real one (depending on what the rest of the project's code looks like). However, suppose you do real-world tests and it turns out that Judy is actually quicker than using a hash table. Then it no longer matters that it uses 20,000 lines of code, as the extra memory hit for your code is going to be quite small, both in absolute terms and relative to the size of your data, in any application where this might matter.

I also agree that how much the code has been tested makes a big difference. I even agree that 20,000 lines of code requires more testing than something that does the same thing in 200 (there are more possible ways it can go wrong). However, this is hardly an insurmountable problem. If the code base is widely used and shipped in high-end HP products, I would be more inclined to trust it than a lot of library or template code.

Sure, perhaps STL quicksort has been very well tested (though one does need to distinguish between different implementations), but plenty of other functions in the STL haven't, and we know that some of them are even incomplete or just buggy. Code used in GUI applications is often even worse. The evidence from Windows shows that many of its library functions or calls do have bugs, yet few people insist on doing everything as low-level as they can to avoid these bugs. Moreover, if your standard was really that you wouldn't use new library calls or similar code until it had passed tests proportional to the amount of code involved, you would have to refrain from using any new OS feature. Even worse, you would have to stop using a library feature each time its backend was reprogrammed; e.g., if you discovered the scheduler in your thread library had been rewritten with 10x more code, but was faster and had better behavior, would you insist on moving back to your own simple internal scheduler?

So not only do we use long stretches of library code to do things efficiently that could be done more simply (often without proportional testing); I suggest that we are correct to do this. The worry is of course that down the line a bug will be discovered which impacts your program and will need to be fixed. If this code is something you wrote, then you will need to track it down and fix it. However, if the code is part of a library or OS you often get these bugs fixed for free, and thus potential errors in your program are often fixed even before you become aware of them (i.e., the weird library-call cases are fixed).

Thus if you believe a particular library function is supported by a dedicated group of coders, you are actually better off using the function (even if it is internally complicated) than rewriting a simple version yourself. Exactly the same holds true when you compile the code into your program. If you think that Judy will be supported (either by the developer or an OS group) then you can fix bugs simply by downloading an updated source tree and plugging it in. If, on the other hand, you think the project may go dark, all your concerns are justified.

Complexity Analysis?

Is there a complexity analysis of the "Judy tree" algorithm(s)?

For example, I see references to quicksort as if that were an efficient algorithm. Quicksort is average-case O(N*log(N)), but it is worst-case O(N**2), and so not preferable to other sorts such as heapsort (worst-case O(N*log(N))) if you need a result within a specified time. If you've spent significant time in a shop that uses quicksort, there will always be a nightmare story about the run in which quicksort didn't terminate.
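
To make the worst case concrete, here is a textbook quicksort with a last-element pivot (a generic illustration, not any particular library's implementation). On input that is already sorted, every partition puts all the remaining elements on one side, so the recursion is N levels deep and the comparisons sum to roughly N**2/2:

    /* Textbook quicksort with a last-element (Lomuto) pivot -- shown only to
     * illustrate the worst case, not taken from any production library. */
    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    void quicksort(int *v, long lo, long hi)      /* sorts v[lo..hi] inclusive */
    {
        if (lo >= hi)
            return;
        int pivot = v[hi];                        /* last element as the pivot */
        long i = lo;
        for (long j = lo; j < hi; j++)            /* Lomuto partition scheme   */
            if (v[j] < pivot)
                swap_int(&v[i++], &v[j]);
        swap_int(&v[i], &v[hi]);
        /* On already-sorted input the pivot is the maximum, so the left
         * recursion gets hi-lo elements and the right gets none: depth N,
         * comparisons (N-1) + (N-2) + ... + 1 = O(N**2). */
        quicksort(v, lo, i - 1);
        quicksort(v, i + 1, hi);
    }

(In practice, most modern std::sort implementations use introsort, which switches to heapsort when the recursion gets too deep, precisely to cap this worst case at O(N*log(N)).)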

Before investing in Judy trees I'd want the worst-case and average-case complexity analysis.

256-ary Morphing Digital Tree

This is how I understood it to work after glancing over it.

It seems to be a (kind of) digital tree where nodes have size 1..256 depending on the number of entries in each node; the data structure for the node varies with its size. This would give O(log_256 n) worst-case insertion and lookup, where n is the key. There is hardly any need to balance if the keys are 32 or 64 bits, since the depth of the tree is then at most 4 or 8. Of course, there are pathological cases if the index is a string of bytes; or, I guess, if there is a lot of sharing on the prefix of the key.
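
To picture the "data structure for the node varies with its size" part, here is a crude sketch (purely illustrative; Judy's actual repertoire of node types, and its trick of tagging them in the low bits of pointers, is far more elaborate). A branch starts life as a small list of (digit, child) pairs and is promoted to a full 256-pointer array only once it has enough children to justify the space:

    /* Illustrative only: a branch node that "morphs" from a compact linear
     * form into a full 256-way array as its population grows. */
    #include <stdint.h>
    #include <stddef.h>

    #define LINEAR_MAX 7                /* promote beyond this many children  */

    enum node_kind { LINEAR, FULL };

    typedef struct node {
        enum node_kind kind;
        union {
            struct {                    /* small population: a few pairs      */
                uint8_t      count;
                uint8_t      digit[LINEAR_MAX];
                struct node *child[LINEAR_MAX];
            } lin;
            struct {                    /* large population: direct indexing  */
                struct node *child[256];
            } full;
        } u;
    } node_t;

    /* Return a pointer to the child slot for one 8-bit digit of the key, or
     * NULL when the compact node has no entry for that digit. */
    struct node **child_slot(node_t *n, uint8_t digit)
    {
        if (n->kind == FULL)
            return &n->u.full.child[digit];           /* single array index   */
        for (uint8_t i = 0; i < n->u.lin.count; i++)  /* short linear scan    */
            if (n->u.lin.digit[i] == digit)
                return &n->u.lin.child[i];
        return NULL;
    }

    /* Insertion (not shown) would append a (digit, child) pair to the linear
     * form and, once count exceeded LINEAR_MAX, copy the pairs into a freshly
     * allocated FULL node -- that copy is the "morph". */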