Sawzall - a popular language at Google

Interpreting the Data: Parallel Analysis with Sawzall

"The query language, Sawzall, operates at about the level of a type-safe scripting language. For
problems that can be solved in Sawzall, the resulting code is much simpler and shorter – by a
factor of ten or more – than the corresponding C++ code in MapReduce."
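The paper's own running example gives a feel for that level. The complete Sawzall program below (as given in the paper) reads a stream of float records and computes their count, sum, and sum of squares; the per-record execution and cross-machine aggregation are implicit:

```
count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;

x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;
```

The program runs once per input record; the `emit` statements feed aggregator tables that are merged across the cluster, which is exactly the part that would be hand-written boilerplate in the C++ MapReduce API.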


MapReduce

Is this basically the same as this?

Yah.

Albeit with a pleasantly Unix-derived scripting syntax.

Another one of those non-innovative "inventions" that leverage the stock growth of smoke and mirrors.

I'd Agree...

...except Google isn't claiming this is innovative or any particular kind of "invention" other than an in-house tool that's easier to use than the C++ API to MapReduce. From that point of view, it's a pretty clear win: non-C++ programmers can knock out various kinds of analyses that are useful to them, and do so very, very quickly.

I agree.

Though lately the argument that something is "easier to use" is starting to annoy me greatly. I think the time has come to give some more formal justification. :)

From the paper

From the paper:

The query language, Sawzall, operates at about the level of a type-safe scripting language. For problems that can be solved in Sawzall, the resulting code is much simpler and shorter – by a factor of ten or more – than the corresponding C++ code in MapReduce.

Formal or no, that's a pretty convincing comparison. On a bit of a tangent, though, is there a formal way to determine ease-of-use? The term "can of worms" comes to mind... :)

No, it isn't.

I can get code up to 3 times shorter simply by clever use of gzip! (Or better yet, by compressing complicated and long variable names into easy-to-use two-letter sequences.)
:)

My trollish opinion on the matter is that questions such as these can only be answered if we strictly and formally define our computational model first.

Hurrah

I can get code up to 3 times shorter simply by clever use of gzip! (Or better yet, by compressing complicated and long variable names into easy-to-use two-letter sequences.)

Or the proud Perl tradition of just squishing everything into 80-character lines of pure code, yes. The point is that Sawzall's syntax is clearly not complex, certainly no more complex than C++'s, so the comparison of languages is a little more justified. We are falling into very fuzzy arguments around here, though.

It does look like the designers of the language treated the design with an appropriate level of strictness, though. They wanted a statically-typed language which works well with the sort of processing typically applied to MapReduce, and they created that.

is this really the best way?

Given the slow network, it seems you would want to push as much computation onto the cluster nodes as possible. This seems to just select some data and send it back to the aggregators.
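For what it's worth, the paper's design does address this: the per-record filter phase runs on the nodes that hold the data, and only the emitted values, not the raw records, cross the network to the aggregators. A sketch of that shape in Sawzall (the `LogRecord` type and `origin` field here are hypothetical, not from the paper):

```
# This program runs once per record, on the node storing that record.
queries_per_origin: table sum[origin: string] of int;

log: LogRecord = input;
# Only the emitted (key, 1) pair travels to the aggregators.
emit queries_per_origin[log.origin] <- 1;
```

So the selection and any per-record computation happen locally; what gets "sent back" is already the reduced form.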

I have also wondered how this approach would work for more traditional OLAP applications. Wouldn't it be better to run a database over something like TeraGrid?

dAWK

Feels like "distributed AWK", right?
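The analogy holds up in one concrete way: like AWK, Sawzall runs an implicit per-record loop over its input, but it replaces AWK's END-block accumulation with aggregator tables that merge across machines. A sketch of the Sawzall counterpart of AWK's `{ total += $1 } END { print total }`, assuming each input record is a single float:

```
# One float per record is assumed; "input" is the current record.
total: table sum of float;
x: float = input;
emit total <- x;
```

The difference is that the summing happens in a distributed aggregator rather than in a single process's END block.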