Revisiting AWK

I was dusting off my old copy of "The AWK Programming Language" by Aho, Kernighan and Weinberger (IMHO, one of the best programming books ever written) and decided to see what was being done with AWK these days.

A very interesting networking extension called Gawkinet has been added to gawk. What is particularly interesting is how they managed to add networking capabilities to the language by introducing one new operator, "|&", and treating network addresses like files. This leads to examples like:

     
     BEGIN {
       "/inet/tcp/0/localhost/daytime" |& getline
       print $0
       close("/inet/tcp/0/localhost/daytime")
     }

Interesting...

Neat hack

Thanks for pointing that out, Todd. awk might be getting a bit long in the tooth, but the old dog still has some new tricks up its sleeve!

Some strategic sacrifices are made at the altar of simplicity, but they always stay true to the spirit of awk. I quote from the manual:

[T]he usual client/server asymmetry found at the level of the socket API is not visible here.... If this asymmetry is necessary for your application, use another language. For gawk, it is more important to enable users to write a client program with a minimum of code.

Precisely!

By the way, gawkinet actually extends rather than introduces the "|&" operator. [Please correct me if I'm wrong about this.] "|&" was created as an awk-style way to set up pipes for communicating between co-processes, and gawkinet extended it to work with network sockets. I think that's rather neat.
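For readers who have not seen the co-process form, here is a minimal sketch of "|&" talking to an ordinary program instead of a socket. It assumes gawk is installed (the operator is a gawk extension, not POSIX awk), and the function name sort_via_coprocess is made up for illustration.

```shell
# Sketch only: assumes gawk is on PATH; sort_via_coprocess is a hypothetical name.
sort_via_coprocess() {
    gawk '
    BEGIN {
        cmd = "sort"                 # the co-process: gawk writes to it and reads from it
        print "pear"  |& cmd
        print "apple" |& cmd
        print "fig"   |& cmd
        close(cmd, "to")             # close only the write end so sort sees end-of-input
        while ((cmd |& getline line) > 0)
            print "sorted: " line
        close(cmd)                   # close the read end as well
    }'
}

if command -v gawk >/dev/null 2>&1; then
    sort_via_coprocess
else
    echo "gawk not found; skipping" >&2
fi
```

The same "|&" and getline pairing is what the /inet example in the post relies on; only the string on the left of the operator changes.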

gawkinet is old news

The latest development in awk-land is xgawk.

For example, this code prints out an XML document as an indented text outline:

     BEGIN        { XMLMODE=1 }
     XMLSTARTELEM {
       printf("%*s%s", 2*depth, "", XMLSTARTELEM)
       for (i in XMLATTR)
         printf(" %s='%s'", i, XMLATTR[i])
       print ""
       depth++
     }
     XMLENDELEM   { depth-- }

It is essentially a high-level programming interface to SAX-style streamed XML processing.

Gawk ugly hacks...

As a Plan 9 user I'm not impressed by gawk and its many hacks. In Plan 9 we have /net, a real file system that is available to any language or tool.

Awk works best for certain tasks, but it's not the best general purpose language. Gawk tries to make up for that with tons of extensions and hacks which kill the most important beauty of awk: that it is simple and clean, and that it does one thing and does it very well.

A very good and powerful combination which I use all the time is rc and awk. The two are quite different but very simple, and they complement each other extremely well.

"The AWK Programming Language" is certainly one of the best programming books ever written, like all other books by Kernighan.

As for xgawk, I won't say anything except that it is sad to see something as clean and pure as awk infected by the XML disease.

Ugly hacks?

While I'm still not sure I like how xgawk works, it isn't part of the core (yet?). From looking at the gawk documentation it appears that the current maintainer (Arnold Robbins) has done a good job keeping a fairly minimal core (even to the point of NOT fixing some of awk's warts). This is why I like the way networking is integrated into the file handling code.

There are quite a few limitations in awk that prevent it from becoming a general purpose language. What is interesting is how you can take a minimal language like awk and extend it without destroying its original intent or feel.

For instance: gawk has a special variable called BINMODE that allows a file to be handled without character translation. Its original intent was to allow more flexible end-of-line handling, but some people are trying to use it to read "binary" data. It works. Mostly. RS still determines what the record separator is (which gets in the way of binary data), and you still only have getline to read data. But the inclusion of BINMODE solves some problems for some people. Is BINMODE an ugly hack, or are the things people try to do with it the ugly hacks?
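To make the RS point concrete, here is a small sketch that feeds gawk a byte stream containing NUL bytes. It assumes gawk is installed; the function name nul_record_lengths and the sample bytes are invented for illustration. As far as I can tell from the gawk documentation, BINMODE only has an effect on non-POSIX systems, so on Linux the assignment below is harmless but does nothing.

```shell
# Sketch only: assumes gawk is on PATH; nul_record_lengths is a hypothetical name.
nul_record_lengths() {
    # Sample input: the bytes "ab", NUL, "cde", NUL, "f"
    printf 'ab\000cde\000f' | LC_ALL=C gawk '
    BEGIN {
        BINMODE = 3     # no character translation on read or write (non-POSIX systems)
        RS = "\0"       # records are still split on RS, here the NUL byte
    }
    { print NR, length($0) }   # record number and its length in bytes
    '
}

if command -v gawk >/dev/null 2>&1; then
    nul_record_lengths
else
    echo "gawk not found; skipping" >&2
fi
```

Even with translation switched off, the data still arrives as records cut at whatever RS says, so a format without a convenient separator byte remains awkward to read.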

lots of little tools

I don't think Gawk is attempting to become a general-purpose language for Unix aficionados -- Perl already has that niche sewn up. A quick look at comp.lang.awk shows that the community is aware of awk's strengths -- one-line text processing in shell pipelines, and simple scripts for use as filters.

I think that the Unix shell tools would benefit from a real /net interface like Uriel mentions, but gawk is a simple way to get generic TCP/IP into the shell environment.

xgawk is a reaction to XSLT's verbose approach to simple XML transformations.

Here is my philosophy: The key to effective Unix shell programming is to never attempt to do too much at one time, but instead to perform a single discrete operation on a data stream per stage in the pipeline. If any of your stages significantly breaks the four-dozen-line barrier, then there are three likely issues:

  1. You haven't decomposed the problem enough.
  2. You are using the wrong tool or language for this particular pipeline stage.
  3. Your problem doesn't do well in the shell pipeline paradigm.
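A rough sketch of what one small job per stage can look like. The input format ("user bytes" pairs) and the function names sample_log and top_users are made up for illustration, and only POSIX awk, sort and head are assumed.

```shell
# Sketch only: sample_log and top_users are hypothetical names with made-up data.
sample_log() {
    printf '%s\n' 'alice 120' 'bob 300' 'alice 80' 'carol 50' 'bob 10'
}

top_users() {
    # Stage 1: sum the second column per user.
    awk '{ total[$1] += $2 } END { for (u in total) print u, total[u] }' |
    # Stage 2: order by total, largest first.
    sort -k2,2nr |
    # Stage 3: keep the top two.
    head -n 2
}

sample_log | top_users
```

Each stage is a line or two and can be swapped out or tested by itself, which is the point of the approach.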

I think gawk is a good language for that approach, and I think that the constraints of the shell approach are being kept in mind by the gawk language designers. I think it's still clean.