The Problem With Parsing - an essay

After some frustration with trying to write yet another parser in C, I decided to bang out some ideas on programming, software design, and the software language problem into an essay entitled "The Problem With Parsing - A World Transformation Discussion".

It's not too technical, but a friend of mine read it and found it interesting so I thought I would share it with you all. You can read it at:

http://kruhft.blogspot.com/2006/03/problem-with-parsing-world.html

--
kruhft


Maybe you could use Boost.Spirit or Gold Parser.

With Gold Parser, you can write your LALR grammar into the Gold Parser IDE, which will produce a table of values, then use a Gold Parser engine to parse text according to the table.

With Boost.Spirit, you write the parser as a series of parser objects using C++ operators to bind those objects together.
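The "parser objects bound together with operators" idea can be sketched outside C++ as well. What follows is not Spirit's API; it is a minimal Python illustration of the same style, assuming `>>` means "followed by" and `|` means "or". The names `Parser`, `lit`, `number` and `parse` are invented for this example.

```python
class Parser:
    """Wraps a function (text, pos) -> (values, new_pos), or None on failure."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, text, pos=0):
        return self.fn(text, pos)

    def __rshift__(self, other):
        # Sequence: self followed by other; the value lists are concatenated.
        def seq(text, pos):
            first = self(text, pos)
            if first is None:
                return None
            second = other(text, first[1])
            if second is None:
                return None
            return (first[0] + second[0], second[1])
        return Parser(seq)

    def __or__(self, other):
        # Alternative: try self, fall back to other at the same position.
        def alt(text, pos):
            result = self(text, pos)
            return result if result is not None else other(text, pos)
        return Parser(alt)

    def many(self):
        # Repetition: apply self zero or more times, concatenating the values.
        def rep(text, pos):
            values = []
            while True:
                result = self(text, pos)
                if result is None or result[1] == pos:
                    return (values, pos)
                values += result[0]
                pos = result[1]
        return Parser(rep)


def lit(s):
    """Parser that matches the exact string s."""
    def match(text, pos):
        if text.startswith(s, pos):
            return ([s], pos + len(s))
        return None
    return Parser(match)


def number():
    """Parser that matches one or more decimal digits and yields an int."""
    def match(text, pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end == pos:
            return None
        return ([int(text[pos:end])], end)
    return Parser(match)


# Grammar: expr := number (("+" | "-") number)*
num = number()
expr = num >> ((lit("+") | lit("-")) >> num).many()


def parse(text):
    result = expr(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("parse error")
    return result[0]
```

Here `parse("1+2-3")` returns the flat list `[1, '+', 2, '-', 3]`. The grammar line reads almost like the grammar itself, which is the appeal of this style; the cost is that precedence follows the host language's operators, so the parentheses around the alternative are required.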

Messy, but not difficult

one of the most difficult problems with programming is the movement of data from the human oriented representations given by the users to the format required by the computer. this is a technique known as parsing, and is the source of many hours of frustration on the part of many programmers

I disagree that it's one of the most difficult problems. Parsing always seems to be messy in one way or another, but I wouldn't classify it as difficult (unless it's just something you haven't done much of, you've never used yacc-like tools, or you've only written parsing code in C).

Still, avoiding parsing by having nicely designed data formats is certainly preferable.

Parsing can be easy...

Parsing can be easy, if you use the right language. For example, if you were using LISP or Scheme, parsing LISP-like or Scheme-like code is a snap -- not only does the stuff parse without any headaches, but you get the AST for free.
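In Lisp or Scheme the built-in reader does this for you. To show why it is so little work, here is a rough Python sketch of a reader for the same kind of syntax. It assumes only parentheses, integers and bare symbols (no strings, quotes or comments), and the function names are invented for the example.

```python
def tokenize(source):
    """Split source text into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def atom(token):
    """Integers become ints; everything else stays a symbol (a plain string)."""
    try:
        return int(token)
    except ValueError:
        return token


def read(tokens):
    """Consume tokens from the front of the list and return one expression."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(read(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the closing ")"
        return items
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return atom(token)


def parse_sexpr(source):
    """Parse exactly one expression; nested lists are the AST."""
    tokens = tokenize(source)
    tree = read(tokens)
    if tokens:
        raise SyntaxError("trailing input")
    return tree
```

For example, `parse_sexpr("(define (square x) (* x x))")` returns `['define', ['square', 'x'], ['*', 'x', 'x']]`. The nested lists are already the tree, so there is no separate AST-building step.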

lisp is interesting

yes, i am aware of the simplicity of parsing in lisp, but unfortunately, when writing a language parser in lisp, you end up with a language that...ends up looking like lisp (since it's so easy :)

not that there's anything wrong with looking like lisp...

i find the syntax malleability of haskell very interesting, and after finding gofer with its tiny implementation size and ability to easily recreate the prelude i've been thinking of playing with redoing the haskell syntax a bit.

i think my problem is my attachment to the 'low level', which comes from my days of work doing assembly and c programming on various tiny machines in various industries; i'm trying to adapt to using higher level languages, but if i don't know what's going on under the hood i tend to have a bit of a problem with using them. can't really let go i guess :)

--
kruhft

Parsing can be easy and fun!

From my experience, parsing can get really fun and easy, provided you have tools that "ease the pain". I would recommend the excellent "Toy Parser Generator", which lets you build a parser in minutes, provided you know some Python and quickly read the (concise) doc!

no, it didn't have much to do with parsing...

and yes, it was an unedited brain dump with some ideas that i'd been having about the subjects that were discussed in the writing. it was not meant to be new, but something that might be interesting and entertaining to read about the topics that we deal with when dealing with computers and related fields of mathematics and philosophy.

And yes, I do know when and how to use capitals, but the point of the piece was to write something; I find it unfortunate (but not surprising) that people stopped reading when they saw there were none. I will attempt to be more formal in my next attempt at submitting content.

--
kruhft

The reader has no obligation to decrypt your writing

Eschewing capitalization is like writing uncommented code - it optimizes for the writer instead of the reader. If you want people to read your writing make it easy for them to do so. The web has a vast amount of free content and an audience with the attention span of gnats.

I don't care if an essay

I don't care if an essay eschews the use of capitalization as long as it is consistent in its chosen style, which it was (at least in regards to capitalization).

Granted, the essay WAS a braindump (with the author even commenting himself that it might not have much to do with parsing), but I would be sad to see people refuse to read it because of its lack of caps, as opposed to its content.

I would be sad to see people

I would be sad to see people refuse to read it because of its lack of caps, as opposed to its content.

There's a very good logical reason for this. As has already been mentioned, there's tons of stuff to read on the web, so we have to make judgements as to what's worthwhile. Lack of capitals conveys laziness, lack of education, and sloppy thinking. Whether any of these are actually true in a particular situation is beside the point -- it's a matter of probability.

I also consider it common courtesy that you should follow conventions to make your writing easier to read.

Not so sure ...

I'm not disagreeing with the poster, but I'm not so sure it's a matter of courtesy or conventions.

The glyphs we use for letters have evolved over time for various reasons, but they happen to serve a very important function now: uppercase letters are a simple form of markup. If I see a period followed by an uppercase letter (neglecting whitespace), my brain tells me that I'm probably looking at the start of a sentence. My brain doesn't have to work as hard to find the end of the sentence. Furthermore, it's been demonstrated (although I couldn't quote any references) that we generally read by looking at the shapes of the words, more so than the actual individual letters. Take away the capitals, and you've eliminated one of the cues we use to distinguish words in a sentence. It's even worse with all-caps: capital letters tend to all have the same square-ish shape, so when you write in all-caps, you make things really hard to read, because the reader has to scan individual letters.

Having said all of this, I'm not sure it's worth dismissing something just because there's no capitalisation ... it just makes it a little harder to read.

The glyphs we use for

The glyphs we use for letters have evolved over time for various reasons, but they happen to serve a very important function now: uppercase letters are a simple form of markup. If I see a period followed by an uppercase letter (neglecting whitespace), my brain tells me that I'm probably looking at the start of a sentence.

I've just started to think that I've been reading so much code over the past five years that I've just started to pick apart sentences and compile them around punctuation. The lack of conventions with code (or the proliferation as it may be) has literally changed how I read; I scan for 'terms', analyze and move on to the next. This could also have to do with the cut up, non-linear, Faustian style of writing that I've been appreciating for the past many years.

Reading programs is very much up to interpretation of the code; to figure out what the program does requires that you execute it in your head and the same can be said of certain types of literature. Normal writing can get boring after a while when you're used to mixing and mashing the content to find meaning (in a bit of a Cabalistic sense, I would guess). Oh well, at least it passes the time.

This whole branch of the

This whole branch of the conversation has become ridiculous.

Look: some people mind lack of capitalization; some people don't. I don't see the choice to eschew capitalization as one of laziness, but merely as a personal preference.

Nevertheless, everyone is entitled to their opinions. I would not deign to take that right away from others. But I do feel that the amount of intellectual thought and energy devoted to this particular pet peeve is beyond what is necessary. Frankly, this sort of stuff doesn't belong on a discussion thread; it only serves to derail the topic of conversation. In fact, that's already happened here.

Stylistic Choice

I don't see the choice to eschew capitalization as one of laziness, but merely as a personal preference.

In this case it was a stylistic choice. I have two main sites, kruhft.blogspot.com, which is my site for my 'artsy' side, and metashell.blogspot.com, which is my more technical side. I was a bit hesitant to post one of my more philosophical 'brain dumps' to a site like this, but programming language research has led me along a fair number of paths, towards linguistics, mathematics and philosophy; I thought that since I was directed in those ways by the topics that I studied, others who are interested in programming languages (such as the readers of this site) would be directed in that way as well.

Like I said previously, the article was meant for entertainment, and I did feel a tinge of regret once I posted, remembering that I didn't use CAPS, and the response here was akin to a large red F on the front of the paper that I just handed in. Of course the internet is not university, and although there is a glut of content and information available and one has to have filters just to even consider reading it, I was hoping that some people would forgive me in that respect and actually read the content, as unedited and dumpy as it was.

I apologize for the long and pointless discussion about literary stylistic choices; I was hoping for a discussion on computational philosophy :)

Uncommented code

I like uncommented code; at least there isn't a half-arsed effort to describe what was being done in an initial implementation 5 versions ago, lingering around to throw you the wrong way. As the saying goes, "Code that was hard to write should be hard to read". Proper tools and an understanding of program structures are much more important than the occasional comment. Literate programming is a bit different of course, but then again that's a completely different style.

"Code that was hard to write

"Code that was hard to write should be hard to read"

It's much harder to write code that's easy to read than it is to write code that's hard to read. Writing readable code is the sign of a good programmer who understands what he's doing. Unreadable code is the sign that a programmer is out of their depth. Programmers who find it hard to write code find it hard to write code that is easy to read.

It seems to me this thread

It seems to me this thread has run its course.

I don't see it becoming informative from a PLT perspective.

language features and patterns

Unreadable code is the sign that a programmer is out of their depth.

Of course, it could be the reader who is out of their depth when it comes to programmers who go beyond the standard features of a language and actually use what is available in the specification. I've worked with people who couldn't understand the use of a function pointer jump table (granted, not the most esoteric feature of the C or ASM languages, but one that is very useful in certain situations), and I admit that a program using that type of 'software machine' can be difficult to read, at least at first glance. A lot of code requires some thought before it can be understood; simply glancing over the comments, which are generally poorly written and omitted from the parts that actually need explanation, does not help with understanding a section of code and generally *hinders* it.

--
kruhft
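
For readers who haven't met the jump table mentioned above: in C it is an array of function pointers indexed by an opcode or state, used in place of a long switch statement. A rough Python analogue might look like the sketch below. The opcodes and the tiny stack machine are invented for illustration and are not taken from the comment.

```python
def op_push(stack, arg):
    """Push the literal argument onto the stack."""
    stack.append(arg)


def op_add(stack, arg):
    """Pop two values and push their sum (arg is unused)."""
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


def op_mul(stack, arg):
    """Pop two values and push their product (arg is unused)."""
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)


# Opcodes double as indices into the table, as they would in C.
PUSH, ADD, MUL = 0, 1, 2
JUMP_TABLE = [op_push, op_add, op_mul]


def run(program):
    """Execute a list of (opcode, argument) pairs and return the final stack."""
    stack = []
    for opcode, arg in program:
        JUMP_TABLE[opcode](stack, arg)  # dispatch by index, no if/elif chain
    return stack


# (2 + 3) * 4
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None)]
```

Running `run(program)` leaves `[20]` on the stack. The dispatch line is where readers tend to get lost: the control flow lives in the data (the table) rather than in visible branches, which is both the strength of the technique and the reason it takes a second look.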

Any programmer who uses

Any programmer who uses esoteric features of the programming languages should be able to use the language to express what their program is doing. If not, they are working beyond their ability, concentrating too much on the technology and not addressing wider, social concerns. All languages have abstraction mechanisms that allow a programmer to express how higher-level concepts map down to implementation details.

You're right about comments: too many comments is a sign that the programmers have a poor understanding of the language.

I enjoyed your thoughts

Too bad grammar nazis and those afflicted with personality disorders had to ruin the thread