Text Processing by Functional Language?

Doing computational linguistics, I've passed through several languages and settled on Python as the most convenient thing so far for text processing. I've also read a lot about the attractions of functional programming and have wanted to try applying it to my own work.

However, whenever I see FP application to things like string manipulation and regexes (e.g. in Haskell), it seems horribly messy to me, done just to show that it can be, not because it's a good idea.

Am I the victim of bad examples or is text processing something that just shouldn't be done with a functional language?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Ocaml have regexps

functional programming and regular expressions are not related.

Ocaml does have regular expressions. However, you cannot use the ocaml match construct to deconstruct a string using a regular expression. You might be interested in using Camlp4 or Camlp5 for that.

Mi(c|k)match

And the place to start with that would likely be with Martin Jambon's micmatch (or mikmatch) packages.

Have you seen this?

Have you seen this?

Lazy Functional Programming

Thanks for that reference. It pointed me to some interesting work.

I'm wondering about the nature of lazy functional programming. Briefly reading about LFP and looking over some Haskell code, I'm reminded of work in natural language processing with Prolog. Although the declarations are functions rather than implications, the approach to problem decomposition seems the same.

For programs of realisitc size, Prolog requires you to provide at lot of guidance to the resolution engine. It seemed to me that you could always get the same result with less effort by procedural programming. I gather that is why Lisp has usually been preferred over Prolog for AI research (at least in the US).

How does Haskell compare to Prolog in this respect? Are you able to scale up without having to provide procedural guidance? How different is problem formulation in the two paradigms?

Functional Programming Languages excel at text processing

In fact, I would say that they are one of the more compelling applications of FP. Take XML processing, for instance. (Admittedly a bit different than natural language processing, something I don't personally know much about.) Representing XML as a simple value, as an s-expression or algebraic datatype, is far superior to the SAX or DOM.

Speaking of Haskell, check out Text.ParserCombinators.Parsec. It's perfectly capable of elegantly handling regular expressions, as it has sequencing (via do blocks), unions via <|>, and kleene closure via the function many. There are also libraries for pretty printing and whatnot, several bindings to PCRE if you want a more "conventional" syntax for regular expressions, and so forth.

I also like and use Python, though it's far from my favorite language.

As always it depends...

Functional languages are really good at allowing you to define a mapping cleanly. This is essentially what they were defined for, as many programming tasks can be reduced to mapping data from one domain to another. Some forms of text processsing fall neatly into this category.

Leon above mentions Parsec which is a really nice way to write parsers. The result of doing something complex in Parsec is something simpler than using a standard grammar tool and attributes. The source of this simplicity is that everything is being done in a single language, so you don't have as many semantic boundaries to worry about.

For the kinds of text processing that most language people do(er, meaning artifical languages) everything can be broken down to a simple structure and repeated. The simple structure is transform this string of symbols into this nice clean tree of symbols. Then map each tree onto a new tree by considering some properties of the symbols. For artificial (context-free or simpler) languages a functional program is an elegant way of writing this tree construction and mapping.

Now, when you say natural language it depends entirely on what you mean. I'll assume that you mean something context-sensitive that can't be represented exactly by a grammar tree. Or something that would require a grammar tree + exceptions because it is naturally a graph.

Unfortunately graph manipulation is a lot less elegant in functional languages than tree manipulation. Because the problem of graph analysis or graph mapping doesn't break cleanly into similar sub-problems like a tree. Instead there is a certain amount of overlap, or sharing, between sub-problems and you end up mutating state to account for it. For this reason program that manipulate graphs are more difficult to write in a functional language, than in a n imperative language. I should qualify that, by adding, in my own experience and opinion. Indeed I would be interested in hearing if other people have found the converse to be true.

A question that I would ask is what kind of processing are you going to do on these natural language texts?

Graphs

Andrew makes a number of salient points, moreso than my original post. I was being little more than a cheerleader for functional programming in that one. :)

I've done some work with graph algorithms in functional languages, and I would agree that FP is rather... disappointing... in what it offers. However, just because the relative benefits are much smaller when dealing with graphs algorithms than list- and tree- structured data doesn't mean that functional programming is any worse than the imperative alternative.

A few thoughts:

1. The Functional Graph Library looks interesting and useful, though I haven't used it extensively. By the same token, it doesn't quite feel right. A number of the higher-order functions the library offers don't have nice properties in general, and worse, it's not clear what properties their arguments should have to in order for the result to have nice properties.

2. Functional programming is a fairly big win when it comes to graph construction. One way to represent a graph is by a function from nodes to neighboring nodes, and this representation can often be conveniently defined and manipulated via higher order functions. For example:

type Graph a = a -> [a]

divisorGraph :: Graph Integer
divisorGraph  = \n -> [ d | d <- [1..n],   n `mod` d == 0  ]

cross :: Graph a -> Graph b -> Graph (a, b)
cross f g = \(a, b) -> [ (a', b') | a' <- f a,  b' <- g b ] 

This representation is also reasonably handy for graph algorithms. It would be interesting to see what somebody could do with these ideas and defunctionalization/refunctionalization. (Maybe somebody's already explored this avenue?)

3. Graph algorithms. Yeah. Not so hot in Haskell/Lisp. However, I don't think it's so terrible either. I'm certainly not convinced that graph algorithms in functional languages are any worse than graph algorithms in imperative languages.

After all, Haskell is one of the finest imperative languages ever devised. (Ok, I'm a bit biased. ;-) You can always employ the ST monad if you need some efficient mutation.

And, I still have hope that someday somebody will come along and show us the way, and make complicated graph algorithms nearly as easy and as natural as complicated tree algorithms. Hopefully these idioms will be applicable in Haskell and Lisp.

Actual text-processing languages

There have been languages designed for text processing specifically. You may want to have a look at Snobol, Icon and AWK.

These days a more popular approach seems to be to take a strong enough general-purpose programming language and to add support for regular expressions as a non-core feature. This separation has its benefits, but the best possible support for text processing is not one of them.

Full disclosure: I'm paid to design and develop (but not to advertise) OmniMark, a text-processing language. So I may be biased.

Thanks to its powerful type system and clean design, Haskell is a great language to host any DSL in, and its laziness makes it additionally suitable for text processing. The only reason I wouldn't recommend it today is because the libraries (or DSLs) for text processing are not mature yet. The introduction of ByteString still has to propagate through higher-level libraries. I believe a recent revision of Parsec (for example) has been parameterized to work over non-list container types, and it may still be unstable.

FP is a good choice...

In Feb this year (2008) I had a contract for 3 weeks and I don't think I could have pulled off what I did unless I had used PLT Scheme. Essentially I wrote a parser for 15 year old Dec/VAX DATATRIEVE record definitions and then used the AST to create a DATATRIEVE procedure that created a massive CSV file.

FP really makes you think about what you are doing to the point that I found myself using pen and paper before entering the code!

Being able to create ad-hoc list structures and pass them around in a CPS style was also something that not many other languages could have manged *quickly*... the ability to create closures containing not just data but handlers as well is really useful. Of course you can do that in other languages as well.

The other major benefit (at least for me) with FP is all of the state-based problems that just do not arise at all as most of the time you are passing the return from one function into another.

The PLT Scheme was the only real choice at the time because it has a fantastic library that allowed me to string-=split and also use telnet and ftp to transfer data to and from the VAX.

Functional Programming + Coffee = Awesome solutions.

What kind of language is sed?

The sed language has no variable bindings and is incapable of I/O. If we imagine that ; or newline are "backwards composition", and matches are guards, then a sed program is essentially a bunch of compositions and function guards. It's a delightful language for text processing. The difficulty most other functional programming languages have is lack of syntax support -- s/// and // and backwards composition. To do it in Haskell, I think you'd need a mix of template Haskell support and a new monad. I should do it, really, if I can find time...

Tetris!

I can't resist posting Tetris in sed. You have to press enter after each keystroke for it to process. Press enter at the beginning to see instructions. It won't work with all versions of sed.