archives

A new implementation of recursive-descent parsing?

It is well known that the classic way to model a recursive-descent parser is to write (or automatically produce) a set of mutually-recursive functions either by writing procedures or combining parser modules using a DSL (like in Boost::Spirit) or using templates/generics (for those languages that support such a concept).

All the different types of implementations work in a similar manner: each procedure invokes one or more procedures. This approach has some drawbacks:

the stack can easily overflow, especially when parsing deeply-nested rules.
some of the results may be thrown away, making parsing slow.
left-recursive grammars can not be handled.
backtracking has a cost.

While trying to approach the issue of RD parsers from a different perspective, I think I have found an implementation which solves some of the problems mentioned above. I hope I am not repeating someone else's work (a brief search over the internet found no similar approaches).

The idea is that instead of the input data travelling down the tree of parsers, each parser processes the current input and then leaves a continuation on a stack. The main parse engine pops the continuation off that stack and executes it; the new continuation puts another state on the stack etc. Thus an RD parser becomes a sort of state machine.

Generally an RD parser consists of the following operations:

sequence; parsers are sequentially invoked to parse the input
choice; a parser is invoked when another parser fails.
loop; a parser that is previously invoked is invoked again.

The traditional approach handles these operations in the following ways:

sequence: formed 'naturally' by placing one statement after the other or placing parser objects in a list.
choice: 'if' statements are used to select another possible branch.
loop: a parsing procedure is invoked again with new parameters.

For example (in pseudo-Java):

class sequence_parser implements parser {
    list parsers;
    iterator parse(iterator input) {
        foreach(p, parser) {
            input = p.parse(input);
        }
        return input;
    }
}

class choice_parser implements parser {
    list parsers;
    iterator parse(iterator input) {
        foreach(p, parser) {
            iterator output = p.parse(input);
            if (output != null) return output;
        }
        return null;
    }
}

class loop_parser implements parser {
    parser other;
    iterator parse(iterator input) {
        input = other.parse(input);
        return parse(input);
    }
}

The new approach handles these tasks differently. The approach uses the following context for parsing:

a stack of parsers that is used for sequences
a stack of parsers that is used for choices
an iterator over the input
a stack of saved contexts used for backtracking

Parsing operations are handled like this:

sequence: a new parser is pushed onto the sequence stack.
choice: the altenative branch parser is pushed on the choice stack, then the first branch of the choice is invoked.
loop: the parser pushes itself on the sequence stack, then invokes another parser.

For example:

class sequence_parser implements parser {
    parser sequence_next;
    parser this_parser;
    iterator parse(context c) {
        c.sequence_stack.push(sequence_next);
        return this_parser.parse(c);
    }
}

class choice_parser implements parser {
    parser choice_next;
    parser this_parser;
    iterator parse(context c) {
        c.save();
        c.choice_stack.push(choice_next);
        return this_parser.parse(c);
    }
}

class loop_parser implements parser {
    parser this_parser;
    iterator parse(iterator input) {
        c.sequence_stack.push(this);
        return this_parser.parse(c);
    }
}

The parsing algorithm then becomes a very simple loop:

void parse(context c) {
    while (true) {
        parser = c.sequence_stack.pop();
        if (parser != null) {
            iterator output = parser.parse(c);
            if (output == null) {
                parser = c.choice_stack.pop();
                if (parser) {
                    c.restore();
                    c.sequence_stack.push(parser);
                }
                else throw parse_exception();
            }
        }
        else (if c.input_exhausted()) {
            return;
        else {
            throw parse_exception();
        }
    }
}

The above does the following:

it pops a parser from the sequence parser stack.
if the sequence parser stack is empty, then if input is exhausted then parsing succeeded, lese parsing failed.
if the sequence parser stack is not empty, then the parser object is invoked to parse the input; it is passed the context.
if the parser succeeded, then the loop continues
otherwise a parser is popped from the choice stack and
a context state is popped from the stack and activated.
The parsing loop continues if the choice parser is not empty.

This approach solves some of the problems of the traditional approach in the following ways:

stack overflow becomes an impossible task because user-defined stacks that can automatically grow is used; the practical limit is the process' memory limit. The user-defined stacks do not grow very large because the main parsing loop continuously pops items from them and new entries are pushed on the stack on a need basis only.
backtracking becomes a constant-time operation no matter how deep the parser has proceeded, because a parser state is popped from the stack. A context save is composed of the sequence & choice parser stack position and the input position (it can be down to 12 bytes if three 32-bit integers are used).

By Achilleas Margaritis at 2006-07-03 13:19 | LtU Forum | 35 comments | other blogs | 24599 reads

2006 ICFP Programming Contest registration opens

Language lovers:

Registration is now open for the 9th Annual ICFP Programming Contest!

http://icfpcontest.org

The contest, associated with the International Conference on Functional Programming, will be held on the weekend of July 21-24. The contest task will be released at noon EDT on Friday, and entries will be accepted until noon EDT on Monday. Registration is free and open to all. Teams may participate from any location, and may use any programming language(s). Last year, 360 participants formed 161 teams from 26 countries.

Prize money totaling $1750 US will be awarded to help defray the costs of travel to the conference for the winners and for small cash prizes. In addition, the winners of the contest will receive bragging rights for the programming language of their choice. This makes the contest a popular avenue for demonstrating the superiority of a favorite language, or for exercising an experimental tool.

Though the specifics are secret until the contest begins, we promise that this year's task will be very different from past competitions. This year's theme is "computational archaeolinguistics."

Stay tuned for more information as the contest approaches!

- 2006 Contest Organizers
CMU Principles of Programming Group
icfpcontest-organizers@lists.andrew.cmu.edu

By Tom7 at 2006-07-03 21:31 | LtU Forum | 9 comments | other blogs | 12072 reads

2006 ICFP Contest registration opens

Registration is now open for the 9th Annual ICFP Programming Contest. The ICFP contest is an event that traditionally raises interest in the LtU community.

A more detailed announcement found in the forum mentions that this year's theme is "computational archaeolinguistics." Intriguing.

Lambda the Ultimate

User login

Navigation

A new implementation of recursive-descent parsing?

2006 ICFP Programming Contest registration opens

2006 ICFP Contest registration opens

Browse archives

Active forum topics

New forum topics

Recent comments