Incremental parser based on invariant syntax fragments

Dear LtU community,

Currently I'm working on combinator of incremental recursive descent parsers based on PEG grammars. And I would like to share my approach with you.

By the term "incremental" here I mean that the resulting parser is able to continue parsing process from any point of source code. i.e. parsers that are usually used in code editors and IDEs for live coding.

Briefly the problem is that performance of ordinary parsers is O(n) in best case, where n is the size of the source code. And we want to optimise the parser to work in O(m) approximately, where m is the size of incoming changes in the code.

Primitive solution for this problem would be memoization of AST. And binding each cached node to appropriate rule. Then running the parser again to reconstruct AST branch that covers changed part of the code.

However this approach has two disadvantages:

1) It is hard to determine which of the parental AST branch better fits to cover all the implications of changes in source code. Of course we can try each parental branch one by one from the deepest descendants up to root. But each parse try may cost additional computation efforts.

2) Another controversial problem is how to choose which nodes to memoise in order to get better performance and to save memory. Caching all nodes is overmuch.

My solution is to cache only the nodes that strongly cover invariant syntax fragments. "Invariant" means that regardless fragment's internal content it will always be parsed by the same parsing rule. And the internal content don't have any affects to the code outside it.

The example of such fragments could be parenthesis in C-like languages. More specific example is a code between function's arguments. No matter what this fragment consists of, it can always be parsed by a function's argument parsing rule only.

Fortunately selection process of such fragments between simple tokens can be performed by State Machine before the Syntax parse stage. And therefore could be easily put on incremental basis as well.

I have implemented the parser combinator in Scala language, and published it on GitHub. Also there is a blog post with some additional details. Here I also want to mention that library supports error recovery mechanism, and provides API to parse precedence operator expressions with Pratt algorithm.

I will be glad to hear any feedback from you on this approach.

Also I would like to know was this approach applied before in other products? Or at least described theoretically in papers? If so I would like to refer to them in my work. Or maybe it is quite unique? Should I try to publish more formal article then?

Thanks in advance.

Lambda the Ultimate

User login

Navigation

Incremental parser based on invariant syntax fragments

Browse archives

Active forum topics

New forum topics

Recent comments