V-Parser

V-Parser is a novel chart parsing algorithm I've been developing recently. The first open source implementation is at GitHub. You can test it at online grammar development environment. This is the pseudocode:
01  DECLARE chart: [][], text: STRING;
02 
03  FUNCTION Parse (grammar, input)
04      text ← input;
05      chart.CLEAR ();
06      MergeItem (0, [grammar.TOP_RULE], 0, null);
07      FOR each new column in chart
08          FOR each new item in column
09              FOR each alternation of item.Sequence[item.Index]
10                  MergeItem (column.Index, alternation.sequence, 0, item);
11  
12      RETURN chart;
13  
14  PROCEDURE MergeItem (offset, sequence, index, parent)
15      item ← chart[offset].FIND (sequence, index);
16      IF not found item THEN
17          item ← {Sequence: sequence, Index: index, Inheritable: [], Inheritors: [], BringOver: []};
18          chart[offset].ADD (item);
19  
20      inheritors ← [item] UNION item.Inheritors;
21      IF item.Index + 1 == item.Sequence.LENGTH THEN
22          inheritable ← iff (parent is null, [], [parent] UNION parent.Inheritable);
23      ELSE
24          inheritable ← [item];
25          IF parent is not null THEN item.BringOver.ADD_IF_NOT_EXIST (parent);
26  
27      FOR each x in inheritable
28          FOR each y in inheritors
29              x.Inheritors.ADD (y);
30              IF (x.Sequence, x.Index) not in y.Inheritable THEN
31                  y.Inheritable.ADD (x);
32                  IF x.Index + 1 < x.Sequence.LENGTH AND y is terminal succeeded in text at offset THEN
33                      FOR each z in x.BringOver
34                          MergeItem (offset + y.LENGTH, x.Sequence, x.Index + 1, z);
For a sake of simplicity, we present the algorithm that operates on classical context free grammar (CFG) where each rule is represented in the following form:

X -> A B C ...

where X is a rule name, and A B C ... is a sequence of rules named: A, B, C, and so on. A, B, or C may be terminal constants, as well. Alternations are noted by having the same rule name on the left side over multiple rule definitions in a grammar.

V-Parser is a chart parser that groups parsing items into columns that correspond to offset from the beginning of input string. Columns are incrementally processed, never looking back into previous columns in the chart. V-Parser stores its items in chart as pairs of a sequence and an index of the sequence element. This way it is always possible to know what is an ahead element of the current item (we just increment index property). The main function Parse serves as a loop over columns, items and their alternations. It repeatedly calls MergeItem procedure to populate the chart onwards.

MergeItem procedure creates a new item in the current column determined by offset only if the item doesn't already exist. Properties Inheritable and Inheritors are used as pointers to parents and items that inherit these pointers, respectively. Parents in Inheritable property are accumulated over children, meaning that each child has pointers to all of its direct or indirect parents.

Lines 20-25 make sure that Inheritable is properly set up in a case of pointing to non-last index of the symbol seuence. BringOver property is used to remember parent ahead symbols, and is used when we get to the point when we reach the last sequence symbols in parsing.

Lines 27-34 loop over each inheritable, and further reach to each inheritor. If inheritor is a successfully parsed terminal, and if inheritable is an item with non-last index, algorithm populates ahead item in corresponding chart column. The whole loop basically makes sure that newly realized ahead symbols are properly distributed over the chart, at positions determined by relevant past parse terminals, including the current one.

The algorithm stops when it runs out of new items in further columns.

Building compact parse forest (CPF) by V-Parser is trivial, because we just have to assign parsed extents to terminals. Reporting errors should also be relatively easy by analysing the last column in the chart.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Input and output

The first thing that makes my implementation of the parser different from the others is its compact parse forest (CPF) output. Given an input grammar-tree, CPF takes almost the same form as the input. The only difference is enrichment by extents property at objects that represent terminals. These extents hold an information about where the terminals occur in parsed input string, also as which is their right extent. If we want to later process CPF, we have to keep track of these extents, constructing the actual sequences of rules by investigating extents of their direct / indirect terminals. It appears that this is just enough to recreate parsed input string, by pairing the right extents of previous rules to the left extents of next rules in sequences. CPF is being extremely friendly to memory, as it holds minimal range of informations about parsing. However, we have to be careful when processing CPF because it also holds informations about uncompletely parsed sequences. That means that prior to processing a sequence at given place, we have to check if the sequence is parsed up to its last element. Once we know this, we can be sure that the sequence fits into given place. Otherwise, we can use this incompleteness to indicate an error in parsing the sequence at given place.

Input of my implementation is also somewhat different from what is usually seen in other parsers. It is a kind of structured BNF-ish rules that are passed in a form of JSON. These rules modularly fit one into another, thus forming a grammar tree. The grammar tree is cloned upon parsing and reused to build CPF.