A data-directed approach to lexing.

I invented a simple way to write lexers. I should probably write a formal paper about it, but I have neither an academic institution nor a business pushing me to, so I probably won't. It is data-directed, so its flow of control is far simpler than that of most lexers. It relies heavily on bitmask operations instead of control logic.

The lexer uses a hash table. The table is keyed by an input character and, optionally, by the position in the token where that character is read. There is a 'base' entry for every character, automatically generated from the character's properties and from the lexical rules for simple nonfixed tokens like identifiers and numbers. There are additional entries for characters that can appear in fixed tokens, keyed by both the character and the positions at which it appears in those tokens. The data stored in the table consists of bitmasks.

Each bitmask has one bit per token class you want to recognize. Most of these are fixed tokens such as keywords and operators; some are nonfixed tokens such as identifiers, numbers, and the text of inline constants. Depending on the complexity of the nonfixed tokens, some may need to be represented by more than one bit. For example, if you have a "no tabs in indentation" rule, you may have type-1 whitespace that consists of a newline followed by some number of spaces, and type-2 whitespace that may contain spaces and tabs but no newlines. The parser can then enforce the no-tabs-in-indentation rule by declaring an error if it finds any instance of type-1 immediately followed by type-2.

Each fixed token corresponds to one bit on the bitmask. If the bit is on, then that keyword or syntax has not yet been ruled out as the value of the token being parsed.
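
To make the data structure concrete, here is a minimal C++ sketch of the key, the bitmask, and the table. The names and the layout (a 64-bit mask, a few nonfixed classes in the low bits, fixed tokens in the bits above them) are invented for illustration; nothing here comes from an existing library.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

using Bitmask = std::uint64_t;                 // one bit per token class

// Low bit positions: the nonfixed token classes.
enum : int { BIT_IDENTIFIER = 0, BIT_NUMBER = 1, BIT_WS_TYPE1 = 2, BIT_WS_TYPE2 = 3 };
constexpr int NONFIXED_COUNT = 4;              // fixed tokens occupy the bits from here up

// Key: a codepoint plus the position within the token at which it was read.
struct TableKey {
    char32_t ch;
    int      pos;
    bool operator==(const TableKey&) const = default;
};

struct TableKeyHash {
    std::size_t operator()(const TableKey& k) const {
        return std::hash<std::uint64_t>{}((std::uint64_t(k.ch) << 8) ^ std::uint64_t(k.pos));
    }
};

// (character, position) -> bitmask of the token classes still possible.
using LexTable = std::unordered_map<TableKey, Bitmask, TableKeyHash>;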

So the lexer works thus:

// PSEUDOCODE ONLY. THIS IS NOT EXPECTED TO COMPILE, THIS HAS NOT BEEN TESTED, AND IT HAS NOT EVEN BEEN FORMALLY VERIFIED.
{
   for (index = 0, workmask = oldmask = ALL_ONES; workmask != 0; index++) {
      input = readInput();
      oldmask = workmask;                          // remember the last nonzero mask
      workmask &= table(input, index);
   }
   unread(input);                                  // push the excluding character back into the file cache,
                                                   // to be read as the first character of the next token.
   do {
      tokenidx = floor(log2(oldmask));             // index of the highest bit still set
      if (tokenidx < nonfixedcount ||              // a nonfixed token (identifier, number, ...) is returned directly;
          tokenlengths[tokenidx] == index - 1)     // a fixed token is returned only if its length matches the number
         return tokenidx;                          //   of characters actually consumed (the last read was pushed back).
      oldmask &= ~(0x1 << tokenidx);               // otherwise clear that bit and try the next candidate.
   } while (oldmask != 0);                         // until we have returned, or all remaining set bits have been checked.
   error("This never happens.");                   // if we get here it's a bug, so report an error.
}

At the start of a new token, the lexer sets its working bitmask to all ones. Then it iterates through the input. Each time it reads a character C at position P, it fetches table(C,P) from the hash table and ANDs it with the working bitmask, remembering the last nonzero value of the mask. When the mask finally goes to zero, that remembered value identifies the candidate tokens.

So for example, we set our working bitmask to all ones, then process 'w', 'h', 'i', 'l', 'e', ' ' by masking it with table('w',0), table('h',1), table('i',2), table('l',3), and table('e',4). At this point the bit for the 'while' keyword is the only fixed-token bit still set. When we mask with table(' ',5) the mask goes dark. We check that the length of the token corresponding to that bit matches the number of characters we've consumed, and output 'while'.

(Note: in this case the 'while' bit and the 'identifier' bit will both be set. After masking with table(' ',5) both will be unset, so both are candidates under the 'last nonzero mask' rule. But given a conflict between a keyword and an identifier we always pick the keyword, because the keyword occupies a higher bit position and the highest-set-bit search finds it first.)

The length check handles things like '-' and '--' and '-=' all being valid syntax. After masking with table('-',0) all three bits will still be set. A character at position 1 that is neither '-' nor '=' then clears all three at once. But only one of those three finalists, '-' itself, has a length equal to the number of characters consumed.
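
To make the length check concrete, here is a tiny self-contained C++ demonstration of exactly that scenario. The bit positions and the masks are hand-written for this one example, not produced by a real table:

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical bit layout for this demo only:
    // bit 0 = identifier (nonfixed), bit 1 = "-", bit 2 = "--", bit 3 = "-=".
    constexpr int NONFIXED_COUNT = 1;
    const int   tokenlength[] = {0, 1, 2, 2};        // lengths of the fixed tokens, indexed by bit
    const char* tokenname[]   = {"identifier", "-", "--", "-="};

    // Suppose the input was "-x": after '-' at position 0 the mask still holds the
    // bits for "-", "--" and "-=", and 'x' at position 1 then cleared all of them.
    std::uint64_t lastNonzeroMask = 0b1110;          // the mask as it stood after '-'
    int charsConsumed = 1;                           // the 'x' was pushed back

    while (lastNonzeroMask != 0) {
        int bit = int(std::bit_width(lastNonzeroMask)) - 1;      // highest bit still set
        if (bit < NONFIXED_COUNT || tokenlength[bit] == charsConsumed) {
            std::printf("accepted token: %s\n", tokenname[bit]); // prints: accepted token: -
            return 0;
        }
        lastNonzeroMask &= ~(std::uint64_t{1} << bit);           // rule this candidate out, try the next
    }
    std::printf("no token matched\n");
}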

There is a 'base' bitmap for each character. These simply record a character's relationship to the nonfixed categories, with properties such as whether it's a digit, a hex digit, a binary digit, a possible identifier character, and so on. They are auto-generated from the character's properties alone and leave the fixed-token bits clear, so a character read at a position where it takes part in no fixed token rules every fixed token out.
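
Here is a hedged sketch of how those base bitmaps might be auto-generated. It uses ASCII-range tests as a stand-in for real Unicode character properties, and the same invented bit positions as the earlier sketch; the fixed-token bits are deliberately left at zero.

#include <cstdint>

using Bitmask = std::uint64_t;

// Invented bit positions for the nonfixed classes.
enum : int { BIT_IDENTIFIER = 0, BIT_NUMBER = 1, BIT_WS_TYPE1 = 2, BIT_WS_TYPE2 = 3 };

// Build the base mask for one codepoint from its character properties alone.
// Every fixed-token bit stays zero, so a character at a position where it takes
// part in no fixed token rules all the fixed tokens out.
Bitmask baseMask(char32_t c) {
    Bitmask m = 0;
    bool isDigit  = (c >= U'0' && c <= U'9');
    bool isLetter = (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_';
    if (isDigit)             m |= Bitmask{1} << BIT_NUMBER;
    if (isLetter || isDigit) m |= Bitmask{1} << BIT_IDENTIFIER;      // position-keyed entries could still reject digits at position 0
    if (c == U'\n' || c == U' ')  m |= Bitmask{1} << BIT_WS_TYPE1;   // newline plus spaces
    if (c == U' '  || c == U'\t') m |= Bitmask{1} << BIT_WS_TYPE2;   // spaces and tabs, no newline
    return m;
}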

What I like about this method is the absolute simplicity of setting it up. You simply iterate through your list of fixed tokens, doing very obvious things with each:

// PSEUDOCODE ONLY. THIS IS NOT EXPECTED TO COMPILE, THIS HAS NOT BEEN TESTED, AND IT HAS NOT EVEN BEEN FORMALLY VERIFIED.

// Fixed tokens have the highest priority, so they get the highest bit positions;
// nonfixed tokens like identifiers and numbers keep the low positions 0 .. nonfixedcount-1.
for (tok = 0; tok < fixtokencount; tok++) {
   bitplace = nonfixedcount + tok;
   bitval = 0x1 << bitplace;
   for (place = 0; place < length(fixtokenlist[tok]); place++) {
      letter = fixtokenlist[tok][place];
      Tmask = table(letter, place) if it exists, a copy of the base mask for letter if it doesn't;  // get existing or default bitmask
      Tmask |= bitval;                                     // turn this token's bit on for this (letter, place) entry
      table(letter, place) = Tmask;                        // store the changed mask back in the table
   }
   tokenlengths[bitplace] = length(fixtokenlist[tok]);     // record the token's length for the final length check
}

And you are done with fixed tokens. The lexer will recognize all of them, won't hallucinate tokens that don't exist, will run reasonably efficiently, and most important, the source will explain very clearly exactly how and why.

There is no "mysterious" or "difficult" or even "complicated" thing going on here. Deterministic finite-state Automata and Chomsky grammars and state machines are lovely, but the way this works is so brutally simple it should be clear to any student on the first day it's explained.

Benefits:

Stunning simplicity of implementation. Better scaling than state-machine or flow-of-control approaches. It takes advantage of bit operations to do in parallel a lot of work that would otherwise require sequential, carefully state-dependent logic, and it is likely to be easier to parallelize further than existing approaches.

Costs:

One hash-table lookup per character read is definitely the most expensive thing I'm doing. Depending on the size of the character set in use and the size of the bitmap, the table can break L2 cache even on high-end desktop machines. A complex language with ~120 fixed tokens (so a 16-byte bitmap suffices) and entries for all ~169K Unicode codepoints (excluding private use and surrogates) would make the table about 2.6 megabytes plus overhead. Overhead is brutal for hash tables, so that's no smaller than 4.5 megabytes, which definitely means a lot of L2 cache misses. An implementation excluding Chinese characters and using an 8-byte bitmask, on the other hand, could have a table under 0.5 megabytes plus overhead, which will fit handily into most L2 caches.
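
For what it's worth, here is the back-of-envelope arithmetic behind those figures, written out; the 12 bytes of per-entry hash overhead is my own assumption, picked to land on the numbers above rather than measured.

#include <cstdio>

int main() {
    // Rough figures from the paragraph above -- estimates, not measurements.
    const double entries  = 169000;    // codepoints that get a table entry
    const double payload  = 16;        // bytes of bitmap per entry
    const double overhead = 12;        // assumed per-entry hash overhead (key, bucket bookkeeping)

    std::printf("payload only:  %.1f MB\n", entries * payload / (1024 * 1024));              // ~2.6 MB
    std::printf("with overhead: %.1f MB\n", entries * (payload + overhead) / (1024 * 1024)); // ~4.5 MB
}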

I'm using a logarithm operation to get the highest bit set. On reflex this seemed wrong because "transcendental functions are SLOOOWW!" On reflection, though, the worry is misplaced here: finding the highest set bit doesn't require evaluating a real logarithm at all. Modern CPUs provide it directly as a single integer instruction (count leading zeros / bit scan reverse), so it costs barely more than an ordinary arithmetic operation.
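
A hedged C++20 one-liner for that operation (std::bit_width is standard; GCC and Clang also expose __builtin_clzll):

#include <bit>
#include <cstdint>

// Index of the highest set bit of a nonzero mask, with no floating point at all;
// on x86 this compiles down to a single BSR/LZCNT instruction.
inline int highestSetBit(std::uint64_t mask) {
    return int(std::bit_width(mask)) - 1;   // e.g. 0b1100 -> 3
}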


Not sure this is new

This looks familiar, but I can't find where I've seen it before. It could just be one of those things that gets reinvented.

Depending on how you set up your tables, there are tricks that make the lookup step cheaper in memory. If you can preprocess your token list and spit out code to recompile, the compiler can do a lot for you: https://lemire.me/blog/2023/11/07/generating-arrays-at-compile-time-in-c-with-lambdas/

It could be.

It could very easily be one of those things that gets reinvented. As I said, it's blazingly simple once you've thought of it. I sort of hope at least a few people know about it; it would make me feel less guilty about not writing a paper.

I've been working on it and already made the hash table a lot more efficient. Basically I switched from tuple hashing to index hashing.

I put the bitmaps into a predefined sorted array, then hash on just the character rather than on character-and-position. Instead of getting a bitmap directly, I get an offset into that array by hashing the character, and index forward from that offset by the character position to get the correct bitmap. Fewer elements are hashed, and the hashed elements themselves are just indexes into the array, so the hash table is much smaller and the array uses memory efficiently.
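
In sketch form (simplified, with invented names rather than the real code), that lookup looks something like this:

#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using Bitmask = std::uint64_t;

struct CharSlot {
    std::uint32_t firstBitmap;   // offset of this character's position-0 bitmap in `bitmaps`
    std::uint32_t count;         // how many consecutive positions have bitmaps for this character
};

struct IndexedTable {
    std::vector<Bitmask> bitmaps;                   // all bitmaps, packed and grouped by character
    std::unordered_map<char32_t, CharSlot> byChar;  // hashed on the character alone

    // Bitmap for (character, position), if that character takes part in a fixed token there.
    std::optional<Bitmask> lookup(char32_t c, int pos) const {
        auto it = byChar.find(c);
        if (it == byChar.end() || std::uint32_t(pos) >= it->second.count)
            return std::nullopt;                    // fall through to the property map (next sketch)
        return bitmaps[it->second.firstBitmap + std::uint32_t(pos)];
    }
};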

And, since the vast majority of Unicode codepoints out there are not part of any fixed token, the vast majority of them don't need full bitmaps. So I can allow the hash table to say "not found" and use the codepoint itself as an index into a "property map", where a few bits say things like "is defined in Unicode", "is legal in identifiers", and "is legal in text literals". Three bits or a nybble per character is much nicer to someone's L2 cache than a sixteen-byte bitmap.
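
And the "not found" path, sketched the same way; the property flag names and the bit positions are invented for the example:

#include <cstdint>
#include <vector>

using Bitmask = std::uint64_t;

// A few property bits per codepoint (packed one byte per codepoint here for clarity;
// a real implementation can pack them into three bits or a nybble).
enum : std::uint8_t {
    PROP_DEFINED  = 1 << 0,   // codepoint is assigned in Unicode
    PROP_IDENT_OK = 1 << 1,   // legal in identifiers
    PROP_TEXT_OK  = 1 << 2,   // legal in text literals
};

constexpr int BIT_IDENTIFIER   = 0;   // invented nonfixed bit positions
constexpr int BIT_TEXT_LITERAL = 4;

// When the hash table says "not found", synthesize a mask from the property map:
// every fixed-token bit stays zero, and only the nonfixed classes the character
// can belong to are turned on.
Bitmask maskFromProperties(const std::vector<std::uint8_t>& propertyMap, char32_t c) {
    std::uint8_t props = (c < propertyMap.size()) ? propertyMap[c] : 0;
    Bitmask m = 0;
    if (props & PROP_IDENT_OK) m |= Bitmask{1} << BIT_IDENTIFIER;
    if (props & PROP_TEXT_OK)  m |= Bitmask{1} << BIT_TEXT_LITERAL;
    return m;
}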

I haven't got as far as "recompile the compiler" yet; I have a static constant "property map" table, plus code that transforms a static list of fixed tokens into a set of bitmaps, sorts those bitmaps into order, and then populates a hash table with indexes into that array.

The "property map" table is stable; As long as I'm experimenting and the set of fixed tokens can still be modified, it's nice to just be able to modify that list and recompile. In practical terms the time added to the startup is very small.

Once the set of fixed tokens is stable, the array of bitmaps and the hash table can replace the keyword list as static constant data: build the array and the table one last time and have the program spit them out in source form. After the next recompile they would be static constant definitions read from include files.
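
Spitting the finished array out in source form is the easy part; something along these lines would do (a sketch with an invented file layout):

#include <cstdint>
#include <cstdio>
#include <vector>

// Write the finished bitmap array out as a source file, so that after the next
// recompile it can be a static constant instead of being rebuilt at startup.
void emitBitmapArray(const std::vector<std::uint64_t>& bitmaps, std::FILE* out) {
    std::fprintf(out, "// Generated file -- do not edit by hand.\n");
    std::fprintf(out, "static const unsigned long long kBitmaps[%zu] = {\n", bitmaps.size());
    for (std::uint64_t b : bitmaps)
        std::fprintf(out, "    0x%016llxULL,\n", (unsigned long long)b);
    std::fprintf(out, "};\n");
}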