LtU Forum

A data-directed approach to lexing.

I invented a simple way to write lexers. I should probably write a formal paper about it but I have no academic institution nor business pushing me to, so I probably won't. It is data-directed, so it has flow-of-control far simpler than most lexers. It relies heavily on bitmask operations instead of control logic.

The lexer uses a hash table. The table is keyed by an input character and optionally by the position in the token where the character is read. There is a 'base' entry for every character, automatically generated from the character properties and the lexical rules for simple nonfixed tokens like identifiers and numbers. There are additional entries for characters which can appear in fixed tokens, indexed by both the character and the positions in which it appears in fixed tokens. The data stored in the table is bitmasks.

The number of bits in each bitmask in this system is one per token class you want to recognize. Most of these are fixed tokens such as keywords and operators. Some are nonfixed tokens such as identifiers, numbers, and the text of inline constants. Depending on the complexity of the nonfixed tokens some may need to be represented by more than one bit. For example if you have a "no tabs in indentation" rule you may have type-1 whitespace that consists of a newline followed by some number of spaces, and type-2 whitespace that may contain spaces and tabs but no newlines. The parser could then enforce the no-tabs-in-indentation rule by declaring an error if it finds any instances of type-1 immediately followed by type-2.

Each fixed token corresponds to one bit on the bitmask. If the bit is on, then that keyword or syntax has not yet been ruled out as the value of the token being parsed.

So the lexer works thus:

// PSEUDOCODE ONLY. THIS IS NOT EXPECTED TO COMPILE, THIS HAS NOT BEEN TESTED, AND IT HAS NOT EVEN BEEN FORMALLY VERIFIED.
{
   for (index=0, workmask = oldmask = ALL_ONES; workmask != 0; index++) {
      input = readInput();
      oldmask = workmask;
      workmask &= table(input, index);
   }
   unread(input)                        // push excluded letter back into file cache, to be read as first character next time.
   repeat{ 
      tokenidx = round(log2(oldmask));                        // get the index of the highest bit set
      if ((tokenidx < fixtokencount) ||                       // if it's the bit of a nonfixed token return that 
          (tokenlengths[tokenidx] == index return(tokenidx);  // if it's the bit of a fixed token check to make sure length matches.
      oldmask -= 0x1<<tokenidx;                               // clear the bit.
      }
   until (oldmask == 0);                                      // until we have returned or all remaining set bits have been checked.
   print ("This never happens.");                             // If we get here it's a bug, so report an error.
}

At the start of a new token, it sets its working bitmask to all ones. Then it iterates through the input. Each time it reads a character C at position P, it fetches bitmask(C,P) from the hash table and ANDs it with its working bitmask.

So for example, we can set our working bitmask to ones, then process 'w', 'h', 'i' 'l', 'e', ' ', by masking it with table('w',0), table('h',1), table('i',2), table('l',3), and table('e',4). At this point the bit for the 'while' keyword is the only one still set. When we mask with table(' ',5) it goes dark. We check that the length of the token corresponding with that bit matches the number of tokens we've processed and output 'while'.

(note: in this case the 'while' bit and the 'identifier' bit will both be set. After processing the table(' ',5) both will be unset, so both are candidates in the 'last nonzero mask' rule. But given a conflict between keyword and identifier, we will always pick the keyword.)

The length check is to handle things like '-' and '--' and '-=' all being syntax. After masking with table('-',0) all three will still be set. Some letter in position '1' would mask all three at the same time. But only one of those three finalists has length equal to the number of characters processed.

There is a 'base' bitmap for each character. These simply define a character's relationship to the non-fixed categories with properties such as whether it's a digit, hex digit, binary digit, possible identifier character, etc. These are auto-generated based on the character's properties, and ignore the fixed tokens completely.

What I like about this method is the absolute simplicity. of setting it up. You simply iterate through your list of fixed tokens, doing very obvious things with each:

// PSEUDOCODE ONLY. THIS IS NOT EXPECTED TO COMPILE, THIS HAS NOT BEEN TESTED, AND IT HAS NOT EVEN BEEN FORMALLY VERIFIED.

for (bitplace = fixtokencount; bitplace >0;  bitplace--){         // fixed-tokens have highest priority so they get highest bit positions 
   bitval = 0x1 << bitplace;                                       // nonfixed tokens like identifiers and numbers will have low bit positions
   for (letters in fixtokenlist[bitplace] )
       Tmask = table(letter,place) if it exists, table(letter,base) if it doesn't.  // get existing or copy default bitmask for letter and place
       Tmask &= ~bitval;                                                            // poke a hole at bitplace
       table(a,place) = Tmask.                                                      // store the changed mask back in the table
    }
    Lexerlengths[bitplace]=place;                                                   // store the length of the token in the lexer's length table.
}

And you are done with fixed tokens. The lexer will recognize all of them, won't hallucinate tokens that don't exist, will run reasonably efficiently, and most important, the source will explain very clearly exactly how and why.

There is no "mysterious" or "difficult" or even "complicated" thing going on here. Deterministic finite-state Automata and Chomsky grammars and state machines are lovely, but the way this works is so brutally simple it should be clear to any student on the first day it's explained.

Benefits:

Stunning simplicity of implementation. Better scaling than state-machine or flow-of-control approaches. Takes advantage of bit operations to parallelize a lot of things that would otherwise require sequential or carefully state-dependent implementation. Likely to further parallelize more easily than existing approaches.

Costs:

One Hash table lookup per character read is definitely the most expensive thing I'm doing. Depending on the size of the code set in use and the size of the bitmap, the table can break L2 cache even on high-end desktop machines. A complex language having ~250 fixed tokens (using a 16-byte bitmap) and all ~169K unicode codepoints (excluding private use, excluding surrogates) would make the table about 2.6 megabytes plus overhead. Overhead is brutal for hash tables, so that's no smaller than 4.5 megabytes and means definitely a lot of L2 cache misses. An implementation excluding Chinese characters and using an 8-byte bitmask on the other hand could have a table under 0.5 megabytes plus overhead, which will fit handily into most L2 caches.

I'm using a logarithm operation to get the highest bit set. On reflex this seemed wrong because "transcendental function are SLOOOWW!" However on reflection I realized I had learned that a very long time ago and it's no longer true. Logarithms are implemented directly in hardware on modern CPUs and take barely longer than a multiplication.

By Ray Dillinger at 2025-12-24 06:29 | LtU Forum | 2 comments | other blogs | 642 reads

Compiling high-level code to cryptography

We showed in a forthcoming paper (CSF'24) that a compiler can translate high-level sequential code into a distributed concurrent program that uses cryptography to achieve security. All source-level security properties are preserved -- robust hyperproperty preservation. This requires a new kind of compiler correctness proof based on the kind of simulation used in Universal Composability.
See Acay, Gancher, Recto, Myers, "Secure Synthesis of Distributed Cryptographic Applications".

By andru at 2024-01-25 13:02 | LtU Forum | login or register to post comments | other blogs | 6861 reads

Haxl(-like "Monads") in terms of Delimited Continuations?

Haxl allows for IO operations to be batched by having Applicative operators that result in a "slightly unlawful" Monad instance -- rather than strictly having (<*>) = ap where ap f x = f >>= (<$> x), it's merely "morally equivalent," as it performs the IO operations in parallel rather than in serial as ap would do.

I'm wondering if it's possible to implement something similar in a language that implements effects in terms of delimited continuations rather than in terms of monads. It seems a lot less obvious how to do so, at least; when evaluating the arguments of an applicative-shaped function call (pure function, effectful arguments), the effect handler must process them one by one, and can't introspect on the continuation to see what future effects might be performed.

An unfortunate solution might be to have a domain-specific optimization pass, reliant on inlining and other optimizations, that notes an effect as having some "do these in parallel" operator and transforms a program that unconditionally performs multiple effects in sequence to a single effect that uses that operator. However, this is terribly unsatisfying, since it's a non-semantics-preserving optimization, rather than something that falls directly out of the user-written code.

By remexre at 2023-12-27 00:22 | LtU Forum | 1 comment | other blogs | 7342 reads

A case study of concatenative v.s. applicative syntax design

I implemented a language a while ago.

Based on some feedback, I change it's syntax from concatenative:

https://github.com/cicada-lang/inet-cute

- Stack-based like forth.
- Simple and shorter code.
- Easy to do refactoring.

to applicative:

https://github.com/cicada-lang/inet-js

- JavaScript-like syntax.
- Using pattern matching to define interaction rule.
- Can use currying, because we have explicit parentheses.

But I am still not sure which is better.

What do you think?

By xieyuheng at 2023-09-24 03:47 | LtU Forum | 6 comments | other blogs | 12336 reads

Using JavaScript-like syntax to program with Interaction Nets

website: https://inet.run

code: https://github.com/cicada-lang/inet-js

By xieyuheng at 2023-09-23 19:37 | LtU Forum | login or register to post comments | other blogs | 8205 reads

Sorting the travelling salesman problem

Hi There,

Perhaps someone here might be kind enough to check out a paper I have published at [1] and provide some feedback.

It is an analysis of the travelling salesman problem and the sorting problem, examining the topologies of their solution spaces.

1.- https://github.com/enriquepablo/article-tsp/blob/mirror/article.ipynb

Thank you.

By Enrique Perez Arnaud at 2023-09-05 16:33 | LtU Forum | login or register to post comments | other blogs | 8433 reads

Context Sensitivity and relational comparison operators

I'm elbows-deep in a language implementation, and since (for a new experience) it's not some kind of LISP, I'm dealing with infix operators and implicit operator precedence. I solved a little problem and thought you all might find my rambling thoughts about it interesting.

I noticed that treating relational comparison operators strictly as context-free binary operators matches the way they are in many programming languages but does not match the way they're used in mathematics (and some other programming languages).

When I write
A < B < C < D

assuming we have a context-free parser and left-associative relative operations, this is
(((A < B) < C) < D)
and we wind up comparing the result of a comparison. But this is not what a mathematician means when they write that expression. Standard infix math notation wants it interpreted as
(A < B) && (B < C) && (C < D).

Engineering considerations say we want that with single subexpression evaluation and short-circuit semantics.

If generating code for a C program, the relop doesn't check whether the left subtree is another relop; all it knows is that the left subtree returned #false (or #true), so it compares its right subtree to that value. And treats booleans as a subrange of numbers [0..1], so the comparisons have humorous results that require "inside" knowledge about the binary representation of booleans to comprehend. We get single subexpression evaluation but no short-circuit semantics.

In languages like Java, comparing booleans and numbers is a static type error, because static type errors are probably more useful than humorous results that require inside knowledge about binary representation of booleans to correctly interpret. No subexpression evaluation, no short circuit semantics.

If it's a logic-descended language, the comparison may return NIL or a non-NIL value, and the non-NIL value it returns is the value of the last subexpression evaluated, meaning the value of the right subtree. This treats numbers as a subrange of boolean values (all of them are #true). And any relop whose left subtree is NIL returns NIL without evaluating its right subtree. An analogue of the mathematician's interpretation of chained relational operations falls out "naturally" and if you never do numeric operations on a value you got in a context that encouraged you to think of it as a boolean, you will never notice the difference. You also get single subexpression evaluation, you get short circuit semantics - all seems good!

But it breaks static type checking. This means NIL is its own type, and relational operators have to check for it at runtime, and it can get returned to continuations that expect numbers. So now everything that uses booleans or numbers has to do runtime type checking. From a static type checking POV treating numbers as a subrange of booleans is even more expensive than treating booleans as a subrange of numbers.

When I'm traversing the abstract syntax tree after parsing, I can have different code templates to emit code for relational operations. I pick one depending on whether a subtree or the parent node is or is not another relop. So I get static type checking, I get the mathematician's interpretation of chained relational operators, I get the efficiency of caching a single evaluation instead of expanding a macro and evaluating something a second time.... all is good.

But now I broke referential transparency. Identical subexpressions encountered in different contexts mean different things. All the type checking can be done statically, but the relational ops other than the root of the chain are making a choice between jumping to their own continuation site in their parent relational op (to return a number) or to the boolean continuation at the call site of the whole chain (to return #false). Only the root of the chain is special; it has no numeric continuation, and can jump instead to the exit continuation to return #true. This is statically typed return polymorphism.

I was considering whether to be upset about breaking referential transparency, but then I realized people are using languages with exceptions all the time without complaining about it, so referential transparency can take a walk, I guess, if it gets us static type checking.

By Ray Dillinger at 2023-08-16 18:50 | LtU Forum | 8 comments | other blogs | 32645 reads

Typesetting a Type System with Color-Coding

I've recently been working on formalizing the type system for one of my programming languages. While writing out the rules for inference I found it helpful to use both traditional type syntactic inference as well as some categorical language. The back-and-forth is so common that I have considered color-coding terms to indicate when they correspond to an object or morphism in which category.

My question for LtU is whether color-coding seems goofy or maybe helpful in some contexts. I have considered alternatives and I think there are three options:

1) Color-code terms by Category
2) Add text describing which terms correspond to which Category
3) Alter the term/type syntax to make category an explicit syntactic property

By andrew johnson at 2023-08-05 16:32 | LtU Forum | 2 comments | other blogs | 2917 reads

The ALTernative programming language

I've just finished documentation on my new programming language ALT.

It took me a lot of attempts (+- 20 implementations in scala and typescript) to come up with the current semantics. I would be grateful for some feedback.

By Robbert van Dalen at 2023-07-30 18:39 | LtU Forum | login or register to post comments | other blogs | 2080 reads

PL Tea event today 26 July @ 14:00 New York time`

Info should be available at:
https://groups.google.com/g/pltea

By cpurdy at 2023-07-26 13:39 | LtU Forum | login or register to post comments | other blogs | 1709 reads

Lambda the Ultimate

User login

Navigation