Katahdin: Modifying your programming language as it runs

Katahdin is a programming language where you can define new language constructs such as expressions and statements as easily as new types or functions. For example, you could define a new operator, a new type of loop, implement a syntax from another language that you like. After defining a new construct you can use it on the next line in the same file, so there is no need to recompile each time you want to add a new construct. Katahdin is powerful enough that you can define an entire existing language, or design a new language from scratch, making Katahdin a universal language interpreter.

For example, most programming languages have a modulo, or remainer operator. To define one from scratch in Katahdin it is only a few lines:

class ModExpression : Expression {
  pattern {
    option leftRecursive;
    a:Expression "%" b:Expression

  method Get() {
    a = this.a.Get...();
    b = this.a.Get...();
    return a - (b * (a / b));

Katahdin is an application of parsing expression grammars and packrat parsing, an active area of research. The site contains a master's thesis and public domain source code.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

A suggestion

Thanks for the contribution, this is a very cool and promising looking language!

Might I suggest using a slightly more sophisticated example than defining an operator to showcase your language?

What you have accomplished is very impressive, but the example you use is underwhelming. I could show modulo operator definition in Cat:

  define % { dup2 div mul sub }

And say "look it is just one line", but that is not a fair comparison because what you are doing (modifying the language grammar on the fly) is fundamentally more important. Your canonical example should showcase that.

Just my two cents. I look forward to reading more about Katahdin.


You really should mention DSL, because if it's one of the most practical uses for your language -- define and use DSL in the same module, at runtime.

Congratulations on a very cool project! I attempted to accomplish something similar a number of times, but always lacked the time (and probably expertise) to finish it, so I'm happy to see this done by someone. Cheers!

re: DSL

One thing that intrigues me about using Katahdin for DSLs is the possibility of writing code in your DSL as a Katadhin program, then adding in descriptions of the language "above" the DSL piece-by-piece.

You'd end up with the code "reading" from the bottom: the high-level logic described by the DSL, then the somewhat-lower-level constructs (main entities) above it, and the even-lower-level support (operators, syntax) above that.

I wonder what assistance Katadhin can actively provide during the process of writing that inverted pyramid? It'd be interesting if the Katadhin debugger could a) be extended to support the generated parser in a Katadhin program and b) could be made interactive with the development of the program itself.

Also, I'd be interested in the possibility of "exporting" the parser at a particular stage of a program (like freezing an environment), in a form that could be quickly bootstrapped into another copy of the interpreter. That would help avoid the overhead of re-interpreting the parser code for a given module. Bonus for merging two exported parsers into a single one!

Limiting the scope of DSLs

I'd like to see something similar to namespaces for limiting the scope where the parser extensions are active. What I was trying to accomplish with my attempts of implementing a reflexive parser / compiler is something like this (assuming host language is based on Python):

from languages import sql, cpp

get_items = sql.lambda parent: SELECT value FROM items WHERE parent=$parent

    int sum(int a, int b){ return(a+b) };

reduce(sum, get_items(a))

Obviously there's a gap in typing, so this example wouldn't work, but it should give you an idea, how different languages could be used together but still be somewhat isolated.

That puts me in mind of

That puts me in mind of Microsoft's DLR, although that is of course a very different kind of system. The DLR allows this kind of mixing at the member level (that is, an object can be defined in one language, have members defined on it in other languages, and be instantiated from yet a different language).

Now I'm thinking about the possibility of a DLR client language that acts as a meta-script to other DLR-based languages, taking marked code blocks and passing them out to other languages as needed.


DLR talks about mixing scripting languages, but it's really only a layer for sharing data and code. Each language still has a separate compiler, developed using traditional techniques. The languages cannot be modifed by the user.

Katahdin does that

Katahdin does that with language statements. With the Python language module you can either write a program entirely in Python, or you can write just a part in Python, just as you describe. This program is written partly in FORTRAN and partly in Python:

import "fortran.kat";
import "python.kat";

fortran {

python {
  seed = 128
  randx = 0
  for n in range(5):
    RANDOM(seed, randx)
    print randx

Simpler languages like SQL are more seamlessly implemented. For example, with the SQLite module you can connect to a SQLite database and then query using SQL inline:

database = new Mono.Data.SqliteClient.SqliteConnection(...);
films = database? select director from films where title = "Indiana Jones and the Last Crusade";

This is all code that Katahdin can run today! See my thesis, pages 9 and 45 for these examples.

I'll read the thesis, but here's one question

I can see that the host language determines where the block of the other language ends, so it's not a full context switch, but rather a cooperation. I wonder if it is capable of nesting imported languages, for example if you were to embed some fortran code into a python function, would the boundaries of it be determined by indentation, like in python?

Pass-by-name is not Python

In this example, the FORTRAN code modifies the seed variable passed to it. This is normal in FORTRAN, but impossible in Python. This means that the language in the "python {...}" block is not Python.

This is important. In Python, if you pass a variable to a function, you know that its value won't change. (If it's an object reference, the contents of that object may change; but the variable won't refer to a different object.) Is it possible for Katahdin's Python support to make that guarantee?

Pass-by-name is not Python

It's true that python passes arguments by value, but this does not mean that a function can't affect the namespace of it's caller. Python's execution model includes the ability for code to gain access to the stack, traverse it, and modify the objects stored along it, including code and namespaces.

In particular, "if you pass a variable to a function, you *probably* know that its value won't change." There's plenty of Python code in the wild that's happy to (ab)use these features. This is one of the penalties paid for being dynamic and expressive.

I think it's impressive that passing arguments across the two, very different languages "just works" in a way that is fairly intuitive. If crossing that interface required special handling, the utility of mixing them so freely would be lost.

It's the same language.

It's the same language.

Didn't know that

Python's execution model includes the ability for code to gain access to the stack

Ugh. I like Python, and most of its metaprogramming is useful; but this just pegged my bogometer. It's reasonable to define f() such that f(x) can modify x; it is not reasonable for f(x) to modify y. That just breaks any chance of real modularity.

I don't get how...

...one can use something before defining it. By definition, since a construct does not yet exist in a program, there is no code using that construct. Therefore, how is it possible not to stop a program and define a new construct and use it at the same time?

No need to stop the parser

Obviously you can't use something before it is defined. New constructs must be defined in the program source code before they can be used, but it can be just the previous line in the same file if you want.

Katahdin lexes, parses and executes in one pass. When you define a new construct, it is added to the grammar. The parser continues from the end of the definition, now using the modified grammar, and is able to recognise the new construct immediately.

Katahdin doesn't use a parser generator like most programming languages. The parser is just-in-time compiled instead, so we are free to recompile the parser whenever we want, including as it is running, just as Java is able to recompile (re-jit) a running program.

Multi-pass parsing?

Obviously you can't use something before it is defined.

What about multi-pass parsing?


My immediate impresiion is that one cannot define a statically (or even just strongly) typed PL in Katahdin, so "universal language interpreter" should be qualified somewhat (universal typeless language interpreter?).
On the other hand, strong typing in interpreters may be a moot point...

An unrelated observation is that the shorter the name of the PL (phonetically), the more popular it becomes (compare C/Lisp to Oberon/Modula, but even the latter two try not to put two consonants together) :)

Typed languages

Not at all! I can define a method that is executed for each language construct before the program runs to check types. For example, for the add operator this method would check that the two operands had compatible types and return the type. For variable references the method would look up the declared type in a table. For simple integer or string expressions, the relevant types would be returned. Those methods would be recursively called before the program executes to verify the static typing. The program may be executed by the dynamically typed interpreter, but that wouldn't matter because the types would be verified statically. It would be entirely possibly to implement C-style static types.

Some more datapoints

There are a few interpreted languages out there that allow you to access the remainder of the program as a text stream and parse it however you want. Some that come to my mind are Forth, TeX (maybe also MetaFont and MetaPost) and Postscript. This looks a little more structured however, nice.

I believe REBOL has a

I believe REBOL has a similar ability as well, but I may be mistaken - it has been a long time since I've looked at it.