Recursive Descent Parser Generators

I'm writing a parser generator. To avoid being TL;DR (which most of my messages are), I wanted to ask: does anyone know of other LL(*) parser generators? What should I aim for performance-wise (what's acceptable)? And if you've had experience writing something similar, what were the toughest nuts to crack with this style of parser?

For me (more than one thing):

  1. Generating parsers with unbounded look-ahead statically, so that none (or very little) of the prediction work is done at run-time, uses a lot of memory.
  2. Follow predictions are a PITA: you have to analyze all possible parse paths coming into the given rule, walk the context stack to disambiguate the offending look-ahead(s), and keep stepping ahead, accepting only if the parent contexts can parse and that state can also accept (this was a nightmare, but I think I have it...?). I'm writing the prediction discriminator logic now (my most recent task). Dangling else isn't an issue because, IIRC, it is aware that the parent context is on an edge state once the current rule exits, and I apply the maximal munch principle.
  3. Lexical Ambiguities: I'm a glutton for punishment; I used includes to allow complex grammars to span multiple files. To avoid issues with parse order and precedence on the user's side (you can still specify which trumps the other, but that trumping order is 'always on' if used), I allow lexical ambiguities to create new parse states, and by following those ambiguity identities you can figure out which path was actually intended.
  4. To solve the memory issue, I decided to use a symbol stack, an active rule context, recursive descent, and reductions (a minimal sketch of this general shape follows the list).
    • Reductions partly solve the memory issue by identifying where multiple paths of a prediction all enter the same rule in the same state, parsing that rule, and then evaluating what comes after it. This abates some of the look-ahead requirements, but adds more complexity when the follow state of that rule is ambiguous with the prediction! I haven't even gotten to this issue :(
  5. AST Generation: making this look easy is the hard part. It should just work (I will focus on refining it once everything else is done.)
  6. It's so easy to create grammars that are simple but yield an exponential state explosion. I need to do more research into detecting this ahead of time, so I don't waste time processing something useless and the user gets a better experience than an 'Out of Memory' exception.
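
To make the shape of this concrete, here's a minimal, hypothetical sketch (not OILexer's code) of recursive descent with a bounded look-ahead prediction step: two alternatives share the same first token, so the generator's job is to discover statically that the second token discriminates them. The grammar and all names below are invented for illustration.

# Hypothetical grammar for the sketch:
#   Stmt   ::= Assign | Call ;
#   Assign ::= ID '=' Expr ';' ;
#   Call   ::= ID '(' ')' ';' ;
#   Expr   ::= ID | NUM ;

class Parser:
    def __init__(self, tokens):
        self.toks = tokens          # list of (kind, text) pairs
        self.pos = 0

    def la(self, k):
        """Look ahead k tokens without consuming (k=0 is the current token)."""
        i = self.pos + k
        return self.toks[i][0] if i < len(self.toks) else 'EOF'

    def expect(self, kind):
        if self.la(0) != kind:
            raise SyntaxError(f"expected {kind}, got {self.la(0)} at {self.pos}")
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def parse_stmt(self):
        # Both alternatives start with ID, so LL(1) isn't enough; the second
        # token discriminates them (a 2-token prediction computed statically).
        if self.la(0) == 'ID' and self.la(1) == '=':
            return self.parse_assign()
        if self.la(0) == 'ID' and self.la(1) == '(':
            return self.parse_call()
        raise SyntaxError("no viable alternative for Stmt")

    def parse_assign(self):
        name = self.expect('ID')
        self.expect('=')
        if self.la(0) not in ('ID', 'NUM'):
            raise SyntaxError("expected ID or NUM in Expr")
        value = self.expect(self.la(0))
        self.expect(';')
        return ('assign', name[1], value[1])

    def parse_call(self):
        name = self.expect('ID')
        self.expect('(')
        self.expect(')')
        self.expect(';')
        return ('call', name[1])

print(Parser([('ID', 'x'), ('=', '='), ('NUM', '1'), (';', ';')]).parse_stmt())
# ('assign', 'x', '1')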

If this ends up working as intended, I'll be astonished, because when I started this eight years ago, I didn't have an effing clue. I still probably don't, but I at least know what I want to do now. The grammars define no semantic predicates, nor actions on lexical matches, so the generator has to do all the work. So far (without error context) it's 50-80 times faster than my own hand-written parser for its own language; at least it was, until I found that I wasn't handling lexical ambiguities on follow states, and now I have to fully implement them to do a proper comparison.

Edit:
The goal with this LL(*) approach is that very little back-tracking is employed. The only time it should need to back-track is when it's handling left-recursive rules and has no choice but to do so, because it has to fall back to a weaker prediction mechanism that yields a preliminary identity onto the stack, which it can then use to move forward; then rinse and repeat.

Edit 2: Redacting the following; it turns out the issue is more complex :) It had zero issues with the sample I gave below :(
'No Alt Indirect Left Recursion' is when you have something like:
A ::= B | C;
B ::= A '[' ']';
C ::= "Other";

Without code to properly identify these 'No Alt Indirect Left Recursion' states, it will recurse infinitely: there's no fall-back to choose first, so it would always predict 'B' when given: Other[].
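
For illustration, here's a small sketch (in Python, not the generator's output) of the usual iterative workaround: since a naive recursive-descent parser loops on this grammar by re-entering A before consuming anything, you parse the non-left-recursive alternative first and then absorb '[' ']' suffixes. This is the standard rewrite A ::= C ('[' ']')*, not necessarily how OILexer handles it.

#   A ::= B | C;   B ::= A '[' ']';   C ::= "Other";

def parse_A(tokens, pos=0):
    # Equivalent iterative form: A ::= C ('[' ']')*
    node, pos = parse_C(tokens, pos)
    while pos + 1 < len(tokens) and tokens[pos] == '[' and tokens[pos + 1] == ']':
        node = ('B', node, '[', ']')   # each suffix wraps the previous A
        pos += 2
    return node, pos

def parse_C(tokens, pos):
    if pos < len(tokens) and tokens[pos] == 'Other':
        return ('C', 'Other'), pos + 1
    raise SyntaxError(f"expected 'Other' at {pos}")

print(parse_A(['Other', '[', ']', '[', ']']))
# (('B', ('B', ('C', 'Other'), '[', ']'), '[', ']'), 5)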

You mentioned exponential state explosion

Maybe you should just fall back on a different kind of parsing whenever you see too many states. Backtrack. Computers are so fast and large now, I'm not sure why anyone needs the most optimized tables for parsing.

Are you hoping to parse weird pathological grammars that demonstrate how parsing can blow up, or are you aiming at parsing simple grammars like computer languages? If it's the latter, then you don't need anything amazing; it only has to work.

Also, unrestricted left recursion in LL grammars is such a hard thing that I'm not sure there are any programs out there that actually handle it. It will be cool if you can get that working.

Left Recursive Grammars

Yeah it's irksome.

I suspected it was indirect left recursion with no alternative except the root, because it seems to handle that fine, but something like a method call (PrimaryExpression '(' ... ')', where one of the indirect ancestors of PrimaryExpression is a method call) would blow up. Perhaps it's the depth of indirection. Normal mathematical operators work just fine. I had at one point thought of switching parsing mechanisms to handle left recursion, but I kind of want it to work as is.

I would post a sample of output but my internet is on the blink right now :|

Example

Well now that the connection issue is resolved, here's a (not so) small example of what I meant:
OilexerParser

That's the parser for the grammar description language itself. As a human I handled lexical ambiguities like "Terminal":NamedSally;Flag=true; by considering 'Flag' an identifier after lexing it and changing its identity after seeing the pattern 'Flag=true;'. Since I can't do that here (there are no semantic predicates), I have to rely on making a new identity that represents the dual identity of 'Flag' and 'Identifier'. Last night I finished making 'discriminator' methods (not shown) that walk the context tree to see which parse path we're on and use the appropriate discriminator method, since the identifier's meaning and what comes after it may change and alter the intent of the terminal. Because 'Flag' is also a valid identifier, it creates a follow ambiguity that needs to be resolved.
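
To illustrate the dual-identity idea in isolation (the names and the follow check below are hypothetical, not OILexer's actual discriminator methods): the lexer reports every identity a word could have, and a context-sensitive discriminator peeks at the follow tokens to settle on one.

# Hedged sketch; 'FlagKeyword', lex_word, and discriminate are invented names.
AMBIGUOUS = {'Flag': {'FlagKeyword', 'Identifier'}}

def lex_word(text):
    """Return the set of identities a word could have."""
    return AMBIGUOUS.get(text, {'Identifier'})

def discriminate(words, i):
    """Resolve the identity of words[i] from its follow context."""
    identities = lex_word(words[i])
    if len(identities) == 1:
        return next(iter(identities))
    # 'Flag' followed by '=' true/false is the keyword form;
    # anything else leaves it as a plain identifier.
    follow = words[i + 1:i + 4]
    if follow[:1] == ['='] and len(follow) >= 2 and follow[1].lower() in ('true', 'false'):
        return 'FlagKeyword'
    return 'Identifier'

words = ['"Terminal"', ':', 'NamedSally', ';', 'Flag', '=', 'true', ';']
print(discriminate(words, 4))               # FlagKeyword
print(discriminate(['Flag', ':', 'x'], 0))  # Identifier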

Did you consider that left recursion doesn't have to be direct?

I think there may be some top down parsers with special cases for direct left recursion, but the general problem with mutual recursion is harder.

There are also a bunch of academic papers on trying to fit left recursion into LL parsers... I remember one person/paper pointing out that a lot of the solutions are bad because they take exponential time. But it's hard to be sure if they're ACTUALLY bad because the paper used unrealistically pathological grammars as examples, things totally unrelated to grammars people would actually use.

But I've noticed at least one "polynomial time" solution that was itself unusably bad. Its theory was that no legitimate parse can recurse more times than the number of characters left in the source. You can see where this is going...

There was one mention of a polynomial-time solution from a paper in the 70's that relied on saving continuations... I suspect that's probably the best one. Papers back in the 70's tended to be written for programmers, not mathematicians, and my opinion was that they were more practical. However, there's a lot of confusion among less academic programmers about what a continuation even is (... I think there are various systems that have incompatible implementations of continuations, and that's not an idea that is even supposed to have more than one interpretation).

Indirect Left Recursion

Yes, there's a variety of left-recursion scenarios:
  • Direct Left Recursion
  • Indirect Left Recursion
  • Hidden Indirect Left Recursion (one of the rules enters and accepts with no input), leading into one of the calling rules
  • Lexical Hidden Indirect Left Recursion (one of the tokens enters and accepts with no input), leading into one of the calling rules.

Some left recursion is difficult because of the no-alternative path I mentioned. Right now, in those situations it fails to parse, not due to a stack overflow but because I employ (custom) stack sniffing to check if the rule's already been entered; if it has, it exits abruptly without doing anything. It's something I need to revisit.

For trivial cases (which is most of the time) it seems to get it right, if you have:

A ::= B | C;
B ::= A 'b';
C ::= 'c';

And your input string is:
cbb...bbb (total string length 36,289 characters, with ellipses representing the missing 36,283 'b' characters)

It seems to have no issues parsing this. It takes about 107 milliseconds (in Debug; 80 ms in Release), though, which isn't terribly quick.
(do note the ability to handle the above is recent, as I'm still actively developing this)

I don't think you should be concerned with how fast you parse

What are you planning to parse or do your parsing on where speed is an issue?

Frankly you could come up with an algorithm that's 50 times slower than an optimized parse and it would still be worth it if it had other advantages.

I feel that way about using logic languages to parse. They let you mix semantic processing in with parsing, with no separation between phases and no wall between decisions made for syntactic reasons and branches taken for semantic reasons - at the cost that any backtracking has to undo the effects of actions. Given that they tend to create parsers with no look-ahead, that's a possibly significant inefficiency.

But it makes writing parsers so much simpler that I consider it a win anyway.

Focus

Well,

The focus of OILexer is to avoid any and all semantic predicates. When I started this in 2007, I had much less experience and didn't have the know-how to add semantic actions (I probably couldn't have told you what one was). In retrospect I think it was the right choice, because it lends itself to a very clean grammar.

I'm not focused on speed to the point of insanity but I think it should still be something that I consider as I go. For instance, once OILexer parsed its own grammar, I wanted to compare it against its hand-written counterpart, and was surprised that it was from 20-80 times faster depending on the paths followed. This is probably the result of what I've learned over the years (and that the generated parser doesn't currently have error handling packed in.)

I also learned that there are odd quirks and inconsistencies in how the hand-written parser handles the grammar. In templates that describe areas to expand, conditional logic is handled through #if statements, with '#return' indicating that the syntax that follows belongs to the caller of the template (so the template reference would be replaced with it). In some areas the expression requires a semicolon to terminate the statement, and oddly, in others it's not allowed. I haven't started looking into why, mostly because I plan to focus on fleshing out the generated parsers so they can eventually replace the hand-written version.

Applicability & Layering

I totally agree that logic languages have massive potential for parsing, but don't be so quick to discard speed.

Two good reasons:

1) If something is faster, it's usable in more use cases and affords greater experimentation.

For example, an LL(1) grammar, such as those in many Lisps, can easily be used as a data interchange format. Used carefully, the top level of productions works for streaming too. Speed is critical here in a way that it's not for parsing code maxing out at a few million lines over a few tens of thousands of files.

2) Faster often (but not always) implies simpler in a way that correlates between man and machine.

For example, you could layer a language with three grammars: lexical (break words into tokens), balanced (match up parens, brackets), and operators (infix, prefix, associativity, precedence, etc). Then you can use three different algorithms to parse them: lexing with a finite-state machine, grouping with a push-down stack, and then Pratt parsing.
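
As a concrete illustration of the third layer, here is a minimal Pratt-parser sketch (invented for this comment, not taken from any of the systems discussed) that handles infix precedence and associativity over an already-tokenized, already-balanced input.

# Precedence and associativity tables are illustrative choices.
PRECEDENCE = {'+': 10, '-': 10, '*': 20, '/': 20, '^': 30}
RIGHT_ASSOC = {'^'}

def parse_expr(tokens, pos=0, min_bp=0):
    """Parse tokens[pos:] into a nested tuple, honoring precedence."""
    left, pos = tokens[pos], pos + 1          # assume atoms are single tokens
    while pos < len(tokens):
        op = tokens[pos]
        bp = PRECEDENCE.get(op)
        if bp is None or bp < min_bp:
            break
        next_min = bp if op in RIGHT_ASSOC else bp + 1
        right, pos = parse_expr(tokens, pos + 1, next_min)
        left = (op, left, right)
    return left, pos

print(parse_expr(['1', '+', '2', '*', '3'])[0])   # ('+', '1', ('*', '2', '3'))
print(parse_expr(['2', '^', '3', '^', '2'])[0])   # ('^', '2', ('^', '3', '2'))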

It's of course a tradeoff, since scannerless parsers can compose full grammars, but you don't necessarily want or need that level of flexibility. It can be confusing to humans, just as it can be inefficient for machines. Instead, operator grammars compose neatly, so you can have extensibility at that level.

This is just one point in the design space; there are many others. I guess my point is that I'd like to see logic programming find its way into executable grammar formalisms, but I don't necessarily think we need to go full scannerless or abandon layering.

For example, you could layer

For example, you could layer a language with three grammars: lexical (break words into tokens), balanced (match up parens, brackets), and operators (infix, prefix, associativity, precedence, etc)

This has been my approach for a few years now. I first started out with a lexer, brace balancer, and a precedence parser (my early SuperGlue demos work this way; I was also doing it in the Scala IDE), but have moved on to lexer, balancer, and recursive descent for my current work. The use of balancers is a no-brainer because they only require lexical information to work while dramatically improving the robustness of the parser, especially with respect to error recovery!

People talk a lot about speed and expressiveness, but in language design those aren't the most important parsing issues at all. The problem is that parsers tend to be used in other contexts where those are important, and the issues get all muddled together. I've just come to accept that there are really two topics to consider: the one I'm interested in, parsing for programming languages, and then just parsing in general, which I don't find very useful for the problems I encounter in my own work. Of course, LISP people don't see it that way, but then LISP is not designed to be very user friendly anyways.

Of course, LISP people don't

Of course, LISP people don't see it that way, but then LISP is not designed to be very user friendly anyways.

What do you figure is different about how LISP people see it?

They see syntax as something

They see syntax as something to be kept minimal, powerful, and flexible. They want homoiconicity and are willing to pay for it in usability; as a side effect their parsers can be very fast because they aren't doing very much.

Interesting. As a Lisp person,

Interesting. As a Lisp person, the point I'd quibble with... somewhat... is "willing to pay for it in usability" — although Lisp goes for homoiconicity first, that doesn't mean one can't recover usability later. (This is a nuance, really: the "pay for it" metaphor kind of comes across as more permanent.) Having got other things they want, many of them look for ways to make the syntax more ergonomic. They generally fail... but I do have some ideas about that. They quite commonly sink energy into trying to eliminate parentheses in small expressions, e.g. math expressions with operator precedence, whilst larger expressions still use parentheses. I think that's basically backwards. I fully parenthesize math expressions even in languages that don't require it, because I find it's the only way to be sure of getting that fiddly stuff right; whereas for multi-line constructs imho it's more effective to use other techniques: typically, indentation combined with some sort of start/end keywords (I'm not a fan of C's braces but they do somewhat serve the purpose). I picture something with start/end markers and indentation for large multi-line constructions; then for smaller (typically single-line) constructs there's a separate set of rules that calls for each compound expression to be parenthesized and to alternate between keywords and subexpressions; then it's just a matter of determining whether the keywords in a given compound expression are the even elements or the odd elements (I have in mind some simple rules to determine that within the first two tokens). Granted, fast parsing would be a side-effect of this sort of usability.

Everything needs to be built in

"... that doesn't mean one can't recover usability later."

They never, ever do.

For one thing, changes in basic syntax are a big change, and people want their code to be interoperable and understandable by other people who use the same system.

Maybe that's not the reason - maybe it's too involved a change, too hard to implement or implement well...

But I believe that you can't leave a single basic thing out of a language and expect people to roll their own - if programs are to be part of an ecosystem then "roll your own" means "outside the ecosystem".

I suspect interoperability

I suspect interoperability can be managed, in this case. It shouldn't be difficult to cope with two straightforward notations for a nested-list data structure.

My hope would be to provide one for all rather than having each roll their own. Avoiding the roll-your-own problem is kind of my point about making the syntactic structure extension-independent.

Usability for those using it

There's a curve of usability. I design Lisp-like syntaxes largely because I want the syntax to remain usable even under heavy doses of spec reimplementation (hence minimality). Metaprogramming often involves slight spec redesign, which necessitates reimplementation.

I see low value in homoiconicity and high value in infix operators, but I use prefix operators anyway because of how they interact with code indentation. As long as everything is in prefix order, an operator has good visibility by being less indented than all its arguments, so there are few reasons to ever indent by more than a few spaces. The less indenting I do, the more I can see on one line, and the more I can tolerate deeply nested code without breaking it apart into an artificially flat GOTO spaghetti.

Of course, indentation is a concern that's unique to editing large, nested programs in a plain text editor. For natural language documents with small inline math formulae, infix is still a nice way to go.

Indirect Left Recursion

I think I might have a solution that works for most cases.

Given that it's continuously parsing left-recursive rules, it makes an attempt; if that fails, it swaps the bad attempt on the stack with the last good attempt. I can detect recoverable bad parses by checking the swapped symbol's length against the one that was invalid on the stack. If the good symbol is smaller, I shouldn't need to perform any stack cleanup, because the predictions take care of detecting, deterministically, which path should have been valid.

In cases where it detects a no-alternative route which is left recursive, it first checks whether the current symbol on the stack is a wrapped-up form of the currently parsing rule. This resolves cases where A requires B, which requires A: once the outermost A has already parsed once, the inner A will borrow that context and return, allowing B to have a valid reference to find the symbols which should come next. A little state on the swaps ensures they don't try to repeat this process, and it seems to work.

This logic only shows up where that kind of scenario occurs and it only works if the borrowed context is wrapped in shallow 'one symbol' wrappers. So in the grammar parsing language the logic is entirely absent.
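
For comparison, here is a sketch of the 'seed growing' technique used by some packrat/PEG parsers (Warth et al.) for left recursion; it is related in spirit to borrowing the already-parsed outer result, but it is not the mechanism described above. The code is a simplified illustration for the A/B/C grammar from earlier (A ::= B | C; B ::= A '[' ']'; C ::= "Other").

def parse_A(toks, pos, memo=None):
    memo = {} if memo is None else memo
    key = ('A', pos)
    if key in memo:
        return memo[key]           # on left-recursive re-entry, return the seed
    memo[key] = (None, pos)        # initial seed: failure
    while True:
        result = parse_B(toks, pos, memo) or parse_C(toks, pos)
        best = memo[key]
        if result is None or (best[0] is not None and result[1] <= best[1]):
            break                  # stop once the parse no longer grows
        memo[key] = result         # grow the seed and retry
    return memo[key]

def parse_B(toks, pos, memo):
    inner = parse_A(toks, pos, memo)
    if inner[0] is None:
        return None
    p = inner[1]
    if toks[p:p + 2] == ['[', ']']:
        return (('B', inner[0]), p + 2)
    return None

def parse_C(toks, pos):
    if toks[pos:pos + 1] == ['Other']:
        return (('C',), pos + 1)
    return None

print(parse_A(['Other', '[', ']', '[', ']'], 0))
# (('B', ('B', ('C',))), 5)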

Breadth First Search

There is a parallel here with the search strategy in a Prolog program. In this case it is the depth-first nature of the search strategy that is causing the problem. You could use a breadth-first strategy, which involves looking one level down for a match before choosing an alternative. So given:

A ::= B | C

You try both B and C before deciding which branch to take. The important thing is to try only one rule without further chaining, so:

B ::= A '[' ']'

Gets deferred as it cannot return an answer until 'A' has been evaluated.

C ::= "Other"

Will succeed if the input matches "Other", otherwise we continue the chaining of the "B" rule.

Breadth first has the disadvantage of using more memory to store the intermediate results.

Iterative Deepening might be interesting to look at as well.
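
A rough sketch of the 'look one level down before committing' idea, using the A/B/C grammar: each alternative's first step is tried without further chaining, an alternative that would immediately re-enter the rule being expanded is deferred, and the non-recursive alternative gets first shot. The representation below is invented for illustration.

GRAMMAR = {
    'A': [['B'], ['C']],
    'B': [['A', '[', ']']],
    'C': [['Other']],
}

def first_step(symbol, tok, entering):
    """One-level check: can some alternative of `symbol` match `tok` with its
    first symbol, without re-entering a rule we are already expanding?"""
    results = []
    for alt in GRAMMAR[symbol]:
        head = alt[0]
        if head not in GRAMMAR:                 # terminal
            results.append('match' if head == tok else 'fail')
        elif head in entering:                  # would chain back into A
            results.append('defer')
        else:
            results.append(first_step(head, tok, entering | {head}))
    if 'match' in results:
        return 'match'
    return 'defer' if 'defer' in results else 'fail'

# Deciding which alternative of A to try first on input starting with "Other":
for alt in GRAMMAR['A']:
    print(alt, first_step(alt[0], 'Other', {'A'}))
# ['B'] defer     (B's first symbol is A again, so it is deferred)
# ['C'] match     (C matches "Other" directly, so we take C first)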

Similar

I make heavy use of deterministic finite automata in the prediction methods. What I basically do is (try to) detect cases where recursion is likely problematic and switch those portions of the automata off. So in the case of the recursive B, it would switch the detection of B off and use the 'A' symbol that results to do another prediction. Obviously it's a work in progress; sometimes it'll add the flag to turn off paths, but for some reason it isn't 100% yet (and I end up with a flag that's defined but never used).

ANTLR

ANTLR generates parsers for LL(*) grammars up to v3; the new one is referred to as Adaptive LL(*).

It looks like you're reinventing packrat parsing ...

... the hard way, "like cutting your throat with a butter knife". Just google it.

top down grammars don't compose that well

precedence relations compose effortlessly though.

I would like to see what could be done defining computer languages entirely out of precedence grammars. It should be easy, since Prolog was basically that, with the ability to add new operators on the fly.

Since they DO compose so well, I don't see why they haven't become the standard for defining programming languages.

After years of dabbling with

After years of dabbling with extensible syntax, I concluded that extensible syntax can only be a practical tool if a programmer can look at any code written in the language and know instantly how to parse it /without/ having to first know what syntax extensions are in effect. For single-line expressions, you pretty much /have/ to fully parenthesize; though you don't have to go full-Lisp: you can support a mixture of pre/in/postfix operators, for example by introducing a lexical convention to distinguish between atoms and operators.
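
A tiny sketch of the kind of lexical convention being suggested (the exact rule here, symbol runs versus alphanumerics, is just one possible choice): operators and atoms are distinguishable from spelling alone, so a reader can tokenize and group a mixed pre/in/postfix expression without knowing which extensions are in effect.

import re

# One token is an identifier, a number, a run of symbol characters, or a paren.
TOKEN = re.compile(r'\s*([A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_()]+|[()])')

def classify(tok):
    if tok in '()':
        return 'paren'
    if re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*|\d+', tok):
        return 'atom'
    return 'operator'

print([(t, classify(t)) for t in TOKEN.findall('(x + 1) * f !')])
# [('(', 'paren'), ('x', 'atom'), ('+', 'operator'), ('1', 'atom'),
#  (')', 'paren'), ('*', 'operator'), ('f', 'atom'), ('!', 'operator')]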

Commutative and Associative

I would go further and would expect operators to globally observe the expected properties, like commutativity and associativity, according to the common usage for operators like +, * etc.

Hm. I wouldn't bet on even

Hm. I wouldn't bet on even a given standard for floating-point arithmetic having those properties.

Floating point

Maybe floating point should not use these operators then? Knuth uses circled operators (TAoCP, vol. 2, sec. 4.2.1, pp. 214 - 223) because they are not exact.

Interesting thought; though,

Interesting thought; though, exactness is a separate concept. For example, quaternion multiplication, even when exact, is not commutative; and, to carry the point to its extreme, octonion multiplication isn't even associative.

I couldn't disagree more

Domain specific notations are useful.

Notice I said "notation" not "language". I'd like the ability to escape into and out of contexts that have different rules, different notations. I don't see why a program couldn't use as many notations as there are kinds of data or methods of programming...

Also, questions like "can you assume '+' is associative and commutative" could be context dependent: you could have a section of code bracketed by "assume the commutative, associative, reflexive and distributive laws for + and *".

I'm quite sympathetic to a

I'm quite sympathetic to a desire for domain-specific notation. My own conclusion, after pursuing it as a background interest for about twenty years, has been that it needs to be done in a way that permits extension-independent parsing. I suspect this can be done tolerably well in practice, but the only way to know for sure is to gain practical experience using such a system, and I have not yet had that opportunity. I've a distant memory that someone who had had more such opportunity had also come to this conclusion, that the parse needs to be extension-independent; perhaps I'm remembering imperfectly, but if pressed, the place to look for it would be in the proceedings of one of the two extensible-languages symposia from circa 1970.

Editor support

My preference is to have + as the keystroke that selects from one or more overloaded symbols which have different visual renderings in the editor and different logical guarantees. I think syntax extensions can be ambiguous/conflicting as long as such ambiguities are flagged at edit-time and the syntax has a way of preserving any ambiguity resolution made. Also, intended overload bindings should be preserved when ambiguities are introduced. In other words, if you import new syntax that would render some of the constructs in a scope ambiguous, then those constructs should remain bound to whatever they meant before the import (as a structural editor would behave).

editable views and extension-independent parsing

A while back we discussed issues of readable code and text vs. richer media, and Matt M suggested that a text encoding might be a skin (an editable view) for a more structured encoding. I have experimented with this idea, and I recently selected one variant for a simple Forth-like command language above my bytecode. TLDR: I persist in a bytecode format but can parse that back into the command language for rendering and editing. Edited code can be expanded back into bytecode. The parse could be performed by a pluggable editor, or by web service, or by a FUSE filesystem extension.

Extension-independent parsing emerges naturally, at least in the sense that we can process and share the bytecode without knowing syntax extensions. Syntactic abstraction is pushed into the editable view layer, and it's not a problem if some editors or views have different features than others. For example, one view might support rational numbers, such that `4/10` expands into (and parses from) `4 10 ratio`. If another view doesn't know about rational numbers, the user might still view and edit `4 10 ratio`. Extensions are feasible for vectors, matrices, monadic notations, conventional expression of math formula, etc..

For structure editors, editable widgets are also possible. For example, if we see `46 52 54 rgbcolor`, a view (for a compatible editor) could recognize this and render it as a box with a color in it that we can click upon to use a color editor. Sliders, checkboxes, dropdown lists, graphs, canvases, and more are viable. Though, with larger media (big graphs, for example), our ability to present a reasonable text experience will diminish (because size is a big factor on readability). This structure-via-parsing can replace much structure-via-staging that I was pursuing earlier. I believe the two approaches have different strengths and may fill different niches.

Hmm.

I have always supported having a grammar-free stage for all code.

I.e., you have the human-readable stage, but you also have an intermediate s-expression format so that code walkers don't have to know any syntax.

Now you're suggesting that the code should have code-to-code transformers and be stored in intermediate stages of less grammar? Or I suppose in multiple formats so that editors that know the full grammar can allow it to be edited.

:/ I don't see much use in having formats that know some grammar but not all. Parsing isn't that hard and specifying grammar extensions isn't any less vital than specifying type interfaces. No one suggests that we need to make it easy to edit code when you don't know what any of the libraries are, why should not knowing any of the syntax extensions be privileged?

[edit]
On second thought, if all custom syntax and macros can be stripped, then your code gets transformed to take out all of the syntactic abstraction: you've turned your compact, hopefully readable code into Java. Something that confuses the eye, but requires less knowledge of the project to read. I guess there's a use for that, but I would like to avoid working for any corporations who want programmers to work at that level of lacking abstraction. It's like limiting programmers to non-toxic crayons and toddler scissors because you assume that you hired idiots.

grammar-free stage

Expansions are frequently hierarchical. `6.02e23` might expand to `6.02 23 exp10` then to `602 2 decimal 23 exp10` which ultimately expands to bytecode `#602{%integer}#2{%integer}{%decimal}#23{%integer}{%exp10}` (which may further expand at link-time). I store this (more or less) grammar-free bytecode. Back-end processes (interpretation, typechecking, etc.) don't know anything about DSLs or syntax extensions. However, I do need a function that works backwards from the bytecode to the original number so the users can easily view and edit code as they originally wrote it. This role is fulfilled by a simple, unambiguous, context-free parser on the bytecode.

Use of S-expressions would not change much. You'll either have exponential notation as a primitive feature of your S-expr language or you'll expand it to something like `(exp10 (decimal 602 2) 23)`. And even those parentheses aren't a major distinguishing characteristic. For example, I might use {%list-begin} and {%list-end} for encoding a list, which gives me semantic parentheses, which might be rendered as parentheses in text.
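
A toy sketch of the expand/parse-back round trip for the number example above; the rewrite rules follow the `decimal`/`exp10` words from the comment, but the string encoding and helper names are invented for illustration and are not Awelon Bytecode.

import re

def expand(text):
    """Rewrite surface notation into a flatter 'word' form, hierarchically."""
    # 6.02e23 -> 6.02 23 exp10 -> 602 2 decimal 23 exp10
    text = re.sub(r'(\d+\.\d+)e(\d+)', r'\1 \2 exp10', text)
    text = re.sub(r'(\d+)\.(\d+)',
                  lambda m: f'{m.group(1)}{m.group(2)} {len(m.group(2))} decimal',
                  text)
    return text

def parse_back(text):
    """Invert the expansion so the user sees what they originally wrote."""
    text = re.sub(r'(\d+) (\d+) decimal',
                  lambda m: f'{m.group(1)[:-int(m.group(2))]}.{m.group(1)[-int(m.group(2)):]}',
                  text)
    text = re.sub(r'(\d+\.\d+) (\d+) exp10', r'\1e\2', text)
    return text

print(expand('6.02e23'))              # 602 2 decimal 23 exp10
print(parse_back(expand('6.02e23')))  # 6.02e23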

don't see much use in having formats that know some grammar but not all

If we have any heterogeneous or distributed development of extensions, this condition emerges naturally. E.g. if I did not provide a standard for rational numbers, then Alice and Bob might independently come up with `4/10` as meaning `4 10 ratio` and `10 4 rational`. It's important that Alice and Bob can share code without damaging semantics. It's acceptable if Alice sees `10 4 rational` when viewing Bob's code because, to Alice, `4/10` means `4 10 ratio`.

The same holds for date-times, color pickers, music notation, kripke diagrams, etc.. De-facto standards can unify common extensions, so everyone eventually agrees on how to present and view rational numbers and color pickers and so on. But there will always be utility for more domain-specific extensions.

No one suggests that we need to make it easy to edit code when you don't know what any of the libraries are, why should not knowing any of the syntax extensions be privileged?

I don't intend to suggest otherwise. If you don't know the syntax extension for decimal numbers, you simply get the expanded view, e.g. `120 2 decimal` instead of `1.20`. You could write `120 2 decimal` either way, and a good structure editor might enable you to expand or reduce code in place.

you've turned your compact, hopefully readable code into [..] something that confuses the eye, but requires less knowledge of the project to read.

In my case, I've turned it all into Awelon bytecode. Which is very confusing to the eye in quantities larger than about 50 characters. But outside of special circumstances you aren't even supposed to try reading it. If you want to edit the bytecode, you convert it back to regain the syntactic abstraction then edit code as you wrote it (modulo lost spacing and the occasional written-form `4 10 ratio` turning into `4/10` for editing). That's the nature of an 'editable view', cf. lenses.

Macros aren't a good fit here because you cannot reliably translate a macro expansion back to a useful macro expression. But we don't need the full power of macros. Syntactic abstraction is simply pushed into the editable view layer, which is flexible and extensible enough to meet domain specific needs. And a lot can be done with functions and compile-time evaluation.

That very low level seems unnecessary

Passing around bytecodes and disassembling them doesn't seem like something programmers need to do much. Most languages leave that level out.

Not necessary.

Most languages don't support domain specific notation. The entire subject is unnecessary. Real programmers use... :)

Passing low level, uniform bytecode around and disassembling as needed has advantages for simplicity and portability. S-expressions would have similar advantages for this role, but I favor bytecode for some other roles. Bytecode isn't the only option, but it's the best option I've found for my requirements and desiderata. And disassembly can be cheap, fast, reliable, and even implicit if you've designed for it.

Too many bytecode formats are not meant for interchange

Even when bytecodes of some sort are available, a lot of implementers don't actually design them to be portable, they're just an intermediate format and they end up being closer to the compiler and the target machine than you want for abstract work.

For instance, they usually use the byte-ordering of the underlying machine. And as compilers do more optimization, they may add more optimized bytecodes or change the old ones.

If you want a programming

If you want a programming language that's good for human use, the only way to get it, in the long run, is for the ground representation of programs to be human-oriented. Otherwise, you'd need software to perform transformations on the ground representation to support the language's human interface; and so the human interface would be a computer's transformation of a computer-oriented representation. (I'm reminded of Isaac Asimov's Caves of Steel, where it's suggested that if you have robots terraform a planet for you, they'll create an environment that's good for themselves; so if you want that environment to also be good for humans, you need the robots to be humaniform.)

I generally interpret this to mean that the ground representation should be text, which is, after all, essentially the static representation that has evolved over thousands of years particularly for human use.

re: human-oriented

I'm not convinced of your "only way" assertion. Most things humans use lack human-oriented ground representations (quantum physics and chemistry don't seem to be human-oriented by any reasonable interpretation). And even manipulating plain text with a computer involves many software transformations after you account for everything from filesystem blocks to rendering in a video display.

But, what exactly do you mean for programs to be 'human-oriented'?

Awelon project has a vision and a few principles that guide its design in the interest of unifying PX and UX. These include:

  • long-lived behaviors have personal, human-meaningful representations in the environment
  • modifying this representation has a corresponding impact on the long lived behavior
  • human actions (gestures, etc.) are modeled as streaming code that manipulates the environment
  • failure is ideally very coarse-grained; partial failure has too many possibilities to comprehend

These UX principles shape a lot of my PL design and the libraries and concepts I've been building around it. This includes my selection of a streamable bytecode, rather than text, as the ground representation for software, human actions, and objects in the environment.

It seems to me there are plenty of ways to be 'human-oriented' by most reasonable interpretations without mandating text as a ground representation. Indeed, ANY design aimed toward UX is arguably human-oriented.

In the linked post, early on

In the linked post, early on I see a list of four things, claiming to elaborate the idea you're excited about, and my best guess as to what these four things mean suggests it's pretty similar to the basic premise of my theory of abstraction. Except that seems unlikely, so I figure maybe I need to reserve judgement on what that part means, and I read on, hoping for clarification. But things just keep getting more unclear to me; and then I hit the following sentence:

A weakness of tacit concatenative programming is that, in a traditional text-based programming environment, users must visualize the environment (stack or other structure) in their head, and that they must memorize a bunch of arcane 'stack shuffling' words. By comparison, variable names in text are easy to visualize and review.

I have no working theory as to what that sentence means. :-(

Try programming in Forth.

Try programming in Forth. You will find yourself visualizing the stack, in your head - e.g. to know which argument is on top, which is in the second slot, and so on. And you will memorize a bunch of arcane 'stack shuffling' words - e.g. swap, over, pick, tuck. Do the concrete examples help?

Substitution-based programming models (such as lambda calculus) use symbols like `x` and `y` to carry information from where they are assigned to where they are used. This provides simple visualization of the dataflow, and is a feature that Forth and other tacit languages lack.

Obviously, it would be useful to gain similar visualization from Forth, to reduce the burden on the brain. But, since it's not part of the language, it would need to be part of the programming environment.
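
A small illustration of the point (Python emulating a Forth-like stack, purely for comparison): the tacit version makes the reader track stack positions and shuffle words mentally, while the named version carries the same dataflow in its identifiers.

def tacit_sum_of_squares(stack):
    # Forth-ish: dup * swap dup * +   ( a b -- a*a+b*b )
    stack.append(stack[-1])                          # dup
    stack.append(stack.pop() * stack.pop())          # *
    stack[-1], stack[-2] = stack[-2], stack[-1]      # swap
    stack.append(stack[-1])                          # dup
    stack.append(stack.pop() * stack.pop())          # *
    stack.append(stack.pop() + stack.pop())          # +
    return stack

def named_sum_of_squares(a, b):
    return a * a + b * b

print(tacit_sum_of_squares([3, 4]))   # [25]
print(named_sum_of_squares(3, 4))     # 25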

Okay, now I have a working

Okay, now I have a working theory; if I'm right, the passage would have made sense to me with "and that" replaced by "whereas in a tacit concatenative language", producing

A weakness of tacit concatenative programming is that, in a traditional text-based programming environment, users must visualize the environment (stack or other structure) in their head, whereas in a tacit concatenative language they must memorize a bunch of arcane 'stack shuffling' words. By comparison, variable names in text are easy to visualize and review.

I would then pretty much agree with the passage.

You've inserted a "whereas"

You've inserted a "whereas" in between two clauses that I'm pretty sure describe the same situation. Try this paraphrase instead (which also incorporates David's next paragraph):

Admittedly, in a traditional text editor, variable names have an advantage over tacit concatenative programming: Variable names are easy to visualize, whereas tacit concatenative programming requires the programmer to visualize the environment in their head and memorize arcane "stack shuffling" words.

My answer: Use something other than a text editor!

This is silly silly stuff

Are you going to model each function as some sort of fluidic black box with hoses coming in and out of it?

Maybe you should plug hoses into boxes in real life to represent every function. Your program could be a tangle of toys with hoses running all over your office.

:/ why not name variables so that you can keep track of them and use text because it works?

With a broad caveat about

With a broad caveat about aspiring to critique non-dismissively — yes, your point about naming things is kind of what I was struggling to articulate. Naming things works because it's what humans are good at: language. Symbolic language isn't just an arbitrary way of representing information, it's a big part of the shape of complex human thought. Seems to me the claim that "a picture is worth a thousand words", besides being exaggerated, is only true in specialized situations; it doesn't follow that any arbitrary set of words can profitably be replaced by a picture. For general purposes, imho the most efficacious representation of information for humans can't help being symbolic language.

Concatenative languages

I actually think the idea of making functions composable is interesting and powerful. It may be the wrong answer for a lot of problems, but it's still potentially powerful and interesting.

Trying to fix its problems with a graphical interface sounds like a bad idea though. Graphical interfaces seem to me to lower the level of abstraction and be less powerful, and harder to write and edit too.

visualization =/= visual programming

We can add views of stack, inferred types, running tests, etc. while still primarily editing through text. This requires something other than a text editor, but only for presentation purposes. By automating visualization, we reduce burden on human users to visualize a stack or other environment in their heads, and thus mitigate a weakness of tacit programming.

Visual programming is something entirely distinct, the editing of a visual representation as the medium for modifying program behavior. Conventional visual programming with boxes and wires tends to be limited because wiring is difficult to abstract. However, visual programming can achieve high levels of abstraction in the form of visual DSLs. Visual programming is mostly useful for problem-specific use cases anyway and so is a good fit for DSLs.

I believe there are roles for both. But visualization, not visual programming, was my focus in the block quotes above.

perspective?

I recall several years ago you were talking about "crutches"; the concept, I was under the impression, was of techniques that address the symptom of a problem rather than addressing the problem itself. A concern here, I think, is that visualization may be in this case acting as a crutch; it's not that concatenative languages aren't of interest nor that visualization isn't of interest.

re: crutches

In medicine, a crutch is something that supports you but also ties up a hand and hinders you from running. In PL design, a crutch is a feature that may help for some use cases but ultimately interferes with productivity.

So, addressing a symptom isn't inherently a crutch. But some means of addressing a symptom might be crutches.

Profiling code to find hot spots, static analysis and fuzz testing to find bugs, automatic visualization to help comprehension and maintenance, and many other technologies can greatly augment productivity without holding us back. Of course, I'm sure there are bad ways to do anything.

Not human oriented

I know what concatenative programming is, and one thing it's not is human oriented. Stack machines are a convenient low level device, they're great for vms. But try to change a Forth program once it's written... It's horrible beyond horrible.

My experience with Lua has convinced me that having a variable number of return values (not even predetermined per function like Scheme) is quite handy, but in that case, only the return values are unnamed.

Having unnamed arguments doesn't so much allow you to compose functions as give one arbitrary composition priority over others.

Actually I rather like the idea of programming where each argument has to be/can be named on both the caller side and the receiving side. That way function calls are self documenting and you don't have to remember argument order.

The idea of a visual representation is pretty much the idea that arguments have names/labels, it's just that those labels are kept separate from the text of the code. See the scatter/gather thread.

stack =/= concatenative

Don't conflate concatenative programming with stack machines. A lot of higher level structure becomes available by simply using a more flexible environment than a stack of numbers. And concatenative PLs don't need to be low level; they can be higher order, purely functional, or even constraint/logic declarative. Forth is a fine and widely known example of a concatenative PL, but you can't generalize much from a single example.

The idea of a visual representation is pretty much the idea that arguments have names/labels, it's just that those labels are kept separate from the text of the code.

I don't see how this is the case, at least not in general. Perhaps you're envisioning some particular representation that has this property?

re: too many bytecodes

I'm not using a randomly selected bytecode format. I'm using my bytecode for Awelon project, Awelon Bytecode (ABC). It's high level as bytecodes go: simple, purely functional, easily secured, composable, decomposable, streamable, and very much intended for interchange. It's even human comprehensible in small doses. And byte-ordering is a non-issue. ABC is a subset of UTF-8 text, and there are only 42 primitives which fit easily into the ASCII subset of UTF-8.

Your system sounds neat

It's acceptable if Alice sees `10 4 rational` when viewing Bob's code because, to Alice, `4/10` means `4 10 ratio`.

But presumably Bob would distribute his syntax and notation together with his definition of rational. So you could also render this as '4/10' for Alice, but maybe stylize the slash in some way or add a subscript to distinguish it from her slash. One thing the system can help you with is making sure that distinct syntaxes are visually distinguishable.

color-forth

Yes. If Bob's syntax is provided as a module, he could easily share it. And Alice could perhaps compose the notations using colors or styles or subscripts or similar to distinguish them. However, I haven't explored this very much. My goal with my command language was efficient writing of code in a command line or REPL context, so the ability to display multiple ambiguous variants of a syntax extension hasn't been a priority.

I'm inclined to favor a technologically simple option: turn extension proposals into a community discussion, standardize popular extensions, rewrite code as needed, etc..

Widgets in code

"For structure editors, editable widgets are also possible."

I want to write a language that MANDATES that dialog boxes for choosing between more complex options in user interfaces are a standard part of the language. I.e., it mandates a standard format for widgets that a conforming editor has to support.

A pure text editor should still work.

Modifiers

Or you could have consistent modifiers, for example + could be normal commutative and associative addition, and +' could be non-associative addition.

Woo~

In semi-related news, I think it just parsed its own grammar description files using the grammar to describe them.

Now comes the fun process of verifying the results, and fixing nagging issues like odd grammars that are simpler but fail anyway due to how they're written.

I've got a lot more work to do on this, but this still manages to put me in a good mood.

My next milestone is getting it capable of parsing a real language, like C#.

That will be a bigger challenge, me thinks, but that's part of the fun. :)