minor lexical tokenization idea via character synonyms

A few days ago I had a tokenization-stage lexical idea that seems not to suck after going over it several times. On the off chance at least one person has similar taste, I can kill a few minutes and describe it. Generally I don't care much about syntax, so this sort of thing seems irrelevant to me most of the time; I'll try to phrase it for folks of similar disposition (those who don't care at all about nuances in trivial syntax differences).

The idea applies most to a language like Lisp that uses tokens like parens with very high frequency. I'm okay with lots of parens, but it drives a lot of people crazy, apparently crossing some threshold past which the code reads as gibberish. Extreme uniformity of syntax loses an opportunity to imply useful semantics with varied appearance: some of the latent utility in the human visual system is wasted by making everything look the same. A few dialects of Lisp let you substitute different characters, which are still understood as meaning the same thing but let you show a bit more organization. I was thinking about taking this further, where a lot of characters might act as substitutes that imply different things.

The sorts of things you might want to imply include:

  • immutable data
  • critical sections
  • delayed evaluation
  • symbol binding
  • semantic or module domain

A person familiar with both language and codebase might read detail into code that isn't obvious to others; you might want to imply such extra detail by how things look. Actually checking that those things are true by code analysis would be an added bonus.

I'm happy using ascii alone for code, and I don't care about utf8, but it would not hurt to include a broader range of characters, especially if you planned on using a browser as a principal means of viewing code in some context. When code is seen in an ascii-only editor, it would be good enough to use character entities when you wanted to preserve all detail. It occurred to me that a lexical scan would have little trouble consuming both character entities and utf8 without getting confused or slowing down much unless they're used very heavily. You'd be able to say "when you see this, it's basically the same as a left paren" but with a bit of extra associated state to indicate the class of this alternate appearance. (Then later you might render that class in different ways, depending on where code will be seen or stored.)
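
To make that concrete, here is a minimal sketch in Python, with the glyphs, entity names, and style labels all invented for illustration: several surface spellings, whether an ascii paren, a utf8 glyph, or a character entity, lex to the same canonical token kind, and the implied class rides along as extra state.

    # Sketch only: a lexer table where several surface spellings (ascii paren,
    # utf8 glyph, or character entity) all mean the same canonical token kind,
    # with a "style" field recording which class of synonym was used.
    from collections import namedtuple

    Token = namedtuple("Token", ["kind", "style", "text"])

    SYNONYMS = {
        "(":       ("LPAREN", "plain"),
        ")":       ("RPAREN", "plain"),
        "\u27e6":  ("LPAREN", "immutable"),  # the glyph itself, utf8-encoded
        "\u27e7":  ("RPAREN", "immutable"),
        "&lobrk;": ("LPAREN", "immutable"),  # entity spelling of the same glyph
        "&robrk;": ("RPAREN", "immutable"),
        "\u2985":  ("LPAREN", "delayed"),    # purely illustrative assignments
        "\u2986":  ("RPAREN", "delayed"),
    }

    def lex(text):
        """Yield tokens, treating entities and utf8 glyphs as paren synonyms."""
        i = 0
        while i < len(text):
            c = text[i]
            if c == "&":                       # maybe a character entity
                end = text.find(";", i)
                entity = text[i:end + 1] if end != -1 else ""
                if entity in SYNONYMS:
                    kind, style = SYNONYMS[entity]
                    yield Token(kind, style, entity)
                    i = end + 1
                    continue
            if c in SYNONYMS:                  # single-character synonym
                kind, style = SYNONYMS[c]
                yield Token(kind, style, c)
            elif not c.isspace():
                yield Token("ATOM", "plain", c)  # toy fallback: one char per atom
            i += 1

    for tok in lex("(f \u27e6x y\u27e7) &lobrk;z&robrk;"):
        print(tok)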

A diehard fan of old school syntax would be able to see all the variants as instances of the one-size-fits-all character assignments. But newbies would see more structure implied by use of varying lexical syntax. It seems easy to do without making code complex or slow, if you approach it at the level of tokenization, at the cost of slightly more lookahead in spots. As a side benefit, if you didn't want to support Unicode, you'd have a preferred way of forcing everything into char entity encoding when you wanted to look at plain text.

Note I think this is only slightly interesting. I apologize for not wanting to discuss character set nuances in any detail. Only the lossless conversion to and from alternatives with different benefits in varying contexts is interesting to me, not the specific details. The idea of having more things to pattern match visually was the appealing part.

In lisp terms

it seems that what you're describing is effectively block-level reader macros.

The problem is always the keyboard

The barrier nowadays to using non-ASCII characters in programming languages and their relatives is the inability of the dominant US keyboard (which is used in many countries besides the US) to make them easy to enter. Displaying the Unicode character repertoire is no problem, but it can't be displayed until it's been entered.

(Plug: I've written and published the Moby Latin keyboard for Windows, which is drop-in compatible with the US keyboard except for AltGr (the right Alt key), which allows access to almost a thousand additional characters using mnemonic constructions. There is also a set of UK variants collectively known as Whacking Latin.)

Just use ligatures, it isn't

Just use ligatures; it isn't hard, and the mapping from keyboard input to character is extremely direct.

You can enhance

You can enhance code rendering pretty easily with some syntactic analysis. You don't have to limit yourself to an old-fashioned teletype terminal a la emacs or vim. I've been doing this in my last few language projects, e.g. http://research.microsoft.com/en-us/projects/liveprogramming/typography.aspx and http://research.microsoft.com/en-us/um/people/smcdirm/apx/index.html.

Colour works well for separation

Prolog can have the same issue with parens that Lisp does: too much use clouds meaning. I found that colour coding works well, and it can be handled by an editor's syntax highlighting. My .vim syntax file gave different matching pairs of brackets different colours. It's a similar idea (visually separating repetitions of the same glyph), but it was handled automatically in the editor. The colour coding was arbitrary, so it did not attach any extra semantic info. The glyph-mapping that you are describing could be somewhat fragile: it is not checked anywhere, so it is up to the user to keep it in sync with the meaning of the code.
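
Roughly the idea, in a plain-text analogue rather than the actual .vim syntax file: colour each paren by nesting depth with ANSI escapes, so matching pairs share a colour while adjacent levels look different.

    # Rough analogue of depth-coloured brackets using ANSI escapes: each paren
    # is coloured by nesting depth, so matching pairs share a colour.
    COLORS = ["\x1b[31m", "\x1b[32m", "\x1b[33m", "\x1b[34m", "\x1b[35m"]
    RESET = "\x1b[0m"

    def rainbow(source):
        out, depth = [], 0
        for c in source:
            if c == "(":
                out.append(COLORS[depth % len(COLORS)] + c + RESET)
                depth += 1
            elif c == ")":
                depth = max(depth - 1, 0)       # colour the closer like its opener
                out.append(COLORS[depth % len(COLORS)] + c + RESET)
            else:
                out.append(c)
        return "".join(out)

    print(rainbow("(define (f x) (if (null? x) 0 (+ 1 (f (cdr x)))))"))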

re: color coding

Color coding is also a good idea, somewhat orthogonal to varying the characters that fill a role. I don't actually have much to say here, except to expand redundantly, repeating points in different ways. I could use tiny dialogs like this one as motivation:

Stu: You plan to support Unicode, of course.
Wil: No way in hell.
Ned: Unpopular attitude.
Stu: What a jerk! I'm not following you any more.
Wil: And how would I notice that?

Suppose several languages I want to use together work fine in ascii: I could ignore utf8 and shrug insolently in response to complaints. Or, if I think of an excuse to want more characters, I have an incentive to permit more than one way to encode characters, at trivial extra cost (perhaps none when only ascii appears). If an octet has its high bit set, I can parse a utf8 character; or if I see an ampersand, I can check whether it is followed directly by a char entity name and a trailing semicolon. (I don't use existing tools to parse.) Then folks can put whatever they want in string constants; further, a lot of browser text markup becomes valid code input when pasted directly into a code editor.
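
Concretely, the branch looks something like this sketch (the helper and the entity table are just assumptions for illustration): pure-ascii input never takes either branch, a high-bit lead octet starts a utf8 sequence, and an ampersand is tried as a character entity ending in a semicolon.

    # Byte-level sketch of the two branches described above.
    ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "lobrk": "\u27e6", "robrk": "\u27e7"}

    def next_char(buf, i):
        """Return (decoded character, next index) for the octets of buf at i."""
        b = buf[i]
        if b < 0x80:                               # plain ascii octet
            if b == 0x26:                          # '&' might start an entity
                end = buf.find(b";", i + 1, i + 32)
                if end != -1:
                    name = buf[i + 1:end].decode("ascii", "replace")
                    if name in ENTITIES:
                        return ENTITIES[name], end + 1
            return chr(b), i + 1
        # High bit set: the sequence length is encoded in the lead octet.
        length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        return buf[i:i + length].decode("utf-8", "replace"), i + length

    buf = "(f &lobrk;x\u27e7)".encode("utf-8")
    i, chars = 0, []
    while i < len(buf):
        c, i = next_char(buf, i)
        chars.append(c)
    print("".join(chars))    # entity and raw utf8 both come out as the glyph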

When the syntax is Lisp, I'd like more delimiters, so scratching that itch is an excuse to deal correctly with utf8 and to accept char entities too. A large increase in usable operators would be nice as well. Coloring at presentation time would also improve code legibility. But I spend a fair amount of time cutting and pasting code into different places, and most destinations I use won't syntax color, so it's a waste of time to do it by hand. (Typically I go grayscale in email, darkening important things and lightening trivial detail like comments.) More structure hinting seems better. I like working with ASTs, and a tree of pairs is fine, but showing different kinds of lists using different delimiters would improve readability for me. I can just as easily ignore the difference between delimiters and see them all as parens (or quotes in the case of strings).

The glyph-mapping that you are describing could be somewhat fragile: it is not checked anywhere, so it is up to the user to keep it in sync with the meaning of the code.

That's why I mentioned checking that they're true in code analysis. For example, suppose the use of one open delimiter implies a special form is inside. Not only must the matching close delimiter appear as expected, you also generate a warning if what's inside is not a special form.
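
A sketch of that check, assuming a representation where each list node carries the style class of its delimiters:

    # If a list was opened with the "special" delimiter class, warn unless its
    # head really is a known special form.
    SPECIAL_FORMS = {"if", "let", "lambda", "define", "quote"}

    def check_node(node, warnings, path="top"):
        """node is (delimiter_style, children); children are atoms or nodes."""
        style, children = node
        head = children[0] if children and isinstance(children[0], str) else None
        if style == "special" and head not in SPECIAL_FORMS:
            warnings.append(f"{path}: special-form delimiters around {head!r}")
        for n, child in enumerate(children):
            if isinstance(child, tuple):
                check_node(child, warnings, f"{path}.{n}")

    warnings = []
    check_node(("special", ["foo", ("plain", ["+", "1", "2"])]), warnings)
    print(warnings)    # flags 'foo', which is not a special form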

In some cases, a delimiter might alter the grammar, say to begin and end Smalltalk syntax, or to enable operator precedence or infix notation. But to avoid making everyone learn everything, a tool could rewrite a bit of code to standardize on exactly one way of doing it, thus eliminating variants you don't care to know about. I'm in favor of viewing all the meaningful nuances as AST differences, with syntax meaning nothing in itself, except insofar as you get a different AST after a change.

I suppose two devs, Jim and Ivy, won't like each other's choices of delimiters, so remapping them would be nice. And some folks like sparing effects, while others like a baroque density of annotation. Dialing code into your comfort zone seems better than being repulsed by someone else's style. Hypothetically, when studying a particular issue, it might be nice to dial down everything except the one semantic angle you want emphasized. It seems easy enough to have a terse notation to declare what characters mean locally when they get remapped from standard notation.
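
One possible shape for that terse declaration (entirely made up): each line maps a local glyph or entity to the standard character it reads as, plus the class it's meant to imply.

    # Parse a made-up remap declaration of the form: local = standard : class
    REMAP_DECL = """
    ⟦ = ( : immutable
    ⟧ = ) : immutable
    &lobrk; = ( : immutable
    « = ( : critical-section
    » = ) : critical-section
    """

    def parse_remap(decl):
        table = {}
        for line in decl.strip().splitlines():
            local, rest = (part.strip() for part in line.split("=", 1))
            standard, implied = (part.strip() for part in rest.split(":", 1))
            table[local] = (standard, implied)
        return table

    for glyph, (std, implied) in parse_remap(REMAP_DECL).items():
        print(f"{glyph!r} reads as {std!r} and implies {implied}")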

Using utf, although it usually seems like a good idea

opens the door to homoglyph attacks, as exemplified by tools like mimic.

loot chests gone wild

I'm inclined to allow declarations that limit or forbid where things can appear in code. For example, you might forbid definitions of methods specialized on a type outside a particular scope, like "this file here". Being overly restrictive at first seems a good idea (outlaw everything). When reasoning about how code behaves, it helps to know certain interesting cases are forbidden, so you need not look for them. A histogram view of all non-ascii codepoints used would help provide oversight.

Exceptions to a general rule can be whitelisted or blacklisted. One might outlaw all non-ascii codepoints in static code other than an explicitly whitelisted set. One might also require that some codepoints appear only as character entities with unambiguous names. (You might have to declare those entities.)
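
A sketch of that kind of oversight, with the allowed set picked purely as an example: a histogram of non-ascii codepoints in a source string, plus a check that flags anything outside the whitelist.

    # Histogram of non-ascii codepoints plus a whitelist check.
    from collections import Counter

    ALLOWED = {0x27E6, 0x27E7, 0x00AB, 0x00BB}   # say ⟦ ⟧ « » are permitted

    def audit(source):
        non_ascii = [c for c in source if ord(c) > 0x7F]
        histogram = Counter(f"U+{ord(c):04X} {c!r}" for c in non_ascii)
        violations = sorted({c for c in non_ascii if ord(c) not in ALLOWED})
        return histogram, violations

    hist, bad = audit("(f \u27e6x y\u27e7 \u00abz\u00bb \u03bbw)")
    for key, count in hist.items():
        print(f"{count:3d}  {key}")
    print("outside the whitelist:", bad)    # the lambda glyph is flagged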

There are plenty of things I don't like about Unicode. But inflexibly forbidding things would encourage workarounds that increase complexity.