User loginNavigation |
minor lexical tokenization idea via character synonymsA few days ago I had a tokenization-stage lexical idea that seems not to suck after going over it several times. On the off chance at least one person has similar taste, I can kill a few minutes and describe it. Generally I don't care much about syntax, so this sort of thing seems irrelevant to me most of the time. So I'll try to phrase this for folks of similar disposition (not caring at all about nuances in trivial syntax differences). The idea applies most to a language like Lisp that over-uses tokens like parens with high frequency. I'm okay with lots of parens, but it drives a lot of people crazy, apparently hitting some threshold that says gibberish to some folks. Extreme uniformity of syntax loses an opportunity to imply useful semantics with varied appearance. Some of the latent utility in human visual system is missed by making everything look the same. A few dialects of Lisp let you substitute different characters, which are still understood as meaning the same thing, but letting you show a bit more organization. I was thinking about taking this further, where a lot of characters might act as substitutes to imply different things. The sorts of things you might want to imply might include:
A person familiar with both language and codebase might read detail into code that isn't obvious to others, but you might want to imply such extra detail by how things look. Actually checking those things were true by code analysis would be an added bonus. I'm happy using ascii alone for code, and I don't care about utf8, but it would not hurt to include a broader range of characters, especially if you planned on using a browser as a principle means of viewing code in some context. When seen in an ascii-only editor, it would be good enough to use character entities when you wanted to preserve all detail. It occurred to me that a lexical scan would have little trouble consuming both character entities and utf8 without getting confused or slowing much unless used very heavily. You'd be able to say "when you see this, it's basically the same as a left paren" but with a bit of extra associated state to imply the class of this alternate appearance. (Then later you might render that class in different ways, depending on where code will be seen or stored.) A diehard fan of old school syntax would be able to see all the variants as instances of the one-size-fits-all character assignments. But newbies would see more structure implied by use of varying lexical syntax. It seems easy to do without making code complex or slow, if you approach it at the level of tokenization, at the cost of slightly more lookahead in spots. As a side benefit, if you didn't want to support Unicode, you'd have a preferred way of forcing everything into char entity encoding when you wanted to look at plain text. Note I think this is only slightly interesting. I apologize for not wanting to discuss character set nuances in any detail. Only the lossless conversion to and from alternatives with different benefits in varying contexts is interesting to me, not the specific details. The idea of having more things to pattern match visually was the appealing part. By Rys McCusker at 2015-10-31 02:12 | LtU Forum | previous forum topic | next forum topic | other blogs | 3864 reads
|
Browse archives
Active forum topics |
Recent comments
20 weeks 1 day ago
20 weeks 1 day ago
20 weeks 1 day ago
42 weeks 2 days ago
46 weeks 4 days ago
48 weeks 1 day ago
48 weeks 1 day ago
50 weeks 6 days ago
1 year 3 weeks ago
1 year 3 weeks ago