
Programming languages with full-Unicode syntax and identifiers are surprisingly hard to do well.

I am working on a programming language. I want to fully integrate Unicode - preferably the NFC/NFD repertoire, with "canonical" decompositions only.

At the same time, I don't want the bidi algorithm to display code in a deceptive order on the page. To achieve this, though, I have to require an LTR control character in the program text after every RTL character whose following character is bidi "neutral" or "weak." Is that a mode I can set in any programming editor in wide use, or do I have to implement my own editor? Adding the LTR controls in a separate step (with a sed script or something similar) means that, until I run that extra step, what I see while editing is not the same version of the code the compiler will see.
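For concreteness, here is a minimal sketch of that separate pass in Python, using the bidi classes reported by the standard `unicodedata` module. The function name `add_ltr_marks` is made up for illustration, and I am assuming U+200E LEFT-TO-RIGHT MARK is the LTR control in question. One deliberate deviation from the rule as stated: combining marks (class NSM) are technically "weak," but the sketch leaves them attached to their base character and defers the check to whatever follows them.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (assumed to be the LTR control meant above)

# Bidi classes from UAX #9, as returned by unicodedata.bidirectional().
RTL_CLASSES = {"R", "AL"}
NEUTRAL_OR_WEAK = {"B", "S", "WS", "ON",                # neutrals
                   "EN", "ES", "ET", "AN", "CS", "BN"}  # weak, minus NSM


def add_ltr_marks(text):
    """Insert an LRM after each RTL character (plus any combining marks
    attached to it) when the next character is bidi neutral or weak."""
    out = []
    prev_rtl = False
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if prev_rtl and cls in NEUTRAL_OR_WEAK:
            out.append(LRM)
        out.append(ch)
        if cls == "NSM":
            continue  # assumption: a combining mark inherits its base's state
        prev_rtl = cls in RTL_CLASSES
    return "".join(out)


def has_required_ltr_marks(text):
    """True if the text already satisfies the rule (the pass changes nothing)."""
    return add_ltr_marks(text) == text
```

Because LRM itself has bidi class L, running the pass twice adds nothing new, so a compiler could use the same function as a check and reject any source file it would have changed.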

I also don't want "lookalike" characters used to display deceptive identifiers. Nobody can tell by looking whether 'a' is Latin or Cyrillic, or whether 'A' is Latin or Greek, and I don't want programmers tearing their hair out trying to understand why the variable they thought they just initialized is holding some value set by malevolent code somewhere out of sight, or why a perfectly innocent "Jane_Doe" keeps getting blamed for the fraudulent transactions of someone else whose name appears to be spelled exactly the same. The most straightforward precaution is to ban identifiers that contain alphabetic characters from more than one script, but that seems a lot like using a sledgehammer to kill cockroaches. A less restrictive rule would allow mixing scripts, but only if you avoid letters that are confusable between those scripts. For example, you could mix Latin and Cyrillic as long as you don't use any character that looks like "a" (or any other character that could be either), or you could mix Latin and Greek as long as you don't use any character that looks like "A" (or "B", or "X", or any other character that could be either). But this makes the identifier syntax rules complicated to check and hard to express.
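To show how the two rules differ, here is a rough sketch in Python. Everything in it is an assumption for illustration: Python's `unicodedata` module has no script property, so `script_of` guesses the script from the first word of the character name, and `LOOKALIKE_GROUPS` is a tiny hand-picked sample, not real confusables data. A real checker would need the actual Unicode script and confusables tables.

```python
import unicodedata

# Hypothetical, hand-picked sample of cross-script lookalikes.
LOOKALIKE_GROUPS = [
    "a\u0430",        # Latin a, Cyrillic а
    "A\u0391\u0410",  # Latin A, Greek Α, Cyrillic А
    "B\u0392\u0412",  # Latin B, Greek Β, Cyrillic В
    "X\u03a7\u0425",  # Latin X, Greek Χ, Cyrillic Х
]


def script_of(ch):
    """Rough script guess: first word of the character name
    (e.g. 'LATIN', 'CYRILLIC', 'GREEK'); None for non-letters."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_used(ident):
    """Set of scripts of the alphabetic characters in an identifier."""
    return {s for s in map(script_of, ident) if s is not None}


def is_single_script(ident):
    """The strict rule: letters from at most one script."""
    return len(scripts_used(ident)) <= 1


def lookalike_scripts(ch):
    """Scripts that contain a lookalike of ch, per the sample table."""
    found = set()
    for group in LOOKALIKE_GROUPS:
        if ch in group:
            found.update(script_of(c) for c in group)
    return found


def is_safely_mixed(ident):
    """The looser rule: scripts may be mixed unless some letter has a
    lookalike in another script that the identifier also uses."""
    used = scripts_used(ident)
    if len(used) <= 1:
        return True
    for ch in ident:
        if (lookalike_scripts(ch) - {script_of(ch)}) & used:
            return False
    return True
```

With this sample table, "Jane_Doe" in pure Latin passes both rules, the same name with a Cyrillic а swapped in fails both, and an identifier that mixes Latin "d" with Cyrillic "ж" fails the strict rule but passes the looser one. Neither rule catches an identifier written entirely in one script that happens to look like a word in another.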

These are just two of the *MANY* issues that need to be addressed to allow a fully Unicode-enabled programming language that's not a security or usability disaster.

I used to hate Unicode a lot more than I do now. These days I recognize it as a massive hairball, but I'm not really angry about it any more; it's just one more piece of legacy design that clearly was NOT intended for the kind of use I want to make of it. So it's massively difficult to use, leaky, flabby, illogical, promotes insecurity worse than a language without array bounds checking, etc., but I guess I've just come to accept it, and I'm finally getting around to trying to overcome the problems and do something worthwhile with it anyway.