Programming languages with full-Unicode syntax and identifiers are surprisingly hard to do well.

I am working on a programming language. I want to fully integrate unicode - preferably the NFC/NFD repertoire, with "canonical" decompositions only.

At the same time I don't want the bidi algorithm to be used to display code in a deceptive order on the page. But, in order to achieve this, I have to require LTR control characters in the program text after every RTL character where the following character is bidi "neutral" or "weak." Is that a mode I can set in any programming editor in wide use, or do I have to implement my own editor? Adding LTR controls in a separate step (like with a sed script or something) means there's an extra step before I can see, while editing, the same version of the code the compiler will be seeing.

At the same time I don't want "lookalike" characters used to display deceptive identifiers. Nobody can tell by looking whether 'a' is Latin or Cyrillic, or whether 'A' is Latin or Greek, and I don't want programmers tearing their hair out trying to understand why the variable they thought they just initialized is holding some value set by malevolent code somewhere out of sight, or why a perfectly innocent "Jane_Doe" keeps getting blamed for the fraudulent transactions of someone else whose name appears to be spelled exactly the same. The most straightforward precaution here is to ban identifiers that contain alphabetic characters from more than one script, but it seems a lot like using a sledgehammer to kill cockroaches. A less restrictive rule would allow mixing scripts, but not if you use any letters which are confusable between those scripts - for example you could mix Latin and Cyrillic if you do it without using any character that looks like "a" (or other characters that could be either), or you could mix Latin and Greek if you do it without using any character that looks like "A" (or "B", or "X", or other characters that could be either). But this makes the identifier syntax rules complicated to check and hard to express concisely.
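As a rough illustration of how the sledgehammer rule might be checked (this is a stdlib-only Python approximation that infers each letter's script from the first word of its Unicode character name; a real implementation would use the Scripts.txt property data, and the less restrictive rule would additionally need the confusables.txt data):

```python
import unicodedata

def letter_scripts(identifier):
    """Collect the (approximate) scripts of the letters in an identifier.

    Uses the first word of each character's Unicode name as a stand-in
    for its Script property, e.g. "CYRILLIC SMALL LETTER A" -> CYRILLIC.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts

def is_mixed_script(identifier):
    """Flag identifiers that draw letters from more than one script."""
    return len(letter_scripts(identifier)) > 1
```

Under this check, "paypal" spelled with a Cyrillic 'а' gets flagged while the all-Latin spelling passes.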

Just two of the *MANY* issues that need to be addressed in order to allow a fully unicode-enabled programming language that's not a security or usability disaster.

I used to hate Unicode a lot more than I still do. These days I recognize it as a massive hairball, but I'm not really angry about it any more; it's just one more piece of legacy design that clearly was NOT intended for the kind of use I want to make of it. So it's massively difficult to use, leaky, flabby, illogical, promotes insecurity worse than a language without array bounds checking, etc, but I guess I've just come to accept it and I'm finally getting around to trying to overcome the problems and try to do something worthwhile with it anyway.


Unicode Strings and point mutation.

Most people implementing Unicode in programming languages treat Unicode codepoints as characters, the same way we treated ASCII codepoints. But changing capitalization, changing a diacritic, etc, changes the length in codepoints of a string. Conceptually these are point mutations - they correspond to changing a single character on a page of text and shouldn't require re-copying the whole page, invalidating the indices of other characters on the page, invalidating references to the page, etc. There are a number of good, efficient, algorithms for doing things with strings which rely on point mutations that don't invalidate other references to the string. And we mostly don't dare attempt to use them any more.

In Unicode - particularly in variable-length encodings of Unicode such as utf-8 - they are length changing mutations, and if we insist on a representation of strings as contiguous arrays of characters they require re-copying the whole string, invalidating references, potentially reallocating, etc. So we've invented "string iterators" and a number of other things to try to abstract the location of characters. We've abandoned numerous GOOD, EFFICIENT algorithms that relied on point mutations which did not invalidate other references in the same string. This was a mistake.

I think string iterators, while useful, are an abstraction at the wrong level. We abstracted string references because we were dealing with our failure to correct our abstraction of string representation. The right answer isn't refusing to touch string references and assuming any mutation invalidates the whole string. The right answer is changing our string representation to preserve valuable point mutation properties that make handling strings more efficient.

So I think that a proper string representation has characters that are a uniform width. Strings ought to use a 32-bit character encoding, with entities of 0x10FFFF or less referring to complete characters expressible as Unicode-1 codepoints and entities of 0x110000 or higher being keys into a dictionary of complete characters expressed as unicode-2 or higher. Capitalization, diacritic change, etc, become point mutations again and references to particular locations within a string cease to be poisonous.

Strings are limited to use no more than 2^31-or-so distinct characters, without restriction as to how many codepoints each character may be or how many instances of each character may appear in the string.
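A minimal Python sketch of this representation (class and method names here are hypothetical): each slot holds a uniform-width value, values at or above 0x110000 are keys into the string's own dictionary of multi-codepoint characters, and replacing one character is an O(1) point mutation that leaves every other index and reference valid.

```python
class WideString:
    """Uniform-width string: one slot per character, with a dictionary
    for characters that need more than one codepoint."""
    BASE = 0x110000  # values >= BASE are dictionary keys, not codepoints

    def __init__(self, chars):
        self.table = {}   # key -> multi-codepoint character
        self.rev = {}     # multi-codepoint character -> key
        self.slots = [self._intern(c) for c in chars]

    def _intern(self, c):
        if len(c) == 1:
            return ord(c)          # fits in a plain codepoint
        if c not in self.rev:
            key = self.BASE + len(self.table)
            self.table[key] = c
            self.rev[c] = key
        return self.rev[c]

    def set_char(self, i, c):
        """Point mutation: O(1), no copying, other indices stay valid."""
        self.slots[i] = self._intern(c)

    def __len__(self):
        return len(self.slots)     # length in characters, not codepoints

    def __str__(self):
        return "".join(self.table[s] if s >= self.BASE else chr(s)
                       for s in self.slots)
```

Turning a plain "e" into "e" + COMBINING ACUTE ACCENT changes neither the string's length nor the position of any other character.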

This leaves a question about whether it's worthwhile to have a fixed-width primitive "character" datatype separate from strings. Offhand I'd say it's not. We can use the word "character" to refer to strings of length one, the same way we use "empty string" to refer to strings of length zero.

A primitive (non pointer containing, requiring no GC tracing) character type might be worthwhile but would necessarily be limited in the character repertoire it could contain. If we trace the space requirements of our string representation (64-bit pointer to a buffer of keys, at least one key, another pointer for a dictionary of entities, but dictionary left empty (pointer NUL) if the key is a member of unicode-1) we get 160 bits, which is enough space to store 7 codepoints. Unicode-7 is a good character repertoire, but the principle of least surprise says that if strings are supposed to be sequences of characters, I don't want to have a character type that can't be used to contain any character that can appear in a string. We'd get programmers trying to use it to hold entities from a string and then getting errors because they tried to store a unicode-12 character in it.

Unicode is OK I guess

Ray, it's hard to tell what you're arguing for and against here.

Unicode does a great job at defining code-points and binary formats for the same.

It does a pretty horrible job at almost everything else.

There are often multiple correct (but different) ways to encode the exact same grapheme. And there are often many nearly identical graphemes that differ only in their designated Unicode region.

While there are many code-points, often it takes a sequence of 2, 3, 4, 5, 6, or more code-points (each conceptually a 32 bit value) to represent a single grapheme. Even worse, the order of those code-points within the grapheme is potentially variable.
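For example, in Python (which indexes strings by code point), one family emoji that renders as a single grapheme is five code points:

```python
# MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + GIRL:
# one grapheme on screen, five code points in the string.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # counts code points, not graphemes
```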

But as long as you're working in only one sub-set of Unicode (e.g. Chinese) and you're not displaying, searching, comparing, or sorting text, Unicode is OK I guess.


Not really "for or against" here...more like "how to deal with."

I've ranted angrily about Unicode in the past. I'm trying to skip the ranting and the anger now and just do something with it. I may fail.

The thing about using Unicode in a programming language is, why are you doing it? It's because you want to make something that's useful for people without conditioning whether it can be useful to them on what language they speak.

But Unicode is so polymorphic and complex that I don't really know whether that's possible. There are so many ways for things to look the same and be meaningfully different, and so many ways for things that are the same to LOOK meaningfully different, and so many "gotchas" that can mislead people, that it seems like the absolute ideal format for creating chaos and deliberately misleading code.

"Minimum effort" unicode just treats it as a huge additional stock of codepoints. Someone names a variable in Arabic, then when it gets displayed the rendering direction switches through all the following bidi-neutral and bidi-weak "syntax" characters, switching back again when the renderer displays the name of a routine that someone else named using Greek characters, and you get an expression that does not mean what *either* of these language users thinks it looks like it means and which is not in the sequence that *either* of these language users thinks it looks like it's in, and that's an attack surface and a failure mode.

Or someone accepts non-normalized Unicode source code without calling it a syntax error and then, as you point out, there are dozens or hundreds of ways to write what is canonically the same character. Comparisons fail and length checks are inconsistent, and that's another failure mode.

Or someone accepts bidi formatting characters as white-space, and then an attacker writes something that relies on viewers being unable to tell the difference between "This" and "sihT" because one of them is rendered in a right-to-left override. Or worse, inadvertent confusion may arise because "dog" is the same variable even when it looks like "god."
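The normalization failure mode is easy to reproduce with nothing but the Python standard library: two canonically equivalent spellings of the same character disagree on equality and length until they are normalized.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# Canonically the same character, but naive comparison and length disagree.
assert precomposed != decomposed
assert (len(precomposed), len(decomposed)) == (1, 2)

# Normalizing to a canonical form (NFC here) restores consistency.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```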

And that's before we even get to all the characters that look the same and are different, like uppercase Greek Alpha vs. uppercase Latin A.

Trying to avoid all these semantic pitfalls and minimize all these attack surfaces, without falling into the "this programming language only seems useful and reasonable to people who speak and write one specific kind of human language" trap, is the goal.

Totally agree

The BitC effort had a significant security focus, but since we wanted programmers around the world to feel at home we turned to unicode for identifiers. With a bit of a wrinkle, since the language distinguishes lexically (and perhaps unnecessarily) between operator identifiers and regular identifiers.

A number of security implications arising from confusable glyphs were known to us from the beginning, but they aren't really fixable if you want to support a global language base for identifiers. The alternative of sticking to ASCII or ISO-LATIN-1 wasn't attractive, because it kept us from linking to languages that did not share this limitation.

Given the polymorphism you identify, code points are about as close to a replacement for char as we're likely to see. Given the recent context of Shift-JIS, the original Unicode developers wanted very strongly to have a straightforward transcoding from existing character systems wherever possible, which led to many of the lookalike glyphs. When they got to the far east languages, international politics caused many lookalikes. So it makes sense how we got here, but it isn't a particularly good place to be.

Go chose the worst of both worlds, providing a significant degree of unicode identifier compatibility while managing to deviate from UAX31 for no apparent reason - perhaps because the validation tables were too large?

It's a hot mess. DNS hasn't been broken quite yet, but my sense is that's only a matter of time. With HTTP resource paths being international the value of holding DNS to ASCII seems small. Basically, that ship sailed.

Simple things should be simple. Unfortunately, international language encodings aren't simple.

DNS and unicode

Have you ever looked at how international encoding of DNS names is done in URLs? It uses Punycode, and it's a disaster.

Here's a good starting point to read up on this:

Identifiers and bidi.

Programming Language expressions are visually formatted, and the Unicode Bidi algorithm is not compatible with our visual layout conventions.

While it's reasonable to have a different visual layout, for example indicating nesting level with right-margin indentation rather than left, it is not reasonable to have a visual layout whose criterion changes at every line boundary depending on what was displayed on the previous line. When that happens, right-margin boundaries don't line up to indicate that something is at the same nesting level as the lines above and below it, the way lines that share a left-margin boundary do.

Visual layout is also used to separate comments from code, with line comments usually on the right side and code on the left. While again it is reasonable to have a different layout with comments on the left and code on the right, it is not reasonable to switch convention from one line to the next depending on what was displayed on the previous line.

Finally unicode designates most "syntax" characters such as mathematical operators and symbols as being bidi-weak or bidi-neutral, and that means that sequences of them display in either direction depending on the direction of the most recent bidi-strong character displayed. This is reasonable for natural languages because natural languages have redundant syntax that makes it possible (usually) to instantly identify gibberish when the sequence is reversed. But it's poisonous for formal languages, because formal languages have dense syntaxes and reversing a subsequence often makes a perfectly valid meaningful statement whose meaning is wrong.

So we have a vocabulary of strong two-dimensional semantic hints that are disrupted by one-dimensional bidi changes, and we have sequence-based semantics that are often meaningful with incorrect meanings when subsequences are reversed.

Bidi should operate where it is applicable to sequences that are meaningful in human languages. The spelling of identifiers, the contents of string literals, and the contents of comments SHOULD definitely obey the bidi rules.

But multiple possible rendering sequences for syntax and keywords of code and the visual placement of comments are unacceptable sources of visual ambiguity and confusion. These things must not be influenced by the spelling of identifiers, the contents of literal strings, and the contents of comments.

Unicode has recommendations for programming language identifiers, in terms of character derived properties XID_START and XID_CONTINUE. The recommendation is that an identifier should consist of an XID_START character followed by zero or more XID_CONTINUE characters. Because both XID_START and XID_CONTINUE contain characters that are Left-to-right, Right-to-left, and Bidi neutral or weak, this would cause different identifiers to have different effects on the visual arrangement of surrounding code.
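As it happens, Python's own identifier syntax is essentially this recommendation (an XID_START character or underscore, followed by XID_CONTINUE characters), so str.isidentifier() is a convenient way to probe what the default rule admits; note that it accepts strong-LTR and strong-RTL letters alike:

```python
# Latin and Hebrew letters are both XID_START, so both pass;
# a leading digit or an ASCII hyphen fails.
samples = ["na\u00EFve", "\u05D0lef", "x1", "1x", "a-b"]
for ident in samples:
    print(ident, ident.isidentifier())
```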

I propose modifying the recommendation by requiring an LRM (Unicode LEFT-TO-RIGHT MARK) as a prefix for identifiers that do not otherwise start with a left-to-right character and as a suffix for identifiers which do not otherwise end with one. LRM is not otherwise allowed in identifiers.

This requires LRMs in identifiers in some cases where they are redundant (for example identifiers composed of bidi-neutral characters), but this form of the rule is preferred because LRMs are invisible except by their effect on rendering, and strings of bidi-neutral and bidi-weak characters that require complex state to render require complex state to place and check. There's no circumstance where we want an identifier to look visually different depending on the rendering state in which it's found, and no circumstance in which we want the spelling of an identifier to propagate an RTL rendering state to anything outside the identifier, so the simplest rule is best.

I don't expect programmers to consistently insert or visually check for invisible characters in their code, and any allowed variation in the placement of these invisible characters is another potential for visual confusion in the sense of things that look the same and are different. For both reasons handling this requirement must be automated. This rule requires minimal state and simplest code for the compiler/interpreter to check identifier syntax and requires minimal state and simplest code for IDEs to correctly and consistently add the LRMs to identifiers where required.
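The rule as proposed is mechanical enough to automate in a few lines. A Python sketch using the stdlib's bidi classes ('L' is the strong left-to-right class) might look like:

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK

def add_lrms(identifier):
    """Prefix/suffix an LRM when the identifier does not already
    start/end with a strong left-to-right character. Assumes a
    non-empty identifier containing no LRM of its own."""
    prefix = LRM if unicodedata.bidirectional(identifier[0]) != "L" else ""
    suffix = LRM if unicodedata.bidirectional(identifier[-1]) != "L" else ""
    return prefix + identifier + suffix
```

An all-Latin identifier passes through untouched; an all-Hebrew or all-digit one picks up an LRM on both sides.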

Have you looked at this? -- Unicode Identifier and Pattern Syntax

Well yeah...

If I hadn't looked at that I wouldn't have known the unicode recommendations for identifier syntax that I was responding to.

My point is that those rules fail to create identifiers that reliably identify things (by looking like what they are and being what they look like) when read by humans, and allow expressions to be rearranged (so that they also aren't what they look like, or look like what they aren't, when read by humans) just because an identifier has been spelled differently.

And that therefore TR31 should not be followed as-is.



Maybe send your comments to the Unicode folks?

Perhaps you can send your comments to the Unicode Standards folks so that TR31 can be improved? Most PL designers are not Unicode experts and they either stick to an ascii subset or do a half-hearted job (e.g. Go).

That would be too sensible.

Your reply makes me realize that, objectively, that would be the sensible thing to do.

Problem is I can't look at their work product (the standard itself) and think they'd actually give a shit about engineering feedback. They literally passed over EVERY chance to make a simple, easy, usable standard with few rules, efficient and simple handling, and unambiguous forms. They even passed over every chance to remain silent about something there was disagreement on.

It's more like looking at the interminable self-important dribblings of L'Academie Francaise than it is like looking at any kind of engineering document.

When people have that mindset I just don't expect any positive result from interacting with them. No matter what anyone does, if they are involved then it will result in more hair rather than less.

A little unfair

Given the international politics of the effort, some of what you seek just wasn't possible. Unicode wasn't going to happen at all without being compromised by compromise.

The originators usually weren't unaware of these issues. They felt that some approach to writing mixed-language text that the stakeholders would actually agree to support was better than none.

30 years later, with the benefit of experience from Unicode, I'm not entirely convinced that the politics would allow us to do any better today. There were (and probably still are) cases where identical glyph shapes with subtly different horizontal expansion were held separate because one country was unwilling to accept any code point from another country's code point space. This was true even when the historical ancestry of the glyphs showed that they really were the same in historical terms.

It's worth noting, however, that if those glyphs had been assigned common code points, we'd have been left needing some kind of "begin/end language X" marker to deal with them, and Unicode-enabled typefaces might well have been made impractical. At which point the goal of multilingual text would have been defeated.

Why is it a concern for language design?

Unicode isn't really worth getting upset over. Language is messy, and the ways that we have chosen to record it are equally messy. I doubt there exists a perfectly good representation for text.

If we skip unicode for a second and consider the same set of issues (readability, deceptive formatting etc) then they have always existed when programming in an ascii-only language. Programmers are quite used to the problems around 0 and O, or 1l|. Normally they are solved by picking an appropriate typeface to make the differences more obvious.

There is a similar split in responsibility here - the visual presentation of code depends on the editor used to view it. Highlighting ambiguous sequences, choosing explicit rendering, and other approaches to solving these problems are more tractable in the context of building an editor than they are in designing a language. The set of rules to encode workarounds to these problems is very complex. The set of cases for an editor to handle is simpler, or can be avoided by providing forms of UI that don't have a good match when parsing a language, e.g. switching from rendering unicode as glyphs to hex-codes or character names.

Why try to solve difficult problems in the very place their difficulty arises from, rather than trust in a separation of concerns and allow them to be solved in an easier context?

PS Apologies for doing the classic stackoverflow thing of "don't solve X do Y instead" ... but doing Y is easier :)

And now we're back to the 'editor-of-choice' religious wars....

Code is for communicating among humans, as much as it is for communicating to computers. Code doesn't just exist in IDEs where we get to make our own rules about how things display. It exists in office documents, academic papers, thesis papers, security reports, and PDF files. I don't think it's reasonable to go to each and every display engine that people use and hack them all to enforce consistent display of code.

If I'm going to hack a programming editor, I'm not going to force it to break unicode standards when displaying a language. I'm going to hack it to insert the formatting codes needed for that code to display unambiguously when the code is pasted into ANY standard-conforming system that displays text, including the editor.

And that means the compiler accepts those formatting codes in source code and checks that they're correct. In fact if the rules for those codes are kept straightforward (like setting off identifiers with LRMs etc) it's fairly easy to have an option for the compiler to "write back" the source with correct formatting inserted.

I think the correct way of dealing with unicode's visual ambiguity requires the language to rule it out as a syntax error.

I can shut up about it if nobody else is interested, but as long as known sources of visual ambiguity are not formally recognized as programming language syntax errors, they will continue to rear their ugly heads in bug reports and security advisories.

Mixing formal- and display- semantics never ends well

Programming languages rely on the property that what we have written is what we see. Unicode is designed to allow a mapping from written form to display form. If you attempt to constrain the behaviour of unicode to be compatible with what we need from a programming language then you will end up with something unknown, and more complex.

Considering your emotional response to the intricacies and edge-cases in unicode, would you inflict something more subtle on the programmers of the world?

Anyway, I wouldn't take my objections too seriously, perhaps you will find a nice solution. But before you go too far down that rabbit hole, have you seen and the code in the linked ?

It's not unicode's behavior I'm trying to restrict.

Unicode's behavior is out of my hands. I can't constrain it to be compatible with what we need from a programming language. Especially when the programs will exist as a sequence of codepoints and will be rendered by literally any of the hundreds or thousands of rendering engines out there.

For me to wave my hands and say "Programming language code should be rendered in a special mode where identifiers, comments, whitespace, and string contents are all treated as isolates for bidi processing, so that when they change it doesn't reorder the rendering of anything else" would be silly. Those hundreds, or thousands, of implementation teams, even if they agree with me that it's a good idea, would not find it a high priority. Even if some of them implement it, their efforts at it would differ from one another enough to undermine the results. And it would introduce yet another mode in their display engines with rules about switching into and out of that mode that would not quite match up.

Unicode behavior at this point simply is what it is and calling for further modes and complications in the rendering rules would be adding hair to a hairball.

What I'm looking at is constraining programming language syntax so that correct programs reliably AVOID INVOKING problematic behaviors of standard unicode rendering.

I can say instead, "if identifiers, string contents, whitespace, and comments conform to these syntactic rules (whose application is BTW easy to check, correct, and automate), then they will be treated as bidi isolates by Unicode rendering, and therefore spelling them differently will not RESULT in your code being reordered by Unicode's standard behavior."


I can say instead, "if identifiers, string contents, whitespace, and comments conform to these syntactic rules (whose application is BTW easy to check, correct, and automate), then they will be treated as bidi isolates by Unicode rendering, and therefore spelling them differently will not RESULT in your code being reordered by Unicode's standard behavior."

Now it is easier to understand what you want to achieve. BTW, leading with or as examples would have been helpful for understanding.

Why add the LRM as a prefix and suffix in the regions that you want to allow bidi behaviour - would it not be simpler to say every span that is *not* an identifier / string / comment MUST start with an LRM? i.e. everything that will define the code semantics after lexing must be in a well-defined direction.

Simpler yes but also more intrusive.

Ideally 99% or more of programs (those which do not poke this problem now) should already be compliant with the new recommendations. Mandating insertion of LRMs into programs that already render in a consistent way is very heavy-handed.

Remember the thing I want to prevent is already regarded as something like anathema. For fear of confusion and bugs that Unicode bidi rendering can cause, people don't get good use out of Unicode even when they have it available. Therefore there are very few *existing* programs that need any additions, and mandating the insertion of LRMs into the source code of perfectly valid programs whose rendering would not be visibly changed by them is obnoxious.

And as it turns out, wrong. :-( Here I must give thanks to a person who responded privately to this topic but who wishes to remain anonymous. My initial suggestion was mistaken.

Using LRMs as I suggested above would prevent the spelling of an identifier from changing the rendering order of things outside the identifier in an LTR context, but would have side effects on the rendering order of leading and trailing bidi-neutral and bidi-weak characters within the identifier which probably aren't consistent with the intent of anyone who'd be using an RTL character in an identifier and would have side effects on the rendering order of surrounding text in an RTL context (such as a comment in an RTL language that mentions the variable).

Bidi-weak characters such as ASCII digits (bidi class EN) render in LTR order with respect to each other even in an RTL context, but in an LTR context leading digits would go on the left side and in an RTL context they'd go on the right side. Bidi-neutral characters, which cover pretty much everything else allowed in identifiers that's not actually a letter (examples include underscore, spacing accents, etc), would additionally be rendered in a different sequence relative to each other.

The suggested revision would be to use U+2068 'First Strong Isolate' at the beginning of an identifier and U+2069 'Pop Directional Isolate' at the end. This makes the display of the identifier and the display of the text surrounding it nearly independent of one another's content. (Nearly independent but not completely, because the text inside an FSI/PDI pair will be rendered in a direction inherited from the document's top level if it contains no bidi-strong characters.)
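In code form the revised rule is just a wrapper (U+2069 POP DIRECTIONAL ISOLATE is the terminator that UAX #9 pairs with FSI):

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, FSI's matching terminator

def isolate(identifier):
    """Wrap an identifier so the bidi algorithm picks its direction
    from its own first strong character and never lets its contents
    reorder the surrounding text."""
    return FSI + identifier + PDI
```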

The Right Answer

I think I found the right answer.

The Unicode bidi algorithm is applied separately to every paragraph. What we want is for bidi direction changes to be applied to individual tokens separately, so making tokens technically 'paragraphs' would do the right thing.

Paragraphs are separated by codepoints with the bidi class B (Paragraph Separator). Most 'B' codepoints also force a line break, which we don't want. But there are three control characters from the ASCII range, 0x1c, 0x1d, and 0x1e, that have the 'B' bidi class and do not create line breaks.

These characters are 'File Separator', 'Group Separator', and 'Record Separator' in ASCII, but lack descriptive Unicode names of their own (their Unicode name aliases are INFORMATION SEPARATOR FOUR, THREE, and TWO).

Testing in three different text rendering systems consistently gives good results if a 'Group Separator' character is inserted at token boundaries - the tokens are rendered in the same visual location regardless of how they or surrounding tokens are spelled, but the characters within the tokens are located according to the bidi rules.

Testing with GCC shows it treats the 'Group Separator' U+001D as whitespace, which is not problematic if it is encountered at a token boundary. It is a nonprinting, nonspacing character, so if encountered elsewhere it *CREATES* a token boundary with no visual clue that a token boundary exists.

For my next trick I will hack an emacs mode and add 'Group Separator Insertion at token boundaries' to the default syntax coloring (or maybe 'electric autoindent' since syntax coloring doesn't actually insert codepoints) behavior.

Ideally it would be 'Group Separator Insertion at the boundaries of any token containing at least one RTL character' meaning no change to the vast majority of existing code. But that might be a couple more tricks down the line.
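A sketch of that eventual behavior, assuming the lexer hands over a token list ('R', 'AL', and 'AN' are the right-to-left bidi classes in unicodedata's terms):

```python
import unicodedata

GS = "\u001D"  # ASCII Group Separator: bidi class B, nonprinting

def contains_rtl(token):
    """True if any character in the token has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN")
               for ch in token)

def insert_separators(tokens):
    """Bracket only the tokens containing RTL characters with Group
    Separators, leaving purely left-to-right code byte-for-byte unchanged."""
    return "".join(GS + tok + GS if contains_rtl(tok) else tok
                   for tok in tokens)
```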


Duplicate comment deleted.

First time I attempted to post this I got an error message that said 'try again in a few minutes.'

So a few minutes later I tried again, and the post went up. And after that there was one copy of the post showing.

But today there are two. I don't know how long the dupe has been showing here but I'm deleting one of them.

About identifiers...

In the course of optimizing my tokenizer I discovered a strange thing. Explicitly defining what constitutes a legal identifier isn't technically necessary.

Syntax is predefined. It consists of semantic syntax like operators, symbols, delimiters, data constant values, etc, and asemic syntax like whitespace, comments, bidi separators to prevent any token from affecting the rendering or placement of any other token, etc.

All syntax has the property that it cannot be a substring of a keyword or identifier. If a complete syntactic form is read, there is no possible prefix or postfix of it which changes the fact that a syntactic form begins at its location. The next character is either part of a longer syntactic form, or the first character of a token following the syntactic form.

Keywords are also a predefined set. But keywords have a different property: because they are made of a subset of the characters legal in identifiers, and the tokenizer makes single tokens from sequences of such characters of maximal length, keywords and identifiers can't immediately follow each other in either order. At the very least they must be separated by some asemic syntax like whitespace.

Some of these properties are specific to the language I've defined, but they seem reasonable to me, and reasonably common.

These lead to a peculiar parsing strategy: My parser is now just scanning for syntactic forms. Any nonsyntax characters found between syntactic forms MUST be either exactly one keyword or exactly one identifier. There is no other possibility. So after a syntactic form I track keyword lexing for as long as a keyword could possibly be completed, and if the lexer accepts characters past the end of a keyword or just never completes one before the beginning of the next complete syntax token, then those characters must be an identifier.

The syntax of identifiers is therefore literally any substring found between syntax forms which is not a keyword. This allows some damned peculiar identifiers, including strings that could have been the beginning of a syntax form, strings of emoji, and other abstract symbols.

This is both amusing and interesting, but impractical. I will need to explicitly add back some checks to limit the set of identifiers for three reasons. First, the first version of the language probably won't be the last. I will need to add both syntax and keywords in the future and it would be awkward to convert existing identifiers into either. Second, I need to impose some structure on identifiers to prevent deceptive identifiers that contain invisible characters or otherwise look like something they aren't. Third, seeing an identifier which contains the initial few characters of an incomplete syntax form is confusing and probably shouldn't be allowed. It doesn't happen often because most initial substrings are complete syntax forms after a single character, but when it happens it's still confusing.

Anyway the discovery that taking time to check identifier syntax isn't strictly necessary is interesting and amusing so I thought I'd share it. All that really needs to be done to detect an identifier is to check any maximal nonsyntax substring and find that it is not a keyword.
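A toy sketch of that strategy (the syntax character set and keyword list here are invented for illustration, not the language's real ones):

```python
import re

# Everything in this class is "syntax": operators, delimiters, whitespace.
SYNTAX = re.compile(r"[ \t\n()+\-*/=;]+")
KEYWORDS = {"if", "else", "while", "let"}

def tokenize(src):
    """Scan for syntax forms; any maximal nonsyntax substring between
    them must be exactly one keyword or exactly one identifier."""
    tokens, pos = [], 0

    def classify(word):
        return ("kw" if word in KEYWORDS else "ident", word)

    for m in SYNTAX.finditer(src):
        if m.start() > pos:
            tokens.append(classify(src[pos:m.start()]))
        tokens.append(("syntax", m.group()))
        pos = m.end()
    if pos < len(src):
        tokens.append(classify(src[pos:]))
    return tokens
```

Note there is no identifier grammar anywhere: whatever survives the syntax scan and the keyword check is an identifier by elimination.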